- Fast high-level web crawling & scraping framework for Python.
- Service for running Scrapy spiders.
- Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.
- Extract data from any website in seconds.
- Web Scraping API.
Easy web scraping with Scrapy (2019)
A guide to Web Scraping without getting blocked in 2020
- Distributed web crawler admin platform for spiders management regardless of languages and frameworks.
- Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.
- Tool for scraping job websites, and filtering and reviewing the job listings.
- Tiny command-line utility to download media contents (videos, audios, images) from the Web.
Universal Reddit Scraper
- Scrape Subreddits, Redditors, and comments on posts. A command-line tool written in Python.
- Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js.
Ask HN: Best practices for ethical web scraping? (2020)
- Programmatically collect normalized news from (almost) any website.
- Simple and easy-to-use scraper and crawler in Go.
- Elegant Scraper and Crawler Framework for Golang.
Python Web Scraping with Virtual Private Networks (2020)
- Flask code to deploy an API that pulls structured data from online news articles.
- Scrape websites for text by CSS selector.
List all the broken links on your website
Creating a Robust, Reusable Link-Checker (2020)
- Small library for extracting rich content from urls.
- Easy and cheap way to scrape the internet.
Website Sitemap Parser
- Download URLs and verify the contents against a publicly recorded cryptographic log.
- Yet another URL library.
- Web Scraping, Data Extraction and Automation.
- Pure-C HTML5 parser.
What is a present-day web scraping in 2020?
- Web scraping. Data extraction tools
Awesome Web Scraping
- Open repository of web crawl data that can be accessed and analyzed by anyone.
Analysing Petabytes of Websites using Common Crawl (2017)
Cognito Common Crawl
- Search the Common Crawl using lambda functions.
- All in One Scraping API. Rotating Proxies. Headless Chrome.
Django Dynamic Scraper
- Creating Scrapy scrapers via the Django admin interface.
- Smart, Automatic, Fast and Lightweight Web Scraper for Python.
- Dead-simple crawler which focuses on ease of use and speed. Returns a list of all URLs of a web page.
Scraping News and Articles From Public APIs with Python (2020)
- Simple and affordable web scraping API.
- Tiny Python web crawler.
Booking site web scraper
- Downloads all of the accommodations for the chosen country and saves them in a file.
Reddit Media Downloader
- Scrapes Reddit to download media of your choice.
Web scraping with JS (2020)
Web scraping that just works with OpenFaaS with Puppeteer (2020)
What Happened to XPath? (2020)
- Turn web content into useful data.
- Library for extracting embedded metadata from HTML markup.
Introduction to Scraping in Python (2020)
Test driving a HackerNews scraper with Node.js (2020)
- Web browser that's built for scraping.
- Turns every website into an open API. Access any dataset on the world wide web.
- Simple HTML parser that enables search for nodes using CSS selectors.
NYT Vote Scraper
- Scrapes the NYT Votes Remaining Page JSON and commits it back to this repo. Nice use of GitHub actions for git scraping.
- Scrapes an Instagram user's photos and videos.
- Get notified as soon as your next CPU, GPU, or game console is in stock.
Guide on preventing Website Scraping
Bibliographies of the Bibliometric-enhanced Information Retrieval workshops and related other workshops
- Open source, easy-to-use news crawler that extracts structured information from almost any news website.
Web crawling with Python (2020)
- Scrape data from websites using Open Graph, HTML metadata & fallbacks.
- Download pictures (or videos) along with their captions and other metadata from Instagram.
- Manage URLs and scrape main text and metadata.
- Find the publication date of web pages.
Filtering links to gather texts on the web (2020)
Evaluating scraping and text extraction tools for Python (2020)
Using sitemaps to crawl websites (2019)
Evaluation of date extraction tools for Python (2020)
- Tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages.
- Module for automatic summarization of text documents and HTML pages.
- Write your own web crawler/scraper as a state machine in Rust.
- Fast, highly configurable, cloud native dark web crawler.
- Makes it easy to scrape a website with R.
Scraping HN content with declarative programming
- Social networking service scraper in Python.
- Framework for rapidly archiving a large number of URLs with little overhead.
- Rust library to extract useful data from HTML documents, suitable for web scraping.
- Provides access to a variety of scraper scripts for most commonly used machine learning and data science domains.
Visual scraping with Elixir and Crawly (2021)
Headless Chrome Crawler
- Distributed crawler powered by Headless Chrome.
Tips for reliable web automation and scraping selectors (2021)
Web Crawler for scraping Financial data
Web Scraping 101 with Python (2021)
- No-code Web Automation Tool. Automation Tool to Extract Data From Any Website.
Scaling up a Serverless Web Crawler and Search Engine (2021)
- List of HTTP user-agents used by robots, crawlers, and spiders, in a single JSON file.
- Web crawler for Go.
- Implementation of an API, which allows you to scrape Google, Bing, Yandex, and Qwant.
- Scala library for scraping content from HTML pages.
Next.js Web Scraper Playground
- Build and test your own web scraper APIs with Next.js API Routes and cheerio.
- Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments).
- Visual Sitemaps & Website Planning Tool.
- Hide your scraper's IP behind the cloud.
- Proxy server to bypass Cloudflare protection.
Schema API for the Semantic Web
- Extract structured content from the semantic web.
- Standalone tool that runs alongside your web scraper, and instantly makes your existing web scraper scalable, maintainable and unblockable.
Mastering Web Scraping in Python: Crawling from Scratch (2021)
Data-Mining Wikipedia for Fun and Profit (2021)
Wikidata or Scraping Wikipedia
- Powerful Spider (Web Crawler) System in Python.
- HTML Content / Article Extractor, web scraping library in Python.
- Designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.
How to Crawl the Web with Scrapy (2021)
- Page metadata scraper with several fallback strategies.
- Take a list of domains, crawl URLs and scan for endpoints, secrets, API keys, file extensions, tokens and more.
- Crawler/scraper based on Go + colly, configurable via JSON.
- Fast web spider written in Go.
The State Of Web Scraping in 2021
- Web scraping tool for text discovery and retrieval.
- Readability / HTML Content / Article Extractor & Web Scraping library written in PHP.
Web scraping by watching requests (2021)
Effortless Crawling with Scrapy with one method (2021)
Avoiding bot detection: How to scrape the web without getting blocked?
- Crawls web pages and prints any link it can find.
- Archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns.
- Python module to bypass Cloudflare's anti-bot page.
- Scraping/crawling library for Node.js, written in TypeScript.
- Collect links to profiles by username through search engines.
Web Scraping with Go (2021)
- Collect a dossier on a person by username from thousands of sites.
Notes on Writing Web Scrapers (2021)
Scraping Websites With Logins (2021)
- Scan web pages for changes using Julia & GitHub Actions.
- Package to bypass Cloudflare's protection.
- Page Object pattern for Scrapy.
Go Download Web
- Download an entire website with Go.
- Fast link checker.
- Fast, flexible, sync/async, Python 3.6+ screen scraping client specifically for network devices.
- scrapli, but in Go.
- Get URLs from the Wayback Machine. Able to handle large outputs.
- Self-Hosted, Open Source, Change Monitoring of Web Pages.
- Detect new images and video on social media feeds and dispatch webhooks on updates.
Building a scalable scraper in Rust (2021)
- Allows you to scrape posts from a user's profile page, hashtag page, or place.