This list contains Python libraries related to web scraping and data processing.
- Network
- Web Scraping
- HTML/XML
- Text processing
- Structured Formats
- Serialization
- Natural Language Processing
- Browser automation
- Multiprocessing
- Job Queue
- Message Queue
- Cloud Computing
- URL and Network Address
- Web Content Extraction
- Asynchronous
- WebSocket
- DNS Resolving
- Computer Vision
- Proxy Server
- Whois
- Website Specific Scraper
- JavaScript Engine Bindings
- Other Python Lists
- urllib - network library (stdlib)
- requests - network library
- grab - network library (pycurl based)
- pycurl - network library (binding to libcurl)
- urllib3 - Python HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more.
- httplib2 - Small, fast HTTP client library. Features persistent connections, cache, and Google App Engine support.
- RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
- MechanicalSoup - A Python library for automating interaction with websites.
- mechanize - Stateful programmatic web browsing.
- socket - low-level networking interface (stdlib)
- Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages
- hyper - HTTP/2 Client for Python
- PySocks - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement to the socket module.
- dpkt - fast, simple packet creation / parsing, with definitions for the basic TCP/IP protocols
- pyOpenSSL - A Python wrapper around the OpenSSL library
- tlslite-ng - TLS implementation in pure python
- scapy - powerful Python-based interactive packet manipulation program and library
- impacket - low-level programmatic access to the packets of network protocols
- grab - web-scraping framework (pycurl/multicurl based)
- scrapy - web-scraping framework (twisted based).
- pyspider - A powerful spider system.
- cola - A distributed crawling framework.
- ruia - Async Python 3.6+ web scraping micro-framework based on asyncio
- ioweb - Web scraping framework based on gevent and lxml
- autoscraper - A smart, automatic and lightweight web scraper
- frontera - A scalable frontier for web crawlers
- portia - Visual scraping for Scrapy.
- restkit - HTTP resource kit for Python. It allows you to easily access to HTTP resource and build objects around it.
- requests-html - Pythonic HTML Parsing for Humans.
- ScrapydWeb - A full-featured web UI for Scrapyd cluster management, which supports Scrapy Log Analysis & Visualization, Auto Packaging, Timer Tasks, Email Notice and so on.
- Starbelly - Starbelly is a user-friendly and highly configurable web crawler front end.
- Gerapy - Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js
- cloudscraper - A Python module to bypass Cloudflare's anti-bot page.
- lxml - effective HTML/XML processing library. Supports XPATH. Written in C.
- cssselect - working with DOM tree with CSS selectors
- pyquery - working with DOM tree with jQuery-like selectors
- BeautifulSoup - slow HTML/XML processing library, written in pure Python
- html5lib - builds DOM of HTML/XML document according to WHATWG spec. That spec is used in all modern browsers.
- feedparser - parsing of RSS/ATOM feeds.
- MarkupSafe - Implements an XML/HTML/XHTML markup-safe string for Python.
- xmltodict - Makes working with XML feel like working with JSON.
- xhtml2pdf - HTML/CSS to PDF converter.
- untangle - Converts XML documents to Python objects for easy access.
- hodor - Configuration driven wrapper around lxml and cssselect.
- chopper - Tool to extract a part from HTML page with corresponding CSS rules and preserving correct HTML.
- selectolax - Python bindings to Modest engine (fast HTML5 parser with CSS selectors).
- parsel - Lets you extract data from XML/HTML documents using XPath or CSS selectors.
- html5-parser - Fast C based HTML 5 parsing for python.
- gazpacho - A simple, fast, and modern web scraping library.
- Bleach - cleaning of HTML (requires html5lib)
- sanitize - Bringing sanity to world of messed-up data.
- extruct - A library for extracting embedded metadata from HTML markup.
Libraries for parsing and manipulating plain texts.
- difflib - (Python standard library) Helpers for computing deltas.
- Levenshtein - Fast computation of Levenshtein distance and string similarity.
- fuzzywuzzy - Fuzzy String Matching.
- esmre - Regular expression accelerator.
- ftfy - Makes Unicode text less broken and more consistent automagically.
- unidecode - ASCII transliterations of Unicode text.
- uniout - Print readable chars instead of the escaped string.
- chardet - Python 2/3 compatible character encoding detector.
- xpinyin - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
- pangu.py - Spacing texts for CJK and alphanumerics.
- cchardet - High-speed universal character encoding detector; a binding to uchardet.
- awesome-slugify - A Python slugify library that can preserve unicode.
- python-slugify - A Python slugify library that translates unicode to ASCII.
- unicode-slugify - A slugifier that generates unicode slugs.
- pytils - Simple tools for processing strings in Russian (including pytils.translit.slugify)
- PLY - Implementation of lex and yacc parsing tools for Python
- pyparsing - A general purpose framework for generating parsers.
- python-nameparser - Parsing human names into their individual components.
- phonenumbers - Parsing, formatting, storing and validating international phone numbers.
- HTTP Agent Parser - Python HTTP Agent Parser
- uap-python - Python implementation of ua-parser
- python-user-agents - Browser user agent parser.
- fake-useragent - Python user agent string faker, based on world statistic of browsers
- user_agent - Generator of User-Agent data
- reppy - Modern robots.txt Parser for Python
- dateutil - Useful extensions to the standard Python datetime features
- dateparser - python parser for human readable dates
- price-parser - a small library for extracting price and currency from raw text strings.
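
As a rough illustration of the "helpers for computing deltas" that difflib provides, the sketch below scores string similarity, picks close matches from a candidate list, and produces a line diff. All sample strings are made up for the example; only the standard library is used.

```python
import difflib

# Hypothetical sample strings, e.g. two scraped product titles.
a = "Apple iPhone 12 Pro Max"
b = "Apple iPhone 12 Pro"

# ratio() returns a similarity score between 0 and 1.
similarity = difflib.SequenceMatcher(None, a, b).ratio()

# get_close_matches() ranks candidates by the same kind of score,
# best match first, and drops anything below the cutoff.
candidates = ["requests", "request", "urllib3", "pycurl"]
matches = difflib.get_close_matches("reqests", candidates, n=2, cutoff=0.6)

# unified_diff() yields a line-oriented delta between two versions of a text.
old = ["title: Foo", "price: 10"]
new = ["title: Foo", "price: 12"]
delta = list(difflib.unified_diff(old, new, lineterm=""))

print(round(similarity, 2))
print(matches)
print("\n".join(delta))
```

For large volumes of comparisons, the C-based libraries in this section (such as Levenshtein) are likely to be faster; difflib is convenient when avoiding extra dependencies matters more.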
Libraries for parsing and manipulating specific text formats.
- tablib - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
- textract - Extract text from any document, Word, PowerPoint, PDFs, etc.
- messytables - Tools for parsing messy tabular data
- rows - A common, beautiful interface to tabular data, no matter the format (currently CSV, HTML, XLS, TXT -- more coming!)
- python-docx - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
- xlwt / xlrd - Writing and reading data and formatting information from Excel files.
- XlsxWriter - A Python module for creating Excel .xlsx files.
- xlwings - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
- openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
- Marmir - Takes Python data structures and turns them into spreadsheets.
- PDFMiner - A tool for extracting information from PDF documents.
- PyPDF2 - A library capable of splitting, merging and transforming PDF pages.
- ReportLab - Allowing Rapid creation of rich PDF documents.
- pdftables - Extract tables from PDF files directly
- Python-Markdown - A Python implementation of John Gruber’s Markdown.
- Mistune - A fast, full-featured pure Python Markdown parser.
- markdown2 - A fast and complete Python implementation of Markdown
- mistletoe - A fast, extensible and spec-compliant Markdown parser in pure Python
- PyYAML - YAML implementations for Python.
- cssutils - A CSS library for Python.
- feedparser - Universal feed parser.
- sqlparse - A non-validating SQL parser.
- http-parser - HTTP request/response parser for python in C
- httptools - a Python binding for nodejs HTTP parser
- opengraph - A Python module to parse the Open Graph Protocol tags
- pefile - A multi-platform module to parse and work with Portable Executable (aka PE) files.
- psd-tools - reading Adobe Photoshop PSD files (as described in specification) to Python data structures.
- bookmarks-parser - Parses Firefox/Chrome HTML bookmarks files
- orjson - Fast, correct Python JSON library supporting dataclasses and datetimes
- ujson - Ultra fast JSON decoder and encoder written in C with Python bindings
Libraries for working with human languages.
- NLTK - A leading platform for building Python programs to work with human language data.
- spacy - Enables using State-of-the-Art Deep Learning models for common NLP tasks.
- fastai - Deep Learning library with free video tutorials + active forum community, downside of lib: GPU needed
- gensim - library for topic modeling, document indexing and similarity retrieval with large corpora
- Pattern - A web mining module for Python. It has tools for natural language processing and machine learning, among others.
- TextBlob - Providing a consistent API for diving into common NLP tasks. Stands on the giant shoulders of NLTK and Pattern.
- jieba - Chinese Words Segmentation Utilities.
- SnowNLP - A library for processing Chinese text.
- loso - Another Chinese segmentation library.
- genius - Chinese word segmentation based on Conditional Random Fields.
- langid.py - Stand-alone language identification system.
- Korean - A library for Korean morphology.
- pymorphy2 - Morphological analyzer (POS tagger + inflection engine) for Russian language.
- PyPLN - A distributed pipeline for natural language processing, made in Python. The goal of the project is to create an easy way to use NLTK for processing big corpora, with a Web interface.
- langdetect - Port of Google's language-detection library to Python
- selenium - automating real browsers (Chrome, Firefox, Opera, IE)
- Ghost.py - wrapper of QtWebKit (requires PyQT)
- Spynner - wrapper of QtWebKit (requires PyQT)
- Splinter - universal API to browser emulators (selenium webdrivers, django client, zope)
- Requestium - Integration layer between Requests and Selenium for automation of web actions.
- Splash - Lightweight, scriptable browser as a service with an HTTP API.
- pyppeteer - Headless chrome/chromium automation library (unofficial port of puppeteer)
- Playwright - Playwright is a Python library to automate Chromium, Firefox and WebKit browsers with a single API
- seleniumbase - Python framework for Web/UI testing + RPA. 🤖 🏰 Fast, easy, and reliable.
- xvfbwrapper - Python wrapper for running a display inside X virtual framebuffer (Xvfb)
- threading - standard Python library to run threads. Effective for I/O-bound tasks. Ineffective for CPU-bound tasks, because the Python GIL lets only one thread execute Python bytecode at a time.
- multiprocessing - standard python library to run processes.
- concurrent-futures - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
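
A minimal sketch of why threads suit I/O-bound work, using the standard `concurrent.futures` interface. The `fake_fetch` function and the URLs are hypothetical placeholders: the function sleeps instead of doing real network I/O so that the example runs anywhere.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Hypothetical stand-in for a network request: sleeps instead of doing I/O."""
    time.sleep(0.2)
    return url, len(url)


# Placeholder URLs; nothing is actually downloaded.
urls = [f"https://example.com/page/{i}" for i in range(5)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() returns results in the same order as the inputs.
    results = list(pool.map(fake_fetch, urls))
elapsed = time.perf_counter() - start

# The five waits overlap, so the total is far below 5 x 0.2 seconds.
print(f"fetched {len(results)} pages in {elapsed:.2f}s")
```

For CPU-bound work, `ProcessPoolExecutor` from the same module offers the same `map()` interface backed by processes, so each worker has its own interpreter and GIL.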
Libraries for asynchronous networking programming.
- asyncio - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
- Twisted - An event-driven networking engine.
- Tornado - A Web framework and asynchronous networking library.
- pulsar - Event-driven concurrent framework for Python.
- diesel - Greenlet-based event I/O Framework for Python.
- gevent - A coroutine-based Python networking library that uses greenlet.
- eventlet - Asynchronous framework with WSGI support.
- Tomorrow - Magic decorator syntax for asynchronous code.
- grequests - Make asynchronous HTTP Requests easily.
- celery - An asynchronous task queue/job queue based on distributed message passing.
- huey - Little multi-threaded task queue.
- mrq - Mr. Queue - A distributed worker task queue in Python using Redis & gevent.
- RQ - lightweight task queue manager based on redis
- simpleq - A simple, infinitely scalable, Amazon SQS based queue.
- python-gearman - python API for Gearman
- kombu - Messaging library for Python
- picloud - executing Python code in the cloud
- dominoup.com - executing R, Python and MATLAB code in the cloud
- minigun-requests - Web scraping API to outsource tons of GET & xpath to cloud computing
- pythonista-chromeless - AWS Lambda function that executes given Python code on Selenium
Libraries for parsing email.
- flanker - An email address and MIME parsing library.
- Talon - Mailgun library to extract message quotations and signatures.
Libraries for parsing/modifying URLs, network addresses, domain names.
- furl - A small Python library that makes manipulating URLs simple.
- purl - A simple, immutable URL class with a clean API for interrogation and manipulation.
- urllib.parse - interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.” (stdlib)
- netaddr - A Python library for representing and manipulating network addresses.
- micawber - A small library for extracting rich content from URLs.
- tldextract - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
- find_domains - a library to search for domain names in text data
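
The three operations described for `urllib.parse` (splitting a URL into components, recombining them, and resolving a relative URL against a base) can be sketched as follows. The URLs are made-up examples.

```python
from urllib.parse import parse_qs, urljoin, urlsplit, urlunsplit

# Hypothetical URL used only for illustration.
url = "https://example.com:8080/catalog/items?page=2&sort=price#top"

# Break the URL into scheme, network location, path, query and fragment.
parts = urlsplit(url)
print(parts.scheme, parts.hostname, parts.port, parts.path)

# Decode the query string into a dict mapping each key to a list of values.
query = parse_qs(parts.query)
print(query)

# Combine the components back into a URL string.
rebuilt = urlunsplit(parts)

# Convert a relative URL to an absolute one, given a base URL.
absolute = urljoin("https://example.com/catalog/items", "../about.html")
print(absolute)
```

When crawling, `urljoin` is typically applied to every `href` found on a page, using the page's own URL as the base.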
Libraries for extracting web contents.
- newspaper - News extraction, article extraction and content curation in Python.
- python-goose - HTML Content/Article Extractor.
- scrapely - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
- htmldate - Find creation date using common structural patterns or text-based heuristics.
- lassie - Web Content Retrieval for Humans.
- html2text - Convert HTML to Markdown-formatted text.
- libextract - Extract data from websites.
- python-readability - Fast Python port of arc90's readability tool.
- sumy - A module for automatic summarization of text documents and HTML pages.
- Haul - An Extensible Image Crawler.
- you-get - A YouTube/Youku/Niconico video downloader written in Python 3.
- youtube-dl - A small command-line program to download videos from YouTube.
- WikiTeam - Tools for downloading and preserving wikis.
- linkchecker - check links in web documents or full websites
- python-sitemap - Mini website crawler to make sitemap from a website.
- trafilatura - Fast extraction of main text and comments along with structure, conversion to TXT, CSV & XML.
- advertools - A customizable crawler to analyze SEO and content of pages and websites.
- photon - Incredibly fast crawler designed for OSINT
Libraries for working with WebSocket.
- Crossbar - Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
- AutobahnPython - WebSocket & WAMP for Python on Twisted and asyncio.
- WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.
- dnspython - a powerful DNS toolkit for python
- dnsyo - Check your DNS against over 1500 global DNS servers.
- pycares - interface to c-ares. c-ares is a C library that performs DNS requests and name resolutions asynchronously
- OpenCV - Open Source Computer Vision Library.
- SimpleCV - Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
- mahotas - fast computer vision algorithms (all implemented in C++) operating over numpy arrays.
- scylla - Intelligent proxy pool for Humans
- ProxyBroker - Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS
- shadowsocks - A fast tunnel proxy that helps you bypass firewalls (TCP & UDP support, User management API, TCP Fast Open, Workers and graceful restart, Destination IP blacklist)
- tproxy - tproxy is a simple TCP routing proxy (layer 7) built on Gevent that lets you configure the routine logic in Python
- python-whois - A python module for retrieving and parsing WHOIS data
- twitter-scraper - Scrape the Twitter Frontend API without authentication
- Ultimate-Facebook-Scraper - A bot which scrapes almost everything about a Facebook user's profile
- instagram-scraper - Scrapes an instagram user's photos and videos