This list contains JavaScript libraries for web scraping and data processing. It focuses on libraries that run in Node.js, without a real web browser.
- JavaScript Web Scraping
- Network
- Web-scraping Frameworks
- HTML/XML Parsing
- Text processing
- Specific Formats Processing
- Natural Language Processing
- Browser automation and emulation
- Multiprocessing
- Queue
- URL and Network Address Manipulation
- Web Content Extracting
- Asynchronous
- WebSocket
- DNS Resolving
- Computer Vision
- Proxy Server
- Other JavaScript Lists
- Data Structure
Libraries for making network requests.
- request - Simplified HTTP request client.
- socks5-http-client - SOCKS v5 HTTP client implementation in JavaScript for Node.js
- rest - RESTful HTTP client for JavaScript
- wreck - HTTP Client Utilities
- got - Simplified HTTP requests
- node-fetch - A light-weight module that brings window.fetch to Node.js
- bent - Functional HTTP client for Node.js w/ async/await
- axios - Promise based HTTP client for the browser and node.js
- superagent - Ajax for Node.js and browsers (JS HTTP client)
- urllib - Request HTTP(s) URLs in a complex world
- needle - Nimble, streamable HTTP client for Node.js. With proxy, iconv, cookie, deflate & multipart support
Frameworks for web crawling and scraping.
- webparsy - NodeJS lib and cli for scraping websites using Puppeteer and YAML
- node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery
- node-simplecrawler - Flexible event driven crawler for node
- Apify SDK - The scalable web crawling and scraping library for JavaScript. Supports data extraction and web automation jobs with headless Chrome and Puppeteer, among other tools.
- Ayakashi - The next generation web scraping framework. Features all the necessary tools to create reliable and maintainable scraping and automation systems.
- pjscrape - A web-scraping framework written in JavaScript, using PhantomJS and jQuery
Libraries for parsing HTML and XML.
- General
- parse5 - WHATWG HTML5 specification-compliant, fast and ready for production HTML parsing/serialization toolset for Node and io.js
- htmlparser2 - Forgiving HTML and XML parser
- sax-js - A SAX-style parser for JS
- cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server
- Sanitizing
Libraries for parsing and manipulating plain text.
- General
- string.js - Extra JavaScript string methods.
- accounting.js - A lightweight JavaScript library for number, money and currency formatting - fully localisable, zero dependencies.
- validator.js - String validation and sanitization.
- Date and time
- moment - Parse, validate, manipulate, and display dates in javascript.
- moment-timezone - Timezone support for moment.js.
- date - Date() for humans.
- ms.js - Tiny millisecond conversion utility.
- HTML entities
- he - A robust HTML entity encoder/decoder written in JavaScript.
- Money
- money.js - Simple and tiny JavaScript library for realtime currency conversion and exchange rate calculation, from any currency, to any currency.
- Color
- User Agent
- UAParser.js - Lightweight JavaScript-based User-Agent string parser. Supports browser & node.js environment.
- Semantic Version
- node-semver - The semver parser for node
Libraries for parsing and manipulating specific text formats.
- General
- jBinary - High-level I/O (loading, parsing, manipulating, serializing, saving) for binary files with declarative syntax for describing file types and data structures.
- Office
- js-xlsx - XLSX / XLSM / XLSB / XLS / SpreadsheetML (Excel Spreadsheet) / ODS parser and writer
- CSV
- JSON
- json3 - A modern JSON implementation compatible with nearly all JavaScript platforms.
- EXIF
- exif-js - JavaScript library for reading EXIF image metadata
- CSS
- parse-css - Standards-based CSS Parser
- parser-lib CSS parser - The ParserLib CSS parser is a CSS3 SAX-inspired parser written in JavaScript. By default, the parser only deals with standard CSS syntax and doesn't do validation (checking of property names and values).
- Torrent
- parse-torrent - Parse a torrent identifier (magnet uri, .torrent file, info hash)
- SQL
- SQL Parser - SQL Parser is a lexer, grammar and parser for SQL written in JS. Currently it is only capable of parsing fairly basic SELECT queries.
- YAML
- JS-YAML - JavaScript YAML parser and dumper. Very fast.
- Markdown
- markdown-it - Markdown parser, done right. 100% CommonMark support, extensions, syntax plugins & high speed
- Atom/RSS
- node-feedparser - Robust RSS, Atom, and RDF feed parsing in Node.js
- Netscape Bookmarks (Firefox, Google Chrome, ...)
- node-bookmarks-parser - Parses Firefox/Chrome HTML bookmarks files
Libraries for working with human languages.
- General
- natural - General natural language facilities for Node.js
- nlp_compromise - Natural language processing
- Hanzi - HanziJS is a Chinese character and NLP module for Chinese language processing for Node.js
- salient - Machine Learning, Natural Language Processing and Sentiment Analysis Toolkit for Node.js
- node-summary - Node module that summarizes text using a naive summarization algorithm
- Stemmer
- snowball-js - JavaScript implementation of the popular Snowball word stemming NLP algorithm
- porter-stemmer - Martin Porter's stemmer for Node.js
- Porter-Stemmer - A JavaScript implementation of the Porter Stemmer
- lunr-languages - A collection of language stemmers and stopwords for the Lunr JavaScript library
- Language detection
- franc - Natural language detection
- guessLanguage.js - A natural language detection library based on trigram statistical analysis for Node.js
Tools for browser automation and emulation.
- phantomjs - Scriptable Headless WebKit.
- slimerjs - A PhantomJS-like tool running Gecko.
- casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
- zombie - Insanely fast, full-stack, headless browser testing using node.js.
- nightmare - A high-level browser automation library. Early versions wrapped PhantomJS; version 2 and later use Electron.
- puppeteer - Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.
- headless-chrome-crawler - Distributed crawler powered by Headless Chrome
- puppeteer-recorder - Puppeteer recorder is a Chrome extension that records your browser interactions and generates a Puppeteer script.
- wendigo - Test-oriented headless browser, built on top of Puppeteer.
- Playwright - Node.js library to automate Chromium, Firefox and WebKit with a single API
Libraries for spawning and managing processes.
- nexpect - spawn and control child processes in node.js with ease
- respawn - Spawn a process and restart it if it crashes
- node-webworker - A WebWorkers implementation for NodeJS
Libraries for asynchronous network programming.
- socket.io - Realtime application framework (Node.JS server)
- engine.io - The implementation of the transport-based, cross-browser/cross-device bi-directional communication layer for Socket.IO
- async - Async utilities for node and the browser
Libraries for working with job queues.
- kue - Kue is a priority job queue backed by redis, built for node.js
- bull - A lightweight, robust and fast job processing queue. Carefully written for rock solid stability and atomicity.
Libraries for parsing email.
- mailparser - Decode mime formatted e-mails
Libraries for parsing/modifying URLs and network addresses.
- URL
- query-string - Parse and stringify URL query strings.
- URI.js - JavaScript URL mutation library.
- jsurl - Lightweight URL manipulation with JavaScript.
- arg.js - Lightweight URL argument and parameter parser
- Network Address
- node-ip - IP address tools for node.js
- ip-address - A library for parsing and manipulating IPv6 (and v4) addresses in JavaScript
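As a rough baseline for what the URL and network address libraries above do, here is a minimal sketch that uses only Node.js built-ins (the WHATWG `URL` class, `URLSearchParams`, and `net.isIP`) rather than any of the listed packages. The helper names `normalizeUrl` and `ipVersion` are hypothetical and exist only for this illustration.

```javascript
// Sketch using only Node.js built-ins; none of the libraries
// listed above are required or demonstrated here.
const net = require('net');

// Hypothetical helper: parse a URL and return a normalized copy
// with query parameters sorted by name and the fragment removed.
function normalizeUrl(input) {
  const url = new URL(input);   // throws a TypeError on invalid input
  url.hash = '';                // drop the fragment
  url.searchParams.sort();      // stable sort by parameter name
  return url.toString();
}

// Hypothetical helper: 4 for IPv4, 6 for IPv6, 0 for anything else.
function ipVersion(host) {
  return net.isIP(host);
}

const example = 'https://example.com/search?q=scraping&page=2#results';
console.log(normalizeUrl(example));
console.log(new URL(example).searchParams.get('q'));
console.log(ipVersion('192.168.0.1'), ipVersion('::1'), ipVersion('example.com'));
```

Normalizing URLs this way is one common approach to de-duplicating links during a crawl. The listed libraries may offer richer mutation and parsing APIs than the built-ins shown here.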
Libraries for extracting web content.
- node-read - Get Readable Content from any page. Based on Arc90's readability project using cheerio engine.
- node-ytdl-core - YouTube video downloader in JavaScript
- ImageResolver - Does its best to determine the main image on a URL without loading all images.
Libraries for working with WebSocket.
- websocket.io - WebSocket.IO is an abstraction of the websocket server previously used by Socket.IO. It has the broadest support for websocket protocol/specifications and an API that allows for interoperability with higher-level frameworks such as Engine, Socket.IO's realtime core.
- WebSocket-Node - A WebSocket Implementation for Node.JS (Draft -08 through the final RFC 6455)
Libraries for DNS resolving.
- multicast-dns - Low level multicast-dns implementation in pure JavaScript
- node-dns - Replacement dns module in pure JavaScript for Node.js
Libraries for computer vision.
- tracking.js - A modern approach for Computer Vision on the web.
- ocrad.js - OCR in JavaScript via Emscripten.
Libraries for running proxy servers.
- toxy - Hackable HTTP proxy to simulate server failure scenarios and unexpected network conditions
- proxy-chain - Node.js implementation of a proxy server (think Squid) with support for SSL, authentication and upstream proxy chaining