scrapelect is a web scraping language inspired by CSS that turns
a web page into structured JSON data. Select elements with CSS
selectors, apply filters to extract and modify the data you want from
a web page, and get the output in a structured, machine-readable,
interoperable format.
Install the Rust toolchain. Using cargo, run:
$ cargo install scrapelect
to install the scrapelect interpreter.
Write a scrapelect program into a .scrp file. Documentation
for the language can be found in the scrapelect book.
A quick example, title.scrp, retrieves the title of a Wikipedia article:
title: .mw-page-title-main {
  content: $element | text();
};
Run the .scrp file with the URL of the web page to scrape:
$ scrapelect title.scrp "https://en.wikipedia.org/wiki/Cat"
It will output:
{
  "title": {
    "content": "Cat"
  }
}
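Because the output is plain JSON, other programs can consume it directly. The following is a minimal sketch in Python, assuming the scrapelect binary is on your PATH and that it prints its JSON result to standard output, as the example above suggests. The helper names run_scrapelect and parse_output are hypothetical and are not part of scrapelect.

```python
import json
import subprocess


def parse_output(raw: str) -> dict:
    """Parse the JSON text printed by the scrapelect interpreter."""
    return json.loads(raw)


def run_scrapelect(program_path: str, url: str) -> dict:
    """Run a .scrp program against a URL and return the parsed result.

    Assumption: the interpreter is invoked as `scrapelect <file> <url>`
    (as in the example above) and writes its JSON to standard output.
    """
    result = subprocess.run(
        ["scrapelect", program_path, url],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output as str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return parse_output(result.stdout)


# Example usage (requires scrapelect to be installed and network access):
#   data = run_scrapelect("title.scrp", "https://en.wikipedia.org/wiki/Cat")
#   print(data["title"]["content"])
```

Keeping the parsing step separate from the process call makes it possible to check the JSON handling without running the interpreter.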
- The scrapelect book contains documentation on language concepts and how to write a scrapelect program.
- Additionally, documentation for scrapelect's built-in filters is located at docs.rs.
- Developer-level documentation is also at docs.rs, but it is currently incomplete.
- GitHub issues and discussions are great places to report bugs, request features, and get help using scrapelect.
- Also, consider submitting a pull request to contribute to the code or documentation.
- See the contributing chapter of the scrapelect book for more information on contributing to scrapelect.
scrapelect is available under either the MIT or the Apache 2 license, at your
option. Copies of these licenses are included as LICENSE-MIT and
LICENSE-APACHE in the root directory.
scrapelect: scrape + select, also -lect