suaviloquence/scrapelect

scrapelect

scrapelect is a web scraping language inspired by CSS that turns a web page into structured JSON data. Select elements with CSS selectors, apply filters to extract and modify the data you want from a web page, and get the output in a structured, machine-readable, interoperable format.

installation

Install the Rust toolchain. Using cargo, run:

$ cargo install scrapelect

to install the scrapelect interpreter.

usage

Write a scrapelect program into a .scrp file. Documentation for the language can be found in the scrapelect book.

A quick example, title.scrp, retrieves the title of a Wikipedia article:

title: .mw-page-title-main {
  content: $element | text();
};

Run the program with the URL of the web page to scrape:

$ scrapelect title.scrp "https://en.wikipedia.org/wiki/Cat"

It will output:

{
  "title": {
    "content": "Cat"
  }
}
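
Because the output is plain JSON, it can be consumed from any scripting language. The following is a minimal Python sketch, not part of scrapelect itself. The helper names `extract_title` and `run_scrapelect` are hypothetical, and `run_scrapelect` assumes that the `scrapelect` binary is on your PATH and prints its JSON result to standard output.

```python
import json
import subprocess

# Sample output copied from the title.scrp example above.
SAMPLE_OUTPUT = """
{
  "title": {
    "content": "Cat"
  }
}
"""


def extract_title(output: str) -> str:
    """Parse the JSON produced by title.scrp and return the title text."""
    data = json.loads(output)
    return data["title"]["content"]


def run_scrapelect(program: str, url: str) -> str:
    """Hypothetical helper: invoke the CLI as shown above and return its stdout.

    Assumption: scrapelect is installed, on PATH, and writes JSON to stdout.
    """
    result = subprocess.run(
        ["scrapelect", program, url],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Demonstrate the parsing step on the sample output (no network access needed).
print(extract_title(SAMPLE_OUTPUT))
```

With scrapelect installed, `extract_title(run_scrapelect("title.scrp", "https://en.wikipedia.org/wiki/Cat"))` would combine the two steps under the same assumptions.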

documentation

  • The scrapelect book documents language concepts and explains how to write a scrapelect program.
  • Documentation for scrapelect's built-in filters is located at docs.rs.
  • Developer-level documentation is also at docs.rs, but it is currently incomplete.

community

  • GitHub issues and discussions are great places to report bugs, request features, and get help using scrapelect.
  • Also, consider submitting a pull request to contribute to the code or documentation.
  • See the contributing chapter of the scrapelect book for more information on contributing to scrapelect.

license

scrapelect is available under either the MIT license or the Apache 2.0 license, at your option. Copies of these licenses are included as LICENSE-MIT and LICENSE-APACHE in the root directory.

scrapelect: scrape + select, also -lect
