html2parquet Transform

This transform iterates through a zip archive of HTML files, or individual HTML files, and generates Parquet files containing the converted document as a string.

The HTML conversion uses Trafilatura.

Output format

The output will contain the following columns:

{
    "title": "string",            // the member filename
    "document": "string",         // the base of the source archive
    "contents": "string",         // the content of the HTML
    "document_id": "string",      // the document id, a hash of `contents`
    "size": "string",             // the size of `contents`
    "date_acquired": "date"       // the date when the transform was executing
}
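
As a rough illustration of how one row per HTML member could be assembled with these columns, here is a minimal sketch using only the Python standard library. It is not the transform's actual code: the function names `build_row` and `rows_from_zip`, the `convert` parameter, the choice of SHA-256 for `document_id`, and measuring `size` in UTF-8 bytes are all assumptions made for the example. The default `convert` is a pass-through stand-in for the real HTML conversion.

```python
import hashlib
import io
import zipfile
from datetime import date
from pathlib import Path


def build_row(title: str, document: str, contents: str) -> dict:
    """Assemble one output row with the columns listed above (hypothetical helper)."""
    encoded = contents.encode("utf-8")
    return {
        "title": title,          # the member filename
        "document": document,    # the base name of the source archive
        "contents": contents,    # the converted HTML content
        # Assumption: SHA-256 hex digest; the README only says "a hash of contents".
        "document_id": hashlib.sha256(encoded).hexdigest(),
        # Assumption: size in UTF-8 bytes, stored as a string to match the listed type.
        "size": str(len(encoded)),
        "date_acquired": date.today(),
    }


def rows_from_zip(zip_bytes: bytes, archive_path: str, convert=lambda html: html) -> list:
    """Build one row per .html member of a zip archive (hypothetical helper).

    `convert` stands in for the HTML conversion step; by default it returns
    the HTML unchanged.
    """
    document = Path(archive_path).name
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".html"):
                continue  # skip non-HTML members
            html = archive.read(member).decode("utf-8", errors="replace")
            rows.append(build_row(member, document, convert(html)))
    return rows
```

In practice, a real converter would be passed as `convert`, for instance a small wrapper around Trafilatura's `extract` function that also handles the case where no content is extracted.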
