Skip to content

Latest commit

 

History

History
21 lines (16 loc) · 777 Bytes

File metadata and controls

21 lines (16 loc) · 777 Bytes

html2parquet Transform

This tranforms iterate through zip of HTML files or single HTML files and generates parquet files containing the converted document in string.

The HTML conversion is using the Trafilatura.

Output format

The output format will contain the following colums

{
	"title": "string"             // the member filename
	"document": "string"          // the base of the source archive
	"contents": "string"          // the content of the HTML
    "document_id": "string",      // the document id, a hash of `contents`
    "size": "string",             // the size of `contents`
    "date_acquired": "date",      // the date when the transform was executing
}

Parameters