We should consider using a more efficient storage format, such as Parquet, for the host and domain-level rankings. The tooling to read Parquet files has improved in recent years, and readers for this format are now available for almost all programming languages.
Requirements (at least nice to have):
- smaller storage footprint
- easy analysis and quick lookups by domain name using big data tools (e.g. Amazon Athena, a wish expressed on the CC group)
  - note: this will probably require sorting the data by reversed domain name
- retrieving the top-n ranking domains should still be fast
- well-defined table schema including column descriptions
- example code showing how to use the new data format
- (optionally) also store the column holding the node IDs
  - this would make the vertex file(s) obsolete
  - the textual files holding the edges could also be dropped, because the edges (the unlabeled graph) are already stored, and more efficiently, in the webgraph (`.graph`) format
- would allow adding more columns, e.g. indegrees and outdegrees, with little overhead
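
To illustrate why sorting by reversed domain name helps lookups, here is a minimal standard-library sketch. It does not read Parquet; it only mimics the access pattern on an in-memory list. The rows, the scores, and the helper names (`reverse_domain`, `lookup`, `under`, `top_n`) are made up for illustration and are not part of any existing ranking file or tool.

```python
import bisect
import heapq


def reverse_domain(name: str) -> str:
    """Turn 'news.example.com' into 'com.example.news'."""
    return ".".join(reversed(name.lower().split(".")))


# Hypothetical rows: (name, score). The values are invented for this sketch.
rows = [
    ("example.org", 0.12),
    ("news.example.com", 0.40),
    ("example.com", 0.90),
    ("shop.example.com", 0.25),
    ("example.net", 0.05),
]

# Sort once by reversed name, as a sorted file would be laid out on disk.
table = sorted((reverse_domain(name), name, score) for name, score in rows)
keys = [row[0] for row in table]


def lookup(name: str):
    """Exact lookup by binary search on the reversed-name key."""
    key = reverse_domain(name)
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return table[i]
    return None


def under(name: str):
    """Rows strictly below `name`, e.g. all hosts of one domain.

    Keys starting with 'com.example.' form one contiguous range that ends
    before 'com.example/' because '/' is the character right after '.'.
    """
    key = reverse_domain(name)
    lo = bisect.bisect_left(keys, key + ".")
    hi = bisect.bisect_left(keys, key + "/")
    return table[lo:hi]


def top_n(n: int):
    """Top-n by score; this needs a full scan when data is sorted by name."""
    return heapq.nlargest(n, rows, key=lambda row: row[1])


print(lookup("example.com"))
print([row[1] for row in under("example.com")])
print(top_n(2))
```

The sketch also shows the tension between two of the requirements: data sorted by name makes lookups cheap but turns top-n into a scan, so a real layout would probably need a rank column or a second sort order to keep top-n fast.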