Explore columnar storage format for webgraph rankings and node labels #7

Open
sebastian-nagel opened this issue Feb 28, 2023 · 0 comments

sebastian-nagel commented Feb 28, 2023

We should consider using a more efficient storage format, such as Parquet, for the host and domain-level rankings. The tooling to read Parquet files has improved in recent years, and readers for this format are now available for almost all programming languages.

Requirements (at least nice to have):

  • smaller storage footprint
  • easy analysis and quick lookups by domain name using big data tools (e.g. Amazon Athena - a wish expressed on the CC group)
    • note: this will probably require sorting the data by reverse domain name
  • still fast to get the top-n ranking domains
  • well-defined table schema including column descriptions
  • example code showing how to use the new data format
  • (optionally) also store the column holding the node IDs
    • this would make the vertex file(s) obsolete
    • could also drop the textual files holding the edges because the edges (the unlabeled graph) are stored anyway and more efficiently in the webgraph (.graph) format
  • would allow adding more columns, e.g. indegrees and outdegrees, with little overhead
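
To illustrate why sorting by reverse domain name helps with lookups, here is a minimal standard-library sketch, not a proposal for the actual implementation. The sample rows, the column meaning (lower rank value is better) and the helper names `reverse_domain`, `lookup_with_subdomains` and `top_n` are all hypothetical. The sort key is a tuple of labels rather than a dotted string, an assumption made here so that a domain and its subdomains always end up adjacent.

```python
import bisect
import heapq

# Hypothetical sample rows: (domain name, rank); a lower rank value is better.
rows = [
    ("example.com", 3),
    ("news.example.com", 7),
    ("example.org", 1),
    ("blog.example.org", 9),
    ("example.net", 2),
]


def reverse_domain(name):
    """Turn 'news.example.com' into ('com', 'example', 'news')."""
    return tuple(reversed(name.lower().split(".")))


# Sort by reversed name so that a domain and its subdomains are neighbours.
table = sorted((reverse_domain(name), rank) for name, rank in rows)
keys = [key for key, _ in table]


def lookup_with_subdomains(domain):
    """Binary-search the first matching key, then read the adjacent rows."""
    prefix = reverse_domain(domain)
    i = bisect.bisect_left(keys, prefix)
    result = []
    while i < len(keys) and keys[i][:len(prefix)] == prefix:
        result.append((".".join(reversed(keys[i])), table[i][1]))
        i += 1
    return result


def top_n(n):
    """Return the n best-ranked rows with one pass over all rows."""
    return heapq.nsmallest(n, rows, key=lambda row: row[1])
```

In this sketch the lookup only touches the matching rows, while the top-n query still has to scan every row because the data is no longer ordered by rank. Whether that scan is fast enough on the real data would need to be measured.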