Fix parser for recent WBM snapshots #60

Open
travisbrown opened this issue Mar 16, 2022 · 1 comment

Comments

@travisbrown
Owner

Several people have noted that the deleted-tweet reports are missing some recent Wayback Machine (WBM) snapshots (e.g. here). I suspect these snapshots get some kind of special handling for manually archived tweets, but I haven't looked in detail. In any case, we need to fix the parser.

@travisbrown
Owner Author

Some notes:

  • This format is fairly rare (for one recent test scrape I found 42 snapshots with this format out of ~20k total).
  • The digests returned from the CDX index for these snapshots are consistently incorrect (possibly because they're computed in some undocumented, non-standard way).
  • These snapshots are a minimal HTML representation that includes Schema.org metadata (something that Twitter doesn't seem to use anywhere else). This seems like an experiment at Twitter, possibly in conversation with the Internet Archive, but I haven't been able to find any documentation or anyone who's able to talk about it.
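
To make the two observations above concrete, here is a rough Python sketch of how a snapshot in this format might be flagged. It is not the project's parser. It assumes that the Schema.org metadata appears as microdata `itemtype` attributes (the notes above don't say which syntax is used) and that a CDX digest is normally the base32-encoded SHA-1 of the response body. The names `sha1_base32`, `SchemaOrgDetector`, and `classify_snapshot` are made up for illustration.

```python
import base64
import hashlib
from html.parser import HTMLParser


def sha1_base32(body: bytes) -> str:
    """Base32-encoded SHA-1 of the payload (assumed to be the usual CDX digest form)."""
    return base64.b32encode(hashlib.sha1(body).digest()).decode("ascii")


class SchemaOrgDetector(HTMLParser):
    """Collects itemtype values that mention schema.org.

    Assumption: the metadata is microdata; JSON-LD or RDFa would need other checks.
    """

    def __init__(self):
        super().__init__()
        self.item_types = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == "itemtype" and value and "schema.org" in value:
                self.item_types.append(value)


def classify_snapshot(body: bytes, cdx_digest: str) -> dict:
    """Report whether a snapshot body carries Schema.org markers and whether
    its locally computed digest agrees with the one the CDX index returned."""
    detector = SchemaOrgDetector()
    detector.feed(body.decode("utf-8", errors="replace"))
    detector.close()
    return {
        "has_schema_org": bool(detector.item_types),
        "item_types": detector.item_types,
        "digest_matches": sha1_base32(body) == cdx_digest.strip().upper(),
    }
```

If both assumptions hold, a snapshot with `has_schema_org` true and `digest_matches` false would be a candidate for the rare format described here, and could be routed to a separate parsing path rather than dropped on digest validation.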
