produce WET files? #55
Comments
This is a good idea, but as you can see from the other issues opened by @sebastian-nagel, we're short on engineering resources for news crawl work.
Thanks for the clarification.
@wumpus as a partial solve, is there some up-to-date way that we can generate WET files ourselves from the news WARC files? I'm trying to run the WET extractor, as per Sebastian's 2017 comments (https://groups.google.com/g/common-crawl/c/hsb90GHq6to), but running into some issues with building ia-hadoop-tools. [edit: I've now found another issue related to this -- https://github.com/commoncrawl/ia-hadoop-tools/issues/4]
In theory all of the code needed to make WETs is public from us, but unfortunately we have limited Sebastian time, and I am not so good at Java! If you come up with some better instructions, I'm happy to check them in somewhere. That's a great example of something that's in the mailing list archive that ought to be promoted to be directly visible and updated for modern versions.
@wumpus Thanks for getting back to me. I got the WET extractor running in the end, just a small issue since ia-hadoop-tools doesn't build with recent Maven versions. I posted what worked back to https://groups.google.com/g/common-crawl/c/hsb90GHq6to/m/V5W-gUBbAgAJ
@eukaryoting if you could put that in the form of a pull request, I'd be happy to review it and @wumpus could get it committed.
I'm not sure if this is the right place to ask this (feel free to direct me elsewhere), but would it be possible to also produce WET files from this library?
Many downstream libraries that build on CC consume WET files (such as oscar-project/ungoliant), so it would be useful if WET files were available alongside the WARC files.
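For anyone unsure what a WET file would contain, here is a rough sketch of a single WET-style record: a WARC record of type `conversion` whose body is the plain text extracted from a page. This is not the ia-hadoop-tools extractor discussed in this thread, and the text extraction is deliberately naive. The names `TextExtractor`, `html_to_text`, and `make_conversion_record` are made up for illustration, and the sketch assumes the HTML has already been pulled out of a WARC response record.

```python
import uuid
from datetime import datetime, timezone
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Naive HTML-to-text conversion: one line per text fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def make_conversion_record(target_uri, html):
    """Return one WET-style 'conversion' record as bytes (hypothetical helper)."""
    payload = html_to_text(html).encode("utf-8")
    headers = [
        ("WARC-Type", "conversion"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("Content-Type", "text/plain"),
        # Content-Length counts the bytes of the text block only.
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % kv for kv in headers) + "\r\n"
    # A WARC record ends with two CRLF pairs after the block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


if __name__ == "__main__":
    page = "<html><body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>"
    print(make_conversion_record("http://example.com/", page).decode("utf-8"))
```

The sketch leaves out things a real WET file would normally carry, such as a leading `warcinfo` record, `WARC-Refers-To` and digest headers, and gzip compression, so treat it as a picture of the record layout rather than a replacement for the extractor.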