produce WET files? #55

Open
chris-ha458 opened this issue Jul 19, 2023 · 6 comments

@chris-ha458

I'm not sure if this is the right place to ask this (feel free to direct me elsewhere),
but would it be possible to also produce WET files from this library?

Many downstream consumers of CC data read WET files (such as oscar-project/ungoliant),
and it would be useful if WET files were available alongside the WARC files.

@wumpus
Member

wumpus commented Jul 19, 2023

This is a good idea, but as you can see from the other issues opened by @sebastian-nagel , we're short on engineering resources for news crawl work.

@chris-ha458
Author

Thanks for the clarification.
I will leave this issue open for future reference. Hopefully, if it becomes relevant again, the discussion can continue here.
Otherwise, if it is decided that this is not worth attempting (even setting aside the shortage of engineering resources),
feel free to close it.

@eukaryoting

eukaryoting commented Sep 1, 2023

@wumpus as a partial solution, is there some up-to-date way that we can generate WET files ourselves from the news WARC files?

I'm trying to run the WET extractor, as per Sebastian's 2017 comments (https://groups.google.com/g/common-crawl/c/hsb90GHq6to), but I'm running into some issues with building ia-hadoop-tools.

[edit: I've now found another issue related to this -- https://github.com/commoncrawl/ia-hadoop-tools/issues/4]
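
For readers unfamiliar with the format: a WET file is a WARC file whose records have type `conversion` and carry the plain text extracted from each crawled response. The block below is a minimal, standard-library-only Python sketch of that idea. It is not the ia-hadoop-tools extractor discussed in this thread, and its text extraction is far cruder. The names `html_to_text` and `make_wet_record` are hypothetical, and the header set is an assumption modelled on typical WET records.

```python
import base64
import hashlib
import uuid
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Very rough HTML-to-text conversion (illustrative only)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def make_wet_record(target_uri, warc_date, refers_to, html):
    """Build one WET-style 'conversion' record as bytes.

    refers_to is the WARC-Record-ID of the original response record.
    """
    payload = html_to_text(html).encode("utf-8")
    # Block digest in the usual "sha1:<base32>" form.
    digest = base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")
    headers = [
        "WARC/1.0",
        "WARC-Type: conversion",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {warc_date}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Refers-To: {refers_to}",
        f"WARC-Block-Digest: sha1:{digest}",
        "Content-Type: text/plain",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A WARC record ends with two CRLFs after the content block.
    return head + payload + b"\r\n\r\n"
```

A real pipeline would read response records from the news WARC files, decode each payload using its declared charset, and write the resulting records into a gzipped output file, one gzip member per record.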

@wumpus
Member

wumpus commented Sep 1, 2023

In theory all of the code needed to make WETs is public from us, but unfortunately we have limited Sebastian time, and I am not so good at Java! If you come up with some better instructions, I'm happy to check them in somewhere. That's a great example of something that's in the mailing list archive that ought to be promoted to be directly visible and updated for modern versions.

@eukaryoting

@wumpus Thanks for getting back to me. I got the WET extractor running in the end; there was just one small issue, namely that ia-hadoop-tools doesn't build with recent Maven versions. I posted what worked back to https://groups.google.com/g/common-crawl/c/hsb90GHq6to/m/V5W-gUBbAgAJ

@tfmorris

@eukaryoting if you could put that in the form of a pull request, I'd be happy to review it and @wumpus could get it committed.
