Is it possible to use further filters? #4
I've added some filters for score and date; it's been on my TODO list for a while. What issues are you running into: disk space or speed? Maybe what you want can be achieved with flags rather than additional filters? I'm also open to pull requests that add regex filters on title and text bodies.
Cool, that's great, thanks!
Good question. I'm still a bit lost, and both aspects are a challenge for me. Ideally, I need to parse the complete dumps and extract/process all comments that contain specific words, preferably across all subreddits, or at least across a subset of maybe 100 mid-size subreddits. I previously used the Pushshift API via ..., but it turns out that this is challenging too. I have an M1 MacBook and I could put everything on a large external disk, but from what I have seen so far (I converted 2021-06 as a pilot), putting everything into an SQLite database using your tool would create a huge database, and processing would take ages. I guess another option is to do all of this on a VM (e.g. Google Cloud) to use lots of CPUs, but then I need a giant disk and have to upload everything. Do you happen to have any advice?
What do you mean by flags?
I know zero Rust, unfortunately, so I can't help there :/
Combining both flags, I have been able to create an SQLite database in under 36 hours. The final size of the database is between 3.5 and 4 TB.
Thanks for this information!
Cool, I'll try. Plus, this suggests that in the 'safe mode' one can just re-run the DB creation command if interrupted, and it would more or less pick up from where it left off?
So this means that without FTS, queries would be much slower, right? Still quick enough to be manageable? The thing is that the columns that are part of the FTS table don't contain all the information I need (e.g. ...).
An SQLite database of the complete dumps? Wow, that's impressive. That would certainly be feasible both in terms of processing time and disk usage. Were you using a "normal" laptop/desktop, or did this use 100 CPUs and 1 TB of RAM in the cloud? 😉 So would you recommend I do this on my local machine using a big external hard disk rather than on a cloud VM? Finally, you don't plan on sharing the resulting DB, do you? There has been some interest in this on the Pushshift subreddit. It's not only me, I think :)
The importer will try to ingest all of the files in the input directory, but if the contents are already in the DB, no changes will be made.
Much slower. I couldn't say how long a text search would take; I haven't tried it yet. If you limit by some field such as author or subreddit, you can probably get significant performance improvements by adding an index on that field. For example, I created an index on subreddit in my DB (which took some time) and ran this query:
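The exact query isn't preserved in this thread; a representative version, assuming the comments live in a table named `comment` with `subreddit` and `body` columns (those names are assumptions, not taken from the tool's actual schema), might look like this:

```sql
-- Illustrative sketch only: the table/column names (comment, subreddit, body) are assumptions.
CREATE INDEX IF NOT EXISTS idx_comment_subreddit ON comment (subreddit);

-- The index narrows the scan to one subreddit; the LIKE match still reads each row's body.
SELECT count(*)
FROM comment
WHERE subreddit = 'AskHistorians'
  AND body LIKE '%pushshift%';
```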
So, a text search with an index but without FTS took about 2 minutes on a small subreddit. This would probably be much faster on an SSD, but I don't have one large enough.
As you suggest, you can use a join, e.g. for comments:
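The original example isn't preserved here; a minimal sketch of the idea, assuming an FTS table named `comment_fts` that shares rowids with a `comment` table (both names and the listed columns are assumptions), could be:

```sql
-- Illustrative sketch only: comment_fts, comment, and the listed columns are assumed names.
-- The FTS table finds matching rowids quickly; the join pulls the remaining columns
-- (e.g. author, created_utc) from the full comment table.
SELECT c.author, c.created_utc, c.body
FROM comment_fts
JOIN comment AS c ON c.rowid = comment_fts.rowid
WHERE comment_fts MATCH 'pushshift';
```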
This was done on a desktop computer with an AMD Ryzen 7 2700X CPU and 64 GB of memory, on a spinning disk drive. An import should be possible on an M1 given enough disk space. A VM in the cloud could work as well, though you might want to be mindful of egress costs if you want to export that DB.
I'll consider it. I'm not sure if my ISP would appreciate that kind of upload usage.
Thanks a lot, that's very helpful!
Good to know, thanks!
I didn't know I could use ...
Again, very good to know, thanks. I wouldn't have enough disk space on my internal SSD, so I would have to use an external hard disk. I hope that's still feasible.
Oh yeah, I see. It just seems like a big waste of time and energy that so many people parse the dumps from scratch. I think the suggestion on the Pushshift subreddit was to share a DB version of the dumps via P2P -- a torrent or something like that.
I have just written code to execute this complete pipeline on Google Colab (download the dumps, extract them to a database, filter by keyword, and then download the result as ...). I had to treat ... By the way, queries with a ...
I had to disable building FTS by default. Without filters, it can take a very long time to build the database. FTS can be enabled with ...
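For context, once the FTS index exists, the keyword search above no longer needs a full table scan; a minimal sketch, again assuming a hypothetical FTS table named `comment_fts`, would be:

```sql
-- Illustrative sketch only: comment_fts is an assumed table name.
-- An FTS MATCH uses the full-text index instead of scanning every row with LIKE.
SELECT count(*) FROM comment_fts WHERE comment_fts MATCH 'pushshift';
```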
It might be some time before I can implement this. I'm happy to review PRs.
Is it possible to use further filters? It would be brilliant to be able to filter by date or by string matches of the text column (e.g. all comments that contain the word pushshift).
If not, is there an efficient way to work around this? Ideally, I would like to work with the full dumps -- the whole time range and all subreddits -- but having little experience with 'big data', I find processing the sheer amount of data very hard.
Thanks a lot for this tool btw. I just tried it for the first time, but it seems very promising :)