Is it possible to use further filters? #4

wuqui opened this issue Dec 1, 2021 · 8 comments

wuqui commented Dec 1, 2021

Is it possible to use further filters? It would be brilliant to be able to filter by date or by string matches of the text column (e.g. all comments that contain the word pushshift).

If not, is there an efficient way to work around this? Ideally, I would like to work with the full dumps -- the whole time range and all subreddits -- but having little experience with 'big data', I find processing the sheer volume of data very hard.

Thanks a lot for this tool btw. I just tried it for the first time, but it seems very promising :)

Paul-E (Owner) commented Dec 5, 2021

I've added some filters for score and date; it's been on my TODO list for a while. What issues are you running into? Are they disk space or speed? Maybe what you want can be achieved with flags rather than additional filters?

I'm also open to pull requests that add regex filters on title and text bodies.

wuqui (Author) commented Dec 6, 2021

I've added some filters for score and date; it's been on my TODO list for a while.

Cool, that's great, thanks!

What issues are you running into? Are they disk space or speed?

Good question. I'm still a bit lost, and both aspects are a challenge for me. Ideally, I need to parse the complete dumps and extract/process all comments that contain specific words, preferably across all subreddits or at least across a subset of maybe 100 mid-size subreddits.

I previously used the Pushshift API via praw to download the data, but given the vast volume it's pretty slow. This is why I thought processing the dumps would be better.

Turns out that this is challenging too, though. I have an M1 MacBook and could put everything on a large external disk, but from what I have seen so far (I converted 2021-06 as a pilot), putting everything into an SQL database using your tool would create a huge database, and processing would take ages.

I guess another option is to do all of this on a VM (e.g. Google Cloud) to use lots of CPUs, but then I'd need a giant disk and would have to upload everything. Do you happen to have any advice?

Maybe what you want can be achieved with flags rather than additional filters?

What do you mean by flags?

I'm also open to pull requests that add regex filters on title and text bodies.

I know zero Rust, unfortunately, so I can't help there :/

Paul-E (Owner) commented Dec 7, 2021

--unsafe-mode will greatly speed up database creation by disabling a lot of the data integrity features of SQLite. The drawback is that interrupting the import will almost certainly result in a corrupt database file. This usually isn't an issue if you are starting from scratch anyway; you just start over.
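
For context, "unsafe" here means relaxing SQLite's durability settings, roughly the kind of pragmas shown below (illustrative only, not necessarily the exact set the flag applies):

PRAGMA journal_mode = OFF;  -- no rollback journal, so an interrupted import can corrupt the file
PRAGMA synchronous = OFF;   -- don't wait for writes to reach the disk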

--disable-fts skips the FTS table creation and population during import. This also speeds up the import process and results in a smaller file. The drawback here is that without FTS you can't quickly run queries like SELECT * FROM comment_fts WHERE body MATCH 'snoo'. Searches for submissions or comments containing a given word will require a full table scan every time. With that said, you can create your own FTS tables by hand after the import completes.
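
If you go that route, a hand-built FTS5 table would look roughly like this (a sketch that assumes the comment table's integer id column is its rowid; adjust to the actual schema):

CREATE VIRTUAL TABLE comment_fts USING fts5(body, content='comment', content_rowid='id');
INSERT INTO comment_fts(rowid, body) SELECT id, body FROM comment;  -- backfill from the existing comment table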

Combining both flags, I have been able to create an SQLite database in < 36 hours. The final size of the database is between 3.5 and 4 TB.

wuqui (Author) commented Dec 7, 2021

Thanks for this information!

--unsafe-mode will greatly speed up database creation by disabling a lot of the data integrity features of SQLite. The drawback is that interrupting the import will almost certainly result in a corrupt database file. This usually isn't an issue if you are starting from scratch anyway; you just start over.

Cool, I'll try.

Also, does this suggest that in 'safe mode' one can just re-run the DB creation command if interrupted, and it would more or less pick up from where it left off?

--disable-fts skips the FTS table creation and population during import. This also speeds up the import process and results in a smaller file. The drawback here is that without FTS you can't quickly run queries like SELECT * FROM comment_fts WHERE body MATCH 'snoo'. Searches for submissions or comments containing a given word will require a full table scan every time. With that said, you can create your own FTS tables by hand after the import completes.

So this means that without FTS, queries would be much slower, right? Still quick enough to be manageable?

The thing is that the columns that are part of the FTS table don't contain all the information I need (e.g. created_utc). Is there a way to quickly join the non-FTS columns' info with the results? If not, the current FTS table wouldn't be very useful for me anyway.

Combining both flags, I have been able to create an SQLite database in < 36 hours. The final size of the database is between 3.5 and 4 TB.

An SQLite database of the complete dumps? Wow, that's impressive. That would certainly be feasible both in terms of processing time and disk usage. Were you using a "normal" laptop/desktop or was this using 100 CPUs and 1 TB RAM on the cloud? 😉 So would you recommend I do this on my local machine using a big external hard disk rather than on a cloud VM?

Finally, you don't plan on sharing the resulting DB, do you? There has been some interest in this on the Pushshift subreddit. It's not only me, I think :)

Paul-E (Owner) commented Dec 8, 2021

Also, does this suggest that in 'safe mode' one can just re-run the DB creation command if interrupted, and it would more or less pick up from where it left off?

The importer will try to ingest all of the files in the input directory, but if the contents are already in the DB, no changes will be made.

So this means that without FTS, queries would be much slower, right? Still quick enough to be manageable?

Much slower. I couldn't say how long a text search would take; I haven't tried it yet. If you limit by some field such as author or subreddit, you can probably make significant performance improvements by adding an index on that field. For example, I created an index on subreddit in my DB (which took some time) and ran this query:

sqlite> .timer on
sqlite> SELECT COUNT(*) FROM comment WHERE subreddit = 'pushshift' AND body LIKE '%cool%';
75
Run Time: real 110.155 user 0.126618 sys 0.588360

So, a text search with an index and without FTS took about 2 minutes on a small subreddit. This would probably be much faster on an SSD, but I don't have one large enough.
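
For reference, creating that index is a one-liner (the index name here is arbitrary):

CREATE INDEX comment_subreddit_idx ON comment (subreddit);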

The thing is that the columns that are part of the FTS table don't contain all the information I need (e.g. created_utc). Is there a way to quickly join the non-FTS columns' info with the results? If not, the current FTS table wouldn't be very useful for me anyway.

As you suggest, you can use a join, e.g. for comments:

SELECT comment.created_utc
FROM comment_fts
JOIN comment ON comment_fts.rowid = comment.id
WHERE comment_fts.body MATCH 'foo';
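
If you also want to filter by date, you can add a range on created_utc to the same join (assuming created_utc is stored as epoch seconds, as in the Pushshift dumps; the timestamps below are just an example range):

SELECT comment.created_utc, comment.body
FROM comment_fts
JOIN comment ON comment_fts.rowid = comment.id
WHERE comment_fts.body MATCH 'foo'
AND comment.created_utc BETWEEN 1609459200 AND 1612137599;  -- January 2021 (UTC)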

An SQLite database of the complete dumps? Wow, that's impressive. That would certainly be feasible both in terms of processing time and disk usage. Were you using a "normal" laptop/desktop or was this using 100 CPUs and 1 TB RAM on the cloud? So would you recommend I do this on my local machine using a big external hard disk rather than on a cloud VM?

This was done on a desktop computer with an AMD Ryzen 7 2700X CPU with 64 GB of memory on a spinning disk drive. An import should be possible on an M1 given enough disk space. A VM in the cloud could work as well, though you might want to be mindful of egress costs if you want to export that DB.

Finally, you don't plan on sharing the resulting DB, do you? There has been some interest in this on the Pushshift subreddit. It's not only me, I think :)

I'll consider it. I'm not sure if my ISP would appreciate that kind of upload usage.

wuqui (Author) commented Dec 9, 2021

Thanks a lot, that's very helpful!

So this means that without FTS, queries would be much slower, right? Still quick enough to be manageable?

Much slower. I couldn't say how long a text search would take; I haven't tried it yet. If you limit by some field such as author or subreddit, you can probably make significant performance improvements by adding an index on that field. For example, I created an index on subreddit in my DB (which took some time) and ran this query:

sqlite> .timer on
sqlite> SELECT COUNT(*) FROM comment WHERE subreddit = 'pushshift' AND body LIKE '%cool%';
75
Run Time: real 110.155 user 0.126618 sys 0.588360

So, a text search with an index and without FTS took about 2 minutes on a small subreddit. This would probably be much faster on an SSD, but I don't have one large enough.

Good to know, thanks!

The thing is that the columns that are part of the FTS table don't contain all the information I need (e.g. created_utc). Is there a way to quickly join the non-FTS columns' info with the results? If not, the current FTS table wouldn't be very useful for me anyway.

As you suggest, you can use a join, e.g. for comments:

SELECT comment.created_utc
FROM comment_fts
JOIN comment ON comment_fts.rowid = comment.id
WHERE comment_fts.body MATCH 'foo';

I didn't know I could use rowid for the join. Thanks, that's brilliant, and it's really fast.

An SQLite database of the complete dumps? Wow, that's impressive. That would certainly be feasible both in terms of processing time and disk usage. Were you using a "normal" laptop/desktop or was this using 100 CPUs and 1 TB RAM on the cloud? So would you recommend I do this on my local machine using a big external hard disk rather than on a cloud VM?

This was done on a desktop computer with an AMD Ryzen 7 2700X CPU with 64 GB of memory on a spinning disk drive. An import should be possible on an M1 given enough disk space. A VM in the cloud could work as well, though you might want to be mindful of egress costs if you want to export that DB.

Again, very good to know, thanks. I wouldn't have enough disk space on my internal SSD, so I would have to use an external hard disk. I hope that's still feasible.

Finally, you don't plan on sharing the resulting DB, do you? There has been some interest in this on the Pushshift subreddit. It's not only me, I think :)

I'll consider it. I'm not sure if my ISP would appreciate that kind of upload usage.

Oh yeah, I see. It just seems like a big waste of time and energy that so many people parse the dumps from scratch. I think the suggestion on the Pushshift subreddit was to share a DB version of the dumps via P2P -- a torrent or something like that.

ruanchaves commented

I have just written code to execute this complete pipeline on Google Colab (download the dumps, extract them into a database, filter by keyword, and then download the results as CSV files); see here: reddit_keywords.

I had to treat Pushshift-Importer as a black box in my code, but it would be really nice to have regex filters as built-in features.

By the way, queries with a MATCH operator like SELECT * FROM comment_fts WHERE body MATCH 'snoo' did not work for me. I had to write my queries with LIKE.
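
For example, a plain pattern match on the comment table itself did the job (a simple sketch; '%snoo%' is just the keyword pattern):

SELECT * FROM comment WHERE body LIKE '%snoo%';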

Paul-E (Owner) commented Apr 9, 2022

By the way, queries with a MATCH operator like SELECT * FROM comment_fts WHERE body MATCH 'snoo' did not work for me. I had to write my queries with LIKE.

I had to disable building FTS by default; without filters it can take a very long time to build the database. FTS can be enabled with --enable-fts.

I had to treat Pushshift-Importer as a black box in my code, but it would be really nice to have regex filters as built-in features.

It might be some time before I can implement this. I'm happy to review PRs.
