Add page limit option #32

Open: wants to merge 1 commit into base: master

Conversation

milanvarady

I've noticed that when downloading more than 30 images or so, the downloader sometimes can't find any more and keeps indexing pages without success. To counter this, I added a page_limit option that caps the number of pages it indexes. I also updated the README to document the option and added print statements that show whether a run stopped because of the download limit or the page limit.
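
For readers following along, here is a minimal sketch of the stopping behaviour described above. It is not the code from this pull request: `collect_links`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fake backend simply runs out of results after three pages to mimic the stall.

```python
from typing import Callable, List, Optional, Tuple


def collect_links(
    fetch_page: Callable[[int], List[str]],
    limit: int,
    page_limit: Optional[int] = None,
) -> Tuple[List[str], str]:
    """Collect up to `limit` unique links, indexing at most `page_limit` pages.

    Returns the links plus the reason the loop stopped.
    page_limit=None keeps the old behaviour (no cap on pages).
    """
    links: List[str] = []
    seen = set()
    page = 0
    while len(links) < limit:
        # Stop indexing once the page budget is used up.
        if page_limit is not None and page >= page_limit:
            return links, "page limit reached"
        for link in fetch_page(page):
            if link not in seen and len(links) < limit:
                seen.add(link)
                links.append(link)
        page += 1
    return links, "download limit reached"


# Hypothetical stand-in for a search backend that runs dry after three pages.
def fake_fetch(page: int) -> List[str]:
    return [f"img-{page}-{i}" for i in range(10)] if page < 3 else []


links, reason = collect_links(fake_fetch, limit=50, page_limit=8)
print(len(links), reason)
```

Without the `page_limit` check, the loop above would never end for this backend, which is the stall the option is meant to avoid.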

@milanvarady
Author

milanvarady commented Jul 4, 2022

This also fixes issue #3.

@ghost

ghost commented Aug 23, 2022

So I have a question for you. I made a fork of this and have been adding improvements to it, and I'm debating whether to add this in. I've been doing large downloads like the ones you mentioned (50+ images), but since I feed it a list of queries, I've just let it run overnight, and eventually it has either found enough images or hit an "out of links" error. Is this not your experience? If it isn't, would you mind sharing the query that stalls it out?

@milanvarady
Author

I found that it stalls every time; the query doesn't really matter. You mentioned that you let it run overnight. That works if you want to do a single query and have time, but I found that you get the best results by running multiple queries with variations. For instance, to build a rabbit dataset I would run the program multiple times with queries like these: rabbit, domestic rabbit, bunny, white rabbit, black rabbit, baby rabbit, European rabbit, etc. With this method, it is key that each run finishes in a reasonable amount of time.
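
As a rough illustration of that workflow, the sketch below runs one bounded download per query variation. The `download` callable and its `(query, limit, page_limit)` signature are assumptions made for the example, not the project's actual API, and `fake_download` only pretends to fetch images.

```python
from typing import Callable, Dict, List


def run_variations(
    queries: List[str],
    download: Callable[[str, int, int], int],
    limit: int = 30,
    page_limit: int = 5,
) -> Dict[str, int]:
    """Run `download` once per query and record how many images each produced."""
    results: Dict[str, int] = {}
    for query in queries:
        results[query] = download(query, limit, page_limit)
    return results


# Hypothetical stand-in: each indexed page is assumed to yield four images.
def fake_download(query: str, limit: int, page_limit: int) -> int:
    return min(limit, page_limit * 4)


rabbit_queries = [
    "rabbit", "domestic rabbit", "bunny", "white rabbit",
    "black rabbit", "baby rabbit", "European rabbit",
]
counts = run_variations(rabbit_queries, fake_download)
print(sum(counts.values()), "images across", len(counts), "queries")
```

Because every run is capped by `page_limit`, the total time grows roughly with the number of variations rather than depending on whether any single query stalls.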

@ghost

ghost commented Aug 23, 2022

Yeah, that's not an issue for me: when using the downloader I read in a CSV of queries and then iterate over the list, running a download command for each one. So for me it's just a matter of putting in a list of 300 or so queries, setting it to 50 to 100 images per query, and it finishes when it finishes.
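
A batch setup like this one might look roughly like the sketch below. The one-query-per-row CSV layout, the helper names, and the `download(query, limit)` callable are all assumptions for illustration; the in-memory `sample` stands in for a real file.

```python
import csv
import io
from typing import Callable, Iterable, List


def read_queries(lines: Iterable[str]) -> List[str]:
    """Return the first column of each non-empty CSV row, stripped."""
    return [row[0].strip() for row in csv.reader(lines) if row and row[0].strip()]


def download_all(
    queries: List[str],
    download: Callable[[str, int], None],
    limit: int = 50,
) -> None:
    """Issue one download call per query, in order."""
    for query in queries:
        download(query, limit)


# In-memory stand-in for a CSV file with one query per row.
sample = io.StringIO("rabbit\ndomestic rabbit\n\nbunny\n")
queries = read_queries(sample)

# Record the calls instead of downloading anything.
calls = []
download_all(queries, lambda q, n: calls.append((q, n)), limit=50)
print(calls)
```

With a real file, `read_queries(open("queries.csv", newline=""))` would take the place of the `io.StringIO` sample, where `queries.csv` is a hypothetical path.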

@milanvarady
Author

I mean, ultimately I can make some changes and make this optional, so anyone for whom time matters can turn it on and everyone else can leave it off.

@ghost

ghost commented Aug 24, 2022

Don't worry about making the edits; I wasn't trying to get you, or anyone, to do any more "work". I was just trying to better understand what you were saying, as I'm fairly new to this. I think I may put it in as an optional parameter, as you just said. I just wanted to understand what you were seeing first.
