I am new to Crawlee. I have added 10,000 URLs to the RequestQueue. After the crawler finished about 5,000 URLs, my computer restarted and the crawler stopped. How can I continue crawling the remaining URLs instead of retrying all of them? Yes, I know I can do this with third-party components such as Redis, just like I do in Scrapy or other crawler frameworks. My question is: can I do this job with Crawlee itself?
Replies: 2 comments 3 replies
The crawler state should still be there (in the storage folder) and will be respected the next time you run the crawler. All you need to do is disable the auto-purging of the state, e.g. via the CRAWLEE_PURGE_ON_START env var: CRAWLEE_PURGE_ON_START=0 npm start.
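For what it's worth, here is a minimal sketch of the same idea done in code rather than via the env var. The CheerioCrawler and the handler are just placeholders; purgeOnStart is the Configuration option that CRAWLEE_PURGE_ON_START maps to.

```ts
import { CheerioCrawler, Configuration } from 'crawlee';

const crawler = new CheerioCrawler(
    {
        async requestHandler({ request, $ }) {
            console.log(`Processed ${request.url}: ${$('title').text()}`);
        },
    },
    // purgeOnStart: false keeps everything under ./storage (including the
    // request queue) between runs instead of wiping it on startup.
    new Configuration({ purgeOnStart: false }),
);

// On the first run, pass the start URLs to run(); on later runs call run()
// with no arguments and the persisted queue continues where it left off,
// skipping requests that were already handled.
await crawler.run();
```

Either way works; the env var is handy because it requires no code changes.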
If you want it to retry requests in the request queue that errored out, the way I've found to do it is to parse the request_queues/default/*.json files, archive (rename) the ones with recorded failures, edit some attributes, and then start up the crawl without passing in any new requests (the queue is resumed). My goal was to run until failures occurred, then resume where it failed and give those requests more attempts after changing something. This assumes a request has failed the full _MAX_RETRIES quantity; if it fails fewer times than that and you interrupt/terminate the crawl, those requests will not be detected for resumption. There may be another, better way, hopefully.
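Here is a rough sketch of that approach, under the assumption that each queue entry is a JSON file containing fields like retryCount, errorMessages, and handledAt. Those field names are guesses about the on-disk format and may differ between Crawlee versions, so inspect one of the files first and adjust.

```ts
import { readdir, readFile, writeFile, copyFile } from 'node:fs/promises';
import { join } from 'node:path';

const QUEUE_DIR = './storage/request_queues/default';
const MAX_RETRIES = 3; // match the crawler's maxRequestRetries setting

for (const file of await readdir(QUEUE_DIR)) {
    if (!file.endsWith('.json')) continue;

    const path = join(QUEUE_DIR, file);
    const entry = JSON.parse(await readFile(path, 'utf8'));

    // Treat a request as "failed" once it has exhausted its retries
    // (assumed fields: retryCount, errorMessages).
    const failed =
        (entry.retryCount ?? 0) >= MAX_RETRIES &&
        Array.isArray(entry.errorMessages) &&
        entry.errorMessages.length > 0;
    if (!failed) continue;

    // Archive a copy of the failed entry...
    await copyFile(path, `${path}.failed.bak`);

    // ...then reset the failure-related attributes so the resumed crawl
    // considers this request unhandled again.
    entry.retryCount = 0;
    entry.errorMessages = [];
    delete entry.handledAt;
    await writeFile(path, JSON.stringify(entry, null, 2));
}
```

Run a script like this between crawls, then start the crawler again without passing in any new requests so the existing queue is picked up.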