RequestList and stateKeyPrefix #2440
-
I have a scraper whose state I want to maintain between runs. My goal is to not reprocess any requests that were already handled on a previous run. So I set the env var `CRAWLEE_PURGE_ON_START` to false.
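A minimal sketch of that setup (assuming the programmatic `purgeOnStart` option of Crawlee's `Configuration` is equivalent to the env var):

```ts
// Keep storages (the RequestQueue in particular) between runs instead of
// purging them on startup. Env-var route: CRAWLEE_PURGE_ON_START=false
// Programmatic route, assumed to mirror the env var:
import { Configuration } from 'crawlee';

Configuration.getGlobalConfig().set('purgeOnStart', false);
```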
I looked for `stateKeyPrefix` but found nothing; how is it used? My other issue is that when I don't purge the storage and maintain state between runs, the source URLs I start with obviously won't be processed again, since they already were. How can I still start with the source URLs, and add more as I currently do with the `enqueueLinks` function, while maintaining the state?
-
I found a way to do this by using the `crypto` module to generate a random UUID and passing in my URLs with a `uniqueKey`.
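A minimal sketch of what that looks like (the `CheerioCrawler` and the example start URL are placeholders for my actual setup):

```ts
import { randomUUID } from 'node:crypto';
import { CheerioCrawler } from 'crawlee';

const startUrls = ['https://example.com'];

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Links enqueued here keep their default uniqueKey, so anything
        // handled on a previous run is still skipped.
        await enqueueLinks();
    },
});

// A fresh uniqueKey per run makes the queue treat each start URL as new,
// even though the storage was not purged.
await crawler.run(
    startUrls.map((url) => ({ url, uniqueKey: randomUUID() })),
);
```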
This works; the question is, is there a better way?
-
If you want to terminate a crawler and resume it, just disabling `CRAWLEE_PURGE_ON_START` should be enough. A `RequestList` is not necessary; the default `RequestQueue` will probably work better.

If you want to process some URLs every time you run the crawler, even if you already processed them before, then giving them a random unique key is a good solution.
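For the resume path, a minimal sketch (again assuming a `CheerioCrawler` and a placeholder start URL):

```ts
import { CheerioCrawler } from 'crawlee';

// With CRAWLEE_PURGE_ON_START=false, the default RequestQueue survives the
// process, so re-running this script resumes where the last run stopped.
const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        await enqueueLinks();
    },
});

// Plain string URLs get a deterministic uniqueKey derived from the URL, so
// start URLs already handled on a previous run are skipped automatically.
await crawler.run(['https://example.com']);
```

Combine this with the random `uniqueKey` mapping above if the start URLs themselves should be reprocessed on every run.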