I am new to Crawlee. I have added 10,000 URLs to the RequestQueue. After the crawler finished about 5,000 URLs, my computer restarted and the crawler stopped. How can I continue crawling the remaining URLs instead of retrying all of them? Yes, I know I can do this with third-party components such as Redis, just like I do in Scrapy or other crawler frameworks. My question is: can I do this job with Crawlee itself?
Replies: 2 comments 3 replies
The crawler state should still be there (in the storage folder) and will be respected the next time you run the crawler. All you need to do is disable the auto-purging of the state, e.g. via the CRAWLEE_PURGE_ON_START env var: CRAWLEE_PURGE_ON_START=0 npm start.
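For what it's worth, here is a minimal sketch of the same idea done in code rather than via the env var. The CheerioCrawler and the handler are just placeholders; purgeOnStart is the Configuration option that CRAWLEE_PURGE_ON_START maps to.

```ts
import { CheerioCrawler, Configuration } from 'crawlee';

const crawler = new CheerioCrawler(
    {
        async requestHandler({ request, $ }) {
            console.log(`Processed ${request.url}: ${$('title').text()}`);
        },
    },
    // purgeOnStart: false keeps everything under ./storage (including the
    // request queue) between runs instead of wiping it on startup.
    new Configuration({ purgeOnStart: false }),
);

// On the first run, pass the start URLs to run(); on later runs call run()
// with no arguments and the persisted queue continues where it left off,
// skipping requests that were already handled.
await crawler.run();
```

Either way works; the env var is handy because it requires no code changes.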
If you want it to retry requests in the request queue that errored out, the way I've found to do it is to parse the request_queues/default/*.json files, archive (rename) the ones with recorded failures, edit some attributes, and then start up the crawl without passing in any new requests (the queue is resumed). My goal was to run until failures occurred, then resume where it failed and give those requests more attempts after changing something. This assumes a request has failed the full _MAX_RETRIES quantity; if it fails fewer times than that and you interrupt/terminate the crawl, those requests will not be detected for resumption. There may be another, better way, hopefully.
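Here is a rough sketch of that approach, under the assumption that each queue entry is a JSON file containing fields like retryCount, errorMessages, and handledAt. Those field names are guesses about the on-disk format and may differ between Crawlee versions, so inspect one of the files first and adjust.

```ts
import { readdir, readFile, writeFile, copyFile } from 'node:fs/promises';
import { join } from 'node:path';

const QUEUE_DIR = './storage/request_queues/default';
const MAX_RETRIES = 3; // match the crawler's maxRequestRetries setting

for (const file of await readdir(QUEUE_DIR)) {
    if (!file.endsWith('.json')) continue;

    const path = join(QUEUE_DIR, file);
    const entry = JSON.parse(await readFile(path, 'utf8'));

    // Treat a request as "failed" once it has exhausted its retries
    // (assumed fields: retryCount, errorMessages).
    const failed =
        (entry.retryCount ?? 0) >= MAX_RETRIES &&
        Array.isArray(entry.errorMessages) &&
        entry.errorMessages.length > 0;
    if (!failed) continue;

    // Archive a copy of the failed entry...
    await copyFile(path, `${path}.failed.bak`);

    // ...then reset the failure-related attributes so the resumed crawl
    // considers this request unhandled again.
    entry.retryCount = 0;
    entry.errorMessages = [];
    delete entry.handledAt;
    await writeFile(path, JSON.stringify(entry, null, 2));
}
```

Run a script like this between crawls, then start the crawler again without passing in any new requests so the existing queue is picked up.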