persistStorage and AWS #2464
-
I have a question. I run a scraper on AWS using Batch jobs, and I want to use `persistStorage` so that the crawler keeps its state between runs. However, Batch creates a new container for each run, so the state is lost. I'm sure I could use an S3 bucket for this, but I have no idea how to combine that with `persistStorage`. Any ideas on how to accomplish this, or another way to save the crawler state between runs?
-
As of now, there is no native S3 storage adapter in Crawlee. Off the top of my head, you could use EFS and the default `MemoryStorage`, which is backed by a local directory. Or if you insist on using S3, you could pull an S3 bucket right before calling `crawler.run()` to your `storage` directory and upload it again after finishing - surely there is some npm package that can do that 🙂.
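
For illustration, here is a minimal sketch of that pull/run/push flow, assuming the AWS SDK v3 (`@aws-sdk/client-s3`), the default `./storage` directory, and a `CheerioCrawler`. The bucket name and key prefix are placeholders, not anything Crawlee provides:

```typescript
// Sketch: sync Crawlee's local storage directory with an S3 bucket
// around a crawler run. Run as an ES module (uses top-level await).
import { createWriteStream, mkdirSync, readdirSync, readFileSync, statSync } from 'node:fs';
import { dirname, join, relative } from 'node:path';
import { pipeline } from 'node:stream/promises';
import { Readable } from 'node:stream';
import {
    GetObjectCommand,
    ListObjectsV2Command,
    PutObjectCommand,
    S3Client,
} from '@aws-sdk/client-s3';
import { CheerioCrawler } from 'crawlee';

const s3 = new S3Client({});
const BUCKET = 'my-crawler-state'; // placeholder bucket name
const PREFIX = 'storage/';         // placeholder key prefix inside the bucket
const LOCAL_DIR = './storage';     // Crawlee's default storage directory

// Download every object under PREFIX into LOCAL_DIR before the run.
async function pullStorage(): Promise<void> {
    let token: string | undefined;
    do {
        const page = await s3.send(new ListObjectsV2Command({
            Bucket: BUCKET, Prefix: PREFIX, ContinuationToken: token,
        }));
        for (const obj of page.Contents ?? []) {
            if (!obj.Key || obj.Key.endsWith('/')) continue;
            const target = join(LOCAL_DIR, relative(PREFIX, obj.Key));
            mkdirSync(dirname(target), { recursive: true });
            const { Body } = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: obj.Key }));
            await pipeline(Body as Readable, createWriteStream(target));
        }
        token = page.IsTruncated ? page.NextContinuationToken : undefined;
    } while (token);
}

// Upload the whole LOCAL_DIR back to S3 after the run.
async function pushStorage(dir = LOCAL_DIR): Promise<void> {
    for (const entry of readdirSync(dir)) {
        const path = join(dir, entry);
        if (statSync(path).isDirectory()) {
            await pushStorage(path);
        } else {
            const key = PREFIX + relative(LOCAL_DIR, path).split('\\').join('/');
            await s3.send(new PutObjectCommand({
                Bucket: BUCKET, Key: key, Body: readFileSync(path),
            }));
        }
    }
}

await pullStorage();
const crawler = new CheerioCrawler({
    requestHandler: async ({ request, log }) => log.info(request.url),
});
await crawler.run(['https://example.com']);
await pushStorage();
```

In production you would probably shell out to `aws s3 sync` or use a dedicated sync package rather than hand-rolling the transfer, and you would also want to upload the storage when the job is interrupted (e.g. from a `SIGTERM` handler), otherwise the state of a failed run is lost.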
-
Thanks for the suggestions. I was considering EFS but was curious if something was available for S3.