persistStorage and AWS #2464
-
I have a question. I run a scraper on AWS using Batch jobs, and I want to use `persistStorage` so that the crawler keeps its state between runs. However, Batch creates a new container for each run, so the state is lost. I'm sure I could use an S3 bucket for this, but I have no idea how to combine that with `persistStorage`. Any ideas on how to accomplish this, or another way to save the crawler state between runs?
-
As of now, there is no native S3 storage adapter in Crawlee. Off the top of my head, you could use EFS and the default `MemoryStorage`, which is backed by a local directory. Or if you insist on using S3, you could pull an S3 bucket right before calling `crawler.run()` to your `storage` directory and upload it again after finishing - surely there is some npm package that can do that 🙂.
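
For illustration, here is a minimal sketch of that pull/run/push flow, assuming the AWS SDK v3 (`@aws-sdk/client-s3`), the default `./storage` directory, and a `CheerioCrawler`. The bucket name and key prefix are placeholders, not anything Crawlee provides:

```typescript
// Sketch: sync Crawlee's local storage directory with an S3 bucket
// around a crawler run. Run as an ES module (uses top-level await).
import { createWriteStream, mkdirSync, readdirSync, readFileSync, statSync } from 'node:fs';
import { dirname, join, relative } from 'node:path';
import { pipeline } from 'node:stream/promises';
import { Readable } from 'node:stream';
import {
    GetObjectCommand,
    ListObjectsV2Command,
    PutObjectCommand,
    S3Client,
} from '@aws-sdk/client-s3';
import { CheerioCrawler } from 'crawlee';

const s3 = new S3Client({});
const BUCKET = 'my-crawler-state'; // placeholder bucket name
const PREFIX = 'storage/';         // placeholder key prefix inside the bucket
const LOCAL_DIR = './storage';     // Crawlee's default storage directory

// Download every object under PREFIX into LOCAL_DIR before the run.
async function pullStorage(): Promise<void> {
    let token: string | undefined;
    do {
        const page = await s3.send(new ListObjectsV2Command({
            Bucket: BUCKET, Prefix: PREFIX, ContinuationToken: token,
        }));
        for (const obj of page.Contents ?? []) {
            if (!obj.Key || obj.Key.endsWith('/')) continue;
            const target = join(LOCAL_DIR, relative(PREFIX, obj.Key));
            mkdirSync(dirname(target), { recursive: true });
            const { Body } = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: obj.Key }));
            await pipeline(Body as Readable, createWriteStream(target));
        }
        token = page.IsTruncated ? page.NextContinuationToken : undefined;
    } while (token);
}

// Upload the whole LOCAL_DIR back to S3 after the run.
async function pushStorage(dir = LOCAL_DIR): Promise<void> {
    for (const entry of readdirSync(dir)) {
        const path = join(dir, entry);
        if (statSync(path).isDirectory()) {
            await pushStorage(path);
        } else {
            const key = PREFIX + relative(LOCAL_DIR, path).split('\\').join('/');
            await s3.send(new PutObjectCommand({
                Bucket: BUCKET, Key: key, Body: readFileSync(path),
            }));
        }
    }
}

await pullStorage();
const crawler = new CheerioCrawler({
    requestHandler: async ({ request, log }) => log.info(request.url),
});
await crawler.run(['https://example.com']);
await pushStorage();
```

In production you would probably shell out to `aws s3 sync` or use a dedicated sync package rather than hand-rolling the transfer, and you would also want to upload the storage when the job is interrupted (e.g. from a `SIGTERM` handler), otherwise the state of a failed run is lost.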
-
Thanks for the suggestions. I was considering EFS but was curious if something was available for S3.