Scrape X-rxiv via API #33
Hi there. Thanks for your work on this project. As a temporary solution, I've saved the DBs in a requester-pays S3 bucket. To download the JSONL files, use: `aws s3 cp s3://astrowafflerp/biorxiv.jsonl biorxiv.jsonl --request-payer requester` (see https://docs.aws.amazon.com/AmazonS3/latest/userguide/ObjectsinRequesterPaysBuckets.html). I've got a cron job that runs daily, so they should be current, but let me know if you have any trouble. Here's the maintainer script: https://github.com/AstroWaffleRobot/getlit
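For programmatic access, here is a minimal boto3 sketch of the same download. The bucket name and the `biorxiv.jsonl` key come from the command above; the `medrxiv.jsonl` and `chemrxiv.jsonl` keys are an assumption extrapolated from it, not confirmed:

```python
# Minimal sketch: fetch the requester-pays dumps with boto3.
# Assumption: one JSONL per archive, named like the biorxiv example
# above (medrxiv.jsonl / chemrxiv.jsonl are guesses).
import boto3

s3 = boto3.client("s3")
for key in ["biorxiv.jsonl", "medrxiv.jsonl", "chemrxiv.jsonl"]:
    s3.download_file(
        Bucket="astrowafflerp",
        Key=key,
        Filename=key,
        # Requester-pays: you cover the transfer cost, not the bucket owner.
        ExtraArgs={"RequestPayer": "requester"},
    )
```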
Hi @AstroWaffleRobot,
Long-term I want to create a lightweight API that I can deploy on my own VM to serve these requests. On the VM, a daily cron job would update the data, and the API would run the package itself in its current mode, where the data is assumed to be locally available. That way there's dual usage: users could either use the package out of the box, without the slow download of the dumps, or do it the old (current) way by downloading the dumps first.
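A rough sketch of what such a service could look like, assuming FastAPI, a dump directory written by the cron job, and title-substring search. All names here are illustrative, not part of the package:

```python
# Sketch of the proposed lightweight API (illustrative only).
# Assumption: the daily cron job keeps {server}.jsonl files fresh on disk.
import json
from pathlib import Path

from fastapi import FastAPI, HTTPException

app = FastAPI()
DUMP_DIR = Path("/data/dumps")  # hypothetical location the cron job writes to

@app.get("/search/{server}")
def search(server: str, q: str):
    """Return records from the local dump whose title contains the query."""
    dump = DUMP_DIR / f"{server}.jsonl"  # e.g. biorxiv.jsonl
    if not dump.exists():
        raise HTTPException(status_code=404, detail=f"unknown server {server!r}")
    hits = []
    with dump.open() as fh:
        for line in fh:
            record = json.loads(line)
            if q.lower() in record.get("title", "").lower():
                hits.append(record)
    return {"server": server, "query": q, "results": hits}
```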
I guess no more of that bucket?
FWIW -- I wanted to check the sizes; I could probably have picked up serving those from https://datasets.datalad.org/ or some other S3 bucket.
Currently, bio/med/chemrxiv scraping requires the user to first download the entire DB and store it locally.
Ideally, these dumps should be stored on a server and updated regularly (cron job). Users would just send requests to the server API. That would be the new default behaviour, but local download should still be supported too (see the client-side sketch below).
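One way the dual usage could look from the client side: query the hosted API by default and fall back to a locally downloaded dump. The endpoint URL, file path, and function name are all hypothetical; nothing here exists in the package yet:

```python
# Sketch of the proposed dual usage (all URLs/paths are hypothetical).
# Default: ask the server, which works against its fresh dump.
# Fallback: the old (current) behaviour of scanning a local download.
import json
from pathlib import Path

import requests

API_URL = "https://example.org/search/biorxiv"  # hypothetical server endpoint
LOCAL_DUMP = Path("biorxiv.jsonl")              # dump downloaded the old way

def search_biorxiv(query: str) -> list[dict]:
    try:
        resp = requests.get(API_URL, params={"q": query}, timeout=10)
        resp.raise_for_status()
        return resp.json()["results"]
    except requests.RequestException:
        # Server unreachable: scan the locally downloaded dump instead.
        hits = []
        with LOCAL_DUMP.open() as fh:
            for line in fh:
                record = json.loads(line)
                if query.lower() in record.get("title", "").lower():
                    hits.append(record)
        return hits
```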