virtualenv venv
. venv/bin/activate
pip install -U -r requirements.txt
scrapy crawl opensecrets -s OUTPUT_URI="output/"
scrapy crawl opensecrets -s OUTPUT_URI="s3://bucket/output/" -s AWS_ACCESS_KEY_ID="YOURACCESSKEY" -s AWS_SECRET_ACCESS_KEY="YOURSECRETKEY"
You can use the command-line argument -s CONCURRENT_REQUESTS=x
to set the maximum number of requests the downloader performs concurrently:
scrapy crawl opensecrets -s OUTPUT_URI="output/" -s CONCURRENT_REQUESTS=8
or, with AWS S3 as the storage backend:
scrapy crawl opensecrets -s OUTPUT_URI="s3://bucket/output/" -s AWS_ACCESS_KEY_ID="YOURACCESSKEY" -s AWS_SECRET_ACCESS_KEY="YOURSECRETKEY" -s CONCURRENT_REQUESTS=8
You can use the command-line argument -s DOWNLOAD_DELAY=x
to set the amount of time (in seconds) that the downloader
should wait before downloading consecutive pages from the same website:
scrapy crawl opensecrets -s OUTPUT_URI="output/" -s DOWNLOAD_DELAY=1
or, with AWS S3 as the storage backend:
scrapy crawl opensecrets -s OUTPUT_URI="s3://bucket/output/" -s AWS_ACCESS_KEY_ID="YOURACCESSKEY" -s AWS_SECRET_ACCESS_KEY="YOURSECRETKEY" -s DOWNLOAD_DELAY=1
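The two throttling settings can also be passed in the same invocation. The sketch below wraps that in a small shell function. The function name `crawl_opensecrets` and the `SCRAPY_CMD` variable are hypothetical helpers introduced here for illustration and are not part of the project. The defaults of 8 and 1 are just the example values used above.

```shell
# Hypothetical wrapper (not part of the project): run the crawl with both
# CONCURRENT_REQUESTS and DOWNLOAD_DELAY set in a single command.
crawl_opensecrets() {
    # First argument: where to write the output (local path or s3:// URI).
    output_uri="${1:-output/}"

    # SCRAPY_CMD is an assumption of this sketch: set it to "echo scrapy"
    # to print the command instead of running it. It is left unquoted on
    # purpose so that "echo scrapy" splits into two words.
    ${SCRAPY_CMD:-scrapy} crawl opensecrets \
        -s OUTPUT_URI="$output_uri" \
        -s CONCURRENT_REQUESTS="${CONCURRENT_REQUESTS:-8}" \
        -s DOWNLOAD_DELAY="${DOWNLOAD_DELAY:-1}"
}
```

Calling `crawl_opensecrets "output/"` then runs the crawl with both settings applied. Exporting `CONCURRENT_REQUESTS` or `DOWNLOAD_DELAY` beforehand overrides the example defaults.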
Instead of passing the AWS credentials on the command line, you can store them in one of these files:
Create a ~/.boto file with these contents:
[Credentials]
aws_access_key_id = YOURACCESSKEY
aws_secret_access_key = YOURSECRETKEY
Create a ~/.aws/credentials file with these contents:
[default]
aws_access_key_id = YOURACCESSKEY
aws_secret_access_key = YOURSECRETKEY
For more information, see the boto configuration tutorial: http://boto.cloudhackers.com/en/latest/boto_config_tut.html
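As a rough sketch, the ~/.aws/credentials file described above could be created from the shell as follows. The `AWS_CRED_DIR` variable is a hypothetical override added for this example only. The restrictive permissions and the "do not overwrite" check are precautions assumed here, not something the project requires.

```shell
# Sketch: create the credentials file so that only the owner can read it.
# AWS_CRED_DIR is a hypothetical override for this example; by default the
# file goes to ~/.aws/credentials.
cred_dir="${AWS_CRED_DIR:-$HOME/.aws}"
cred_file="$cred_dir/credentials"

mkdir -p "$cred_dir"
if [ -e "$cred_file" ]; then
    # Never clobber credentials that are already in place.
    echo "$cred_file already exists; leaving it untouched" >&2
else
    # Run in a subshell so the stricter umask applies only to this file.
    (
        umask 077
        cat > "$cred_file" <<'EOF'
[default]
aws_access_key_id = YOURACCESSKEY
aws_secret_access_key = YOURSECRETKEY
EOF
    )
fi
```

Replace the two placeholder values with your own keys before running a crawl against S3.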