403 when locally hosted cc-index-server tries to connect to s3://commoncrawl/ #11
Comments
Could you try the branch pywb2?
Apologies, I had hoped to finalize the version based on PyWB2 but haven't had the time yet. There's also some work to do:
I discovered that it works with long-term user creds but not short-term ones. When I got it working, I realized it still didn't pull back the actual content from the WARC files for me; it just gave me the same index info I already had working through Athena. So I had to go back and figure out how to extract the gzip data from the S3 WARC files myself. My blocker there was that I was doing: start_byte = int(row[22]). Then I finally found an example somewhere where someone added the - 1 to that equation. After that, I could successfully extract and decompress with gzip. Since I've solved my issue and don't currently have time to troubleshoot further, I'll have to stop with the feedback that the root cause is that the current code doesn't support short-term creds.
Thanks for the feedback. I'll have a look, but it may take some time.
The byte range is inclusive, so the last byte of a record is at offset + length - 1. For bulk look-ups the columnar index is more efficient, see here. A user even wrote a tutorial on how to automate fetching the WARC records using AWS Lambda.
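To make the inclusive range concrete, here is a minimal sketch of the arithmetic, assuming the index gives you a record's offset and length in bytes. The helper names (`range_header`, `apply_range`) are made up for illustration, and `apply_range` only emulates locally what a server does with the header; in a real fetch the header string would go into the ranged GET request against the WARC file instead.

```python
import gzip


def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value for one WARC record.

    HTTP byte ranges are inclusive on both ends, so the last byte
    is offset + length - 1, not offset + length.
    """
    last_byte = offset + length - 1
    return f"bytes={offset}-{last_byte}"


def apply_range(blob: bytes, header: str) -> bytes:
    """Hypothetical stand-in for the server: return the bytes an
    inclusive range selects from blob."""
    first, last = header[len("bytes="):].split("-")
    return blob[int(first):int(last) + 1]


# Two independently gzipped members, concatenated the way records
# are stored back to back in a WARC file.
first_record = gzip.compress(b"first record")
second_record = gzip.compress(b"second record")
warc_like = first_record + second_record

# Offset and length as an index row would report them.
offset, length = len(first_record), len(second_record)

header = range_header(offset, length)
payload = gzip.decompress(apply_range(warc_like, header))
print(header, payload)
```

Each record is its own gzip member, which is why the exact slice can be decompressed on its own without reading the rest of the file.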
In that second link, it looks like we can get event-based triggers from the commoncrawl bucket. Is that the case? If so, that's super helpful. I was assuming we couldn't because it was in their account. However, reading it again, it might just be subscribing to the Ath
Yes, that's also my understanding: the appearance of a query result file (not on s3://commoncrawl/) triggers the download of all referenced WARC records stored on s3://commoncrawl/.
Whenever I do a search on the local cc-index-server I get errors. When I look at the debug logs, it looks like the final authorization is only using the access key ID and the secret, but not the session token.
Is this only designed to work with long-term IAM user creds, or does it support short term creds? If I were to go edit the file building that Authorization, where would I find it? I searched the code globally for Authorization, access_key, and access, excluding the cluster.idx files, and found nothing that matched.
I'd be happy to contribute the fix for supporting short-term creds if you help me find where the fix goes in your code.
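For reference, a rough sketch of what "using the session token" involves. This is not code from cc-index-server. Short-term credentials come as three values, and the token is sent with signed requests in the `X-Amz-Security-Token` header. The environment variable names are the standard AWS ones; the class and helper names are made up for illustration, and the `ASIA` prefix check is only a heuristic for spotting a temporary key ID that lost its token.

```python
import os
from typing import Dict, Mapping, NamedTuple, Optional


class AwsCredentials(NamedTuple):
    access_key_id: str
    secret_access_key: str
    session_token: Optional[str] = None  # only present for short-term creds


def credentials_from_env(env: Mapping[str, str] = os.environ) -> AwsCredentials:
    """Read credentials from the standard AWS environment variables."""
    return AwsCredentials(
        access_key_id=env["AWS_ACCESS_KEY_ID"],
        secret_access_key=env["AWS_SECRET_ACCESS_KEY"],
        session_token=env.get("AWS_SESSION_TOKEN") or None,
    )


def security_token_headers(creds: AwsCredentials) -> Dict[str, str]:
    """Extra header a signed request needs when a session token exists."""
    if creds.session_token:
        return {"X-Amz-Security-Token": creds.session_token}
    return {}


def token_missing(creds: AwsCredentials) -> bool:
    """Heuristic: temporary access key IDs start with 'ASIA', so such a
    key without a session token is likely to be rejected."""
    return creds.access_key_id.startswith("ASIA") and not creds.session_token


# Example with made-up values (not real credentials).
example_env = {
    "AWS_ACCESS_KEY_ID": "ASIAEXAMPLEKEYID",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
    "AWS_SESSION_TOKEN": "example-token",
}
creds = credentials_from_env(example_env)
print(security_token_headers(creds), token_missing(creds))
```

A signer that only ever reads the first two values would produce the behavior described above: the request is signed, but the token never reaches S3.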