403 when locally hosted cc-index-server tries to connect to s3://commoncrawl/ #11

Open
davetbo-amzn opened this issue Apr 1, 2023 · 5 comments


@davetbo-amzn

Whenever I do a search on the local cc-index-server I get errors. When I look at the debug logs, it looks like the final authorization is only using the access key ID and the secret, but not the session token.

Is this only designed to work with long-term IAM user creds, or does it support short term creds? If I were to go edit the file building that Authorization, where would I find it? I searched the code globally for Authorization, access_key, and access, excluding the cluster.idx files, and found nothing that matched.

I'd be happy to contribute the fix for supporting short-term creds if you help me find where the fix goes in your code.
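As a rough illustration of what "supporting short-term creds" means on the client side, here is a minimal sketch. It is not the cc-index-server code: the helper name `s3_client_kwargs` is hypothetical, and it assumes the credentials are available in the standard AWS environment variables. The point is only that the session token has to travel with the key ID and secret.

```python
import os


def s3_client_kwargs(env=None):
    """Collect credential keyword arguments for an S3 client.

    Hypothetical helper: reads the standard AWS environment variables and
    includes the session token whenever one is present.
    """
    env = os.environ if env is None else env
    key_id = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    token = env.get("AWS_SESSION_TOKEN")

    kwargs = {}
    if key_id and secret:
        kwargs["aws_access_key_id"] = key_id
        kwargs["aws_secret_access_key"] = secret
        # Short-term (STS) credentials are rejected unless the session
        # token is sent along with the key ID and secret.
        if token:
            kwargs["aws_session_token"] = token
    return kwargs


# Possible usage, assuming boto3 is installed (not imported here):
#   import boto3
#   s3 = boto3.client("s3", **s3_client_kwargs())
```

If the returned dictionary is empty, the client falls back to whatever default credential lookup the library performs.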

@sebastian-nagel

Could you try the branch pywb2?

Apologies, I had hoped to finalize the version based on PyWB2 but haven't had the time yet. There's also some work left to do:

  • need to rebase on a more recent PyWB2 version
  • the lazy instantiation of the S3 client in the S3 loader might need some more improvements: if creating the S3 client or calling get_object fails, the client is created again and again. That is not nice and may cause trouble, because any error-handling logic implemented in the client (e.g., an exponential back-off on "503 Slow Down" responses) cannot work when the state held in the client is lost. In addition, in the specific case of s3://commoncrawl/, the fall-back of instantiating a client with unauthenticated access is useless anyway.
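The lazy-instantiation concern in the second point could be addressed along these lines. This is only a sketch, not the loader from the pywb2 branch: the class name `LazyS3Loader` and the injected `client_factory` are hypothetical, and the only assumption about the client is that it offers a boto3-style `get_object(Bucket=..., Key=..., Range=...)` call.

```python
class LazyS3Loader:
    """Create the S3 client on first use and keep it afterwards.

    Hypothetical sketch: `client_factory` is any zero-argument callable
    returning an object with a boto3-style `get_object` method.
    """

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self._client = None

    @property
    def client(self):
        # Instantiate once; later calls reuse the same client so that any
        # retry or back-off state kept inside it is preserved.
        if self._client is None:
            self._client = self._client_factory()
        return self._client

    def load(self, bucket, key, byte_range=None):
        params = {"Bucket": bucket, "Key": key}
        if byte_range is not None:
            params["Range"] = byte_range
        # A failing get_object propagates to the caller; the client is
        # deliberately NOT discarded and recreated here.
        return self.client.get_object(**params)
```

In this sketch a failed request leaves the existing client in place. A failure inside the factory itself still propagates, and the next call will attempt creation again.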

@davidtbo

davidtbo commented Apr 4, 2023

I discovered that it works with long-term user creds but not short-term ones. When I got it working, I realized it still didn't pull back the actual content from the WARC files for me; it just gave me the same index info I already had working through Athena.

So then I had to go back and figure out how to extract the gzip data from the S3 WARC files myself. I discovered that my blocker there was that I was doing:

start_byte = int(row[22])
end_byte = start_byte + int(row[23]) # either this or the previous line should have had a -1 in it to shift to zero as the first byte.

And then I finally found an example somewhere where someone added the - 1 to that equation. After that, I could successfully extract and decompress with gzip.

Since I've solved my issue and I don't currently have time to stop and troubleshoot further, I'll have to stop with the feedback that the root cause is that the current code doesn't support short-term creds.

@sebastian-nagel

> with the feedback that the root cause is that the current code doesn't support short-term creds.

Thanks for the feedback. I'll have a look, but it may take some time.

> should have had a -1

The byte range is inclusive, so it is offset -- (offset + length - 1).
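To make the inclusive range concrete, here is a small sketch. The function names are hypothetical; the only facts relied on are that an HTTP `Range` header uses an inclusive end byte while a Python slice uses an exclusive end, and that each record can be decompressed on its own as a gzip member.

```python
import gzip


def http_range(offset, length):
    """Build an HTTP Range header value for a record.

    The end byte is inclusive, hence the "- 1".
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def extract_record(data, offset, length):
    """Cut one gzip member out of an in-memory blob and decompress it.

    A Python slice has an exclusive end, so no "- 1" is needed here.
    """
    return gzip.decompress(data[offset:offset + length])


print(http_range(100, 50))  # bytes=100-149
```

The same offset and length therefore produce `offset + length - 1` in a range request but `offset + length` in a slice, which is an easy place to be off by one.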

For bulk look-ups the columnar index is more efficient, see here. A user even wrote a tutorial on how to automate fetching the WARC records using AWS Lambda.

@davidtbo

davidtbo commented Apr 4, 2023

In that second link, it looks like we can get event-based triggers from the commoncrawl bucket. Is that the case? If so, that's super helpful. I was assuming we couldn't because it was in their account.

However, reading it again, it might just be subscribing to the Athena results landing in my bucket when I do the query for the web pages I want from the index.

@sebastian-nagel

Yes, that's also my understanding: the appearance of a query result file (not on s3://commoncrawl/) triggers the download of all referenced WARC records stored on s3://commoncrawl/.
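The step from a query result file to the individual record requests might look like the sketch below. It is not taken from the tutorial: the function name `fetch_plan` is hypothetical, and it assumes the result file is a CSV with a header row whose query selected the columnar-index columns `warc_filename`, `warc_record_offset` and `warc_record_length`. The file name in the usage note is made up.

```python
import csv
import io


def fetch_plan(csv_text):
    """Turn a query result (CSV text) into (key, Range header) pairs.

    Assumes the columns warc_filename, warc_record_offset and
    warc_record_length are present in the result.
    """
    plan = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        offset = int(row["warc_record_offset"])
        length = int(row["warc_record_length"])
        # Inclusive end byte, as required for a ranged GET.
        byte_range = f"bytes={offset}-{offset + length - 1}"
        plan.append((row["warc_filename"], byte_range))
    return plan
```

Each pair could then be passed to a ranged GET against s3://commoncrawl/. For example, a row with file name `example.warc.gz`, offset 1000 and length 200 yields the range `bytes=1000-1199`.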

3 participants