Implement CDN log file parsers #8010
Conversation
If you ever need to use the logs for counting downloads in the database, let me know. PostgreSQL is very fast and efficient at ingesting CSVs via the `COPY` command.
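A minimal sketch of that kind of `COPY`-based ingestion, assuming the `tokio-postgres` crate; the staging table and column names are made up for illustration:

```rust
use bytes::Bytes;
use futures_util::{pin_mut, SinkExt};
use tokio_postgres::Client;

/// Bulk-load CSV rows (e.g. "crate,version,downloads\n" lines) into a
/// hypothetical `version_downloads_import` staging table via COPY.
async fn copy_csv(client: &Client, csv: String) -> Result<u64, tokio_postgres::Error> {
    // COPY ... FROM STDIN streams the rows over the wire instead of issuing
    // one INSERT per row, which is what makes it so fast.
    let sink = client
        .copy_in("COPY version_downloads_import (crate, version, downloads) FROM STDIN WITH (FORMAT csv)")
        .await?;
    pin_mut!(sink);

    sink.send(Bytes::from(csv)).await?;

    // `finish` completes the COPY and returns the number of rows written.
    sink.finish().await
}
```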
@paolobarbolini the problem with ingesting the download counts into the database is that we don't have the […]. I'm currently playing with an approach like this, which appears to work decently well for a typical real-life log file.
Codecov Report

Attention: […]

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #8010      +/-   ##
==========================================
- Coverage   87.52%   87.51%   -0.01%
==========================================
  Files         256      263       +7
  Lines       25074    25582     +508
==========================================
+ Hits        21945    22388     +443
- Misses       3129     3194      +65
```

☔ View full report in Codecov by Sentry.
That looks like a good approach. You should make sure […]

@paolobarbolini thanks, that's a good hint! remind me when I open the follow-up PR that implements this 😉
I'm wondering whether async makes sense here, since I'd expect it to mostly be CPU-bound work.

I didn't feel comfortable loading the full 80 MB log file into memory. Knowing our traffic graphs, the size of these files will grow quite a bit in the future 😅

If the problem is with streaming the file, you can get a pre-signed download URL from the AWS SDK and then use whatever HTTP client to stream the file download.
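For illustration, a rough sketch of that approach; the bucket/key values are made up, and the exact presigning API differs a bit between AWS SDK versions:

```rust
use std::time::Duration;

use aws_sdk_s3::presigning::PresigningConfig;
use futures_util::StreamExt;

async fn stream_log_file(s3: &aws_sdk_s3::Client, bucket: &str, key: &str) -> anyhow::Result<()> {
    // Ask S3 for a pre-signed GET URL that is valid for a few minutes.
    let presigned = s3
        .get_object()
        .bucket(bucket)
        .key(key)
        .presigned(PresigningConfig::expires_in(Duration::from_secs(300))?)
        .await?;

    // Download with a plain HTTP client (reqwest with the `stream` feature here)
    // and consume the body chunk by chunk, so the whole file is never in memory.
    let response = reqwest::get(presigned.uri().to_string()).await?;
    let mut body = response.bytes_stream();
    while let Some(chunk) = body.next().await {
        let chunk = chunk?;
        // ... feed `chunk` into the decompressor / parser here ...
        println!("received {} bytes", chunk.len());
    }

    Ok(())
}
```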
I totally agree with the two points @paolobarbolini mentioned:
Apart from a question about the public API of this crate, no concerns. Looks good!
```rust
    R: AsyncBufRead + Unpin,
{
    // Read the first byte to determine the file format.
    match reader.read_u8().await? {
```
I'm handling my notification backlog in chronological order, so there's probably context in a later PR or issue that explains this, but is sniffing something we actually need here? Is there a scenario where we won't know the source of a log file?
(Personal bias: I'm automatically mistrustful of anything that looks like data sniffing.)
The alternative is looking at the folder structure on the S3 bucket and choosing the file type based on that. The other alternative: Fastly logs use zstd and CloudFront currently uses gzip, so the compression format itself gives away the source.

Both of those approaches feel a bit brittle though, which is why I implemented the auto-detection.

Auto-detection also makes it easier to parse files locally, since those will probably be in a different folder structure and might not be compressed at all.
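For reference, a minimal sketch of what such magic-byte auto-detection can look like (not necessarily the exact code in this PR), assuming a tokio `AsyncBufRead` input:

```rust
use std::io::{Error, ErrorKind};

use tokio::io::{AsyncBufRead, AsyncBufReadExt};

#[derive(Debug, PartialEq, Eq)]
enum Compression {
    Gzip,
    Zstd,
    Uncompressed,
}

/// Peek at the buffered bytes without consuming them and decide how the
/// stream is compressed based on its magic number.
async fn detect_compression<R: AsyncBufRead + Unpin>(reader: &mut R) -> Result<Compression, Error> {
    // `fill_buf` returns the currently buffered bytes without advancing the
    // reader, so the actual parser can still read the stream from the start.
    // (Simplification: this assumes at least 4 bytes end up in the buffer.)
    let buf = reader.fill_buf().await?;
    Ok(match buf {
        [0x1f, 0x8b, ..] => Compression::Gzip,             // gzip magic number
        [0x28, 0xb5, 0x2f, 0xfd, ..] => Compression::Zstd, // zstd magic number
        [] => return Err(Error::new(ErrorKind::UnexpectedEof, "empty file")),
        _ => Compression::Uncompressed,
    })
}
```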
This provides a slightly nicer API on top of the regular `HashMap` and, most importantly, a decently useful and concise `Debug` output.
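As an illustration of that pattern, a minimal sketch with made-up names (the actual type in this PR may look different):

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical wrapper: download counts keyed by (crate, version).
#[derive(Default)]
struct DownloadsMap(HashMap<(String, String), u64>);

impl DownloadsMap {
    fn add(&mut self, krate: String, version: String, count: u64) {
        *self.0.entry((krate, version)).or_default() += count;
    }
}

// Prints a concise, deterministic map like `{"serde@1.0.193": 42}` instead of
// the nested tuple-key output the derived implementation would produce.
impl fmt::Debug for DownloadsMap {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let mut entries: Vec<_> = self.0.iter().collect();
        entries.sort();
        f.debug_map()
            .entries(entries.iter().map(|((k, v), n)| (format!("{k}@{v}"), *n)))
            .finish()
    }
}
```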
Rebased and feedback addressed. Thanks for the review @LawnGnome! :)
This PR adds a new `crates_io_cdn_logs` package in the repository, with all the code necessary to parse and process log files from our two CDN systems: CloudFront and Fastly. Example log files are included in this PR.

When we receive these log files they are compressed using gzip and zstd respectively, so this PR also contains code to decompress the content on the fly.
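For illustration, a rough sketch of such on-the-fly decompression, assuming the `async-compression` crate with its `tokio`, `gzip` and `zstd` features (not necessarily the exact code in this PR):

```rust
use async_compression::tokio::bufread::{GzipDecoder, ZstdDecoder};
use tokio::io::{AsyncRead, BufReader};

// Same three-way split as in the detection sketch further up in this thread.
enum Compression {
    Gzip,
    Zstd,
    Uncompressed,
}

/// Wrap the raw byte stream in the matching decoder, so downstream code
/// only ever sees decompressed bytes, one buffer at a time.
fn decompress<R>(reader: BufReader<R>, compression: Compression) -> Box<dyn AsyncRead + Unpin + Send>
where
    R: AsyncRead + Unpin + Send + 'static,
{
    match compression {
        // CloudFront log files arrive gzip-compressed.
        Compression::Gzip => Box::new(GzipDecoder::new(reader)),
        // Fastly log files arrive zstd-compressed.
        Compression::Zstd => Box::new(ZstdDecoder::new(reader)),
        // Local test files might not be compressed at all.
        Compression::Uncompressed => Box::new(reader),
    }
}
```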
The code is written in an async, streaming way to avoid buffering the whole log file in memory while processing it. On my local machine, an 80 MB log file with 120k records is processed in roughly 1.6 seconds.
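A simplified sketch of that streaming style; the parsing helper and the path handling below are illustrative, not the PR's actual code:

```rust
use std::collections::HashMap;

use tokio::io::{AsyncBufRead, AsyncBufReadExt};

/// Walk a (decompressed) log file line by line and count downloads,
/// without ever holding the whole file in memory.
async fn count_downloads<R: AsyncBufRead + Unpin>(reader: R) -> std::io::Result<HashMap<String, u64>> {
    let mut counts: HashMap<String, u64> = HashMap::new();

    let mut lines = reader.lines();
    while let Some(line) = lines.next_line().await? {
        // Skip empty lines and header/comment lines (CloudFront prefixes them with '#').
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        if let Some(download) = parse_download(&line) {
            *counts.entry(download).or_default() += 1;
        }
    }

    Ok(counts)
}

/// Naive example: pull "name@version" out of a `/crates/{name}/{name}-{version}.crate`
/// request path. (It assumes the crate name contains no '-', which real code must not.)
fn parse_download(line: &str) -> Option<String> {
    let path = line.split_whitespace().find(|field| field.starts_with("/crates/"))?;
    let file = path.rsplit('/').next()?;
    let name_and_version = file.strip_suffix(".crate")?;
    let (name, version) = name_and_version.split_once('-')?;
    Some(format!("{name}@{version}"))
}
```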
Note that the initial design used the `regex` and `url` crates in a couple of places, but, as the code comments hint at, they proved significantly slower than doing the parsing by hand. Somewhat related: this PR also contains benchmark code for both file formats 😉 (a rough sketch of what such a benchmark can look like follows below).

Feel free to DM me if you need real-life log files to play around with.
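For illustration only, a minimal Criterion harness in that spirit; the sample line and the parsing helper are made up:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Made-up, tab-separated line standing in for a real CDN log record.
const SAMPLE_LINE: &str = "2024-01-16\t12:34:56\tGET\t/crates/serde/serde-1.0.193.crate\t200";

/// Hand-rolled field extraction: split on tabs instead of running a regex.
fn parse_by_hand(line: &str) -> Option<&str> {
    line.split('\t').find(|field| field.starts_with("/crates/"))
}

fn bench_parsing(c: &mut Criterion) {
    c.bench_function("parse_by_hand", |b| {
        b.iter(|| parse_by_hand(black_box(SAMPLE_LINE)))
    });
}

criterion_group!(benches, bench_parsing);
criterion_main!(benches);
```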
Finally: best reviewed commit-by-commit :)