
Implement CDN log file parsers #8010

Merged · 12 commits merged into rust-lang:main on Jan 31, 2024
Conversation

@Turbo87 (Member) commented on Jan 26, 2024

This PR adds a new crates_io_cdn_logs package to the repository, with all the code necessary to parse and process log files from our two CDN systems: CloudFront and Fastly. Example log files are included in this PR.

When we receive these log files they are compressed with gzip and zstd respectively, so this PR also contains code to decompress the content on the fly.

The code is written in an async, streaming style to avoid buffering the whole log file in memory while processing it. On my local machine an 80 MB log file with 120k records is processed in roughly 1.6 seconds.
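For illustration, here is a minimal sketch of that streaming setup using the `async_compression` and `tokio` crates; the function and its wiring are illustrative, not necessarily what the PR does:

```rust
use async_compression::tokio::bufread::GzipDecoder;
use tokio::io::{AsyncBufRead, AsyncBufReadExt, BufReader};

/// Decompress a gzip stream on the fly and visit it line by line, without
/// ever buffering the whole file. The zstd case works the same way with
/// `ZstdDecoder`.
async fn for_each_line<R>(reader: R, mut f: impl FnMut(&str)) -> std::io::Result<()>
where
    R: AsyncBufRead + Unpin,
{
    let decoder = GzipDecoder::new(reader);
    let mut lines = BufReader::new(decoder).lines();
    while let Some(line) = lines.next_line().await? {
        f(line.as_str());
    }
    Ok(())
}
```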

Note that the initial design used the regex and url crates in a couple of places, but, as the code comments hint at, they proved significantly slower than doing the parsing by hand. Somewhat related: this PR also contains benchmark code for both file formats 😉
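As a rough idea of what hand-rolled parsing looks like compared to a regex, here is an illustrative helper that assumes download paths shaped like `/crates/{name}/{name}-{version}.crate`; the real formats handled by the PR may differ:

```rust
/// Extract (crate name, version) from a CDN request path with plain string
/// slicing instead of a regex or full URL parsing.
fn parse_download_path(path: &str) -> Option<(&str, &str)> {
    let rest = path.strip_prefix("/crates/")?;
    let (name, file) = rest.split_once('/')?;
    let version = file
        .strip_prefix(name)?
        .strip_prefix('-')?
        .strip_suffix(".crate")?;
    Some((name, version))
}

// e.g. parse_download_path("/crates/serde/serde-1.0.195.crate")
//          == Some(("serde", "1.0.195"))
```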

Feel free to DM me if you need real-life log files to play around with.

Finally: best reviewed commit-by-commit :)

@Turbo87 added the C-internal 🔧 (Category: Nonessential work that would make the codebase more consistent or clear) and A-backend ⚙️ labels on Jan 26, 2024
@Turbo87 requested a review from a team on January 26, 2024 at 16:33
@paolobarbolini (Contributor)

If you ever need to use the logs for counting downloads in the database let me know. PostgreSQL is very fast and efficient at ingesting CSVs via the COPY command.

@Turbo87 (Member, Author) commented on Jan 26, 2024

@paolobarbolini the problem with ingesting the download counts into the database is that we don't have the version_id yet.

I'm currently playing with an approach like this:

[Screenshot from 2024-01-26 at 15:53:33 showing the approach]

which appears to work decently well for a typical real-life log file:

Number of crates: 5702
Number of needed inserts: 15654
Total number of downloads: 116192
Time to parse: 1.48536s

Inserting into database…
Time to insert: 2.165449s
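For readers who cannot see the screenshot, here is a speculative sketch of what such an approach could look like: COPY the parsed counts into a temporary table, then resolve the `version_id`s with a join. The table and column names and the use of `tokio_postgres` are assumptions for illustration, not the screenshot's code or the actual crates.io schema:

```rust
use tokio_postgres::{Client, Error};

/// Resolve `version_id`s and bump the download counters in a single statement,
/// assuming the parsed (name, version, downloads) rows were already loaded
/// into a temporary `temp_downloads` table and that `version_downloads` has a
/// unique index on (version_id, date).
async fn insert_downloads(client: &Client) -> Result<(), Error> {
    client
        .batch_execute(
            "INSERT INTO version_downloads (version_id, downloads, date)
             SELECT versions.id, temp_downloads.downloads, CURRENT_DATE
             FROM temp_downloads
             JOIN crates ON crates.name = temp_downloads.name
             JOIN versions ON versions.crate_id = crates.id
                          AND versions.num = temp_downloads.version
             ON CONFLICT (version_id, date)
             DO UPDATE SET downloads = version_downloads.downloads + EXCLUDED.downloads",
        )
        .await
}
```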

codecov bot commented on Jan 26, 2024

Codecov Report

Attention: 68 lines in your changes are missing coverage. Please review.

Comparison is base (66041df) 87.52% compared to head (7527730) 87.51%.
Report is 4 commits behind head on main.

❗ Current head 7527730 differs from pull request most recent head c5fd445. Consider uploading reports for the commit c5fd445 to get more accurate results.

| Files | Patch % | Lines |
|---|---|---|
| crates_io_cdn_logs/examples/count_downloads.rs | 0.00% | 56 Missing ⚠️ |
| crates_io_cdn_logs/src/download_map.rs | 89.39% | 7 Missing ⚠️ |
| crates_io_cdn_logs/src/compression.rs | 90.90% | 2 Missing ⚠️ |
| crates_io_cdn_logs/src/cloudfront.rs | 98.94% | 1 Missing ⚠️ |
| crates_io_cdn_logs/src/fastly/json.rs | 98.11% | 1 Missing ⚠️ |
| crates_io_cdn_logs/src/fastly/mod.rs | 98.71% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8010      +/-   ##
==========================================
- Coverage   87.52%   87.51%   -0.01%     
==========================================
  Files         256      263       +7     
  Lines       25074    25582     +508     
==========================================
+ Hits        21945    22388     +443     
- Misses       3129     3194      +65     


@paolobarbolini (Contributor) commented on Jan 26, 2024

> @paolobarbolini the problem with ingesting the download counts into the database is that we don't have the version_id yet.
>
> Number of crates: 5702
> Number of needed inserts: 15654
> Total number of downloads: 116192
> Time to parse: 1.48536s
>
> Inserting into database…
> Time to insert: 2.165449s

That looks like a good approach. You should make sure temp_buffers is big enough to fit the entire temporary table, if it's reasonable to do so, in order to avoid spilling to disk.
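As a concrete sketch of that hint (sizes and names are placeholders, again using `tokio_postgres` for illustration): raise `temp_buffers` before the temporary table is first used, then create the table that the COPY will fill.

```rust
use tokio_postgres::{Client, Error};

/// Prepare the session for bulk-loading: `temp_buffers` can only be changed
/// before the first temporary table is used in a session, so set it first,
/// then create the staging table. The 64MB value is an arbitrary placeholder.
async fn prepare_temp_table(client: &Client) -> Result<(), Error> {
    client
        .batch_execute(
            "SET temp_buffers = '64MB';
             CREATE TEMPORARY TABLE temp_downloads (
                 name TEXT NOT NULL,
                 version TEXT NOT NULL,
                 downloads BIGINT NOT NULL
             )",
        )
        .await
}
```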

@Turbo87 (Member, Author) commented on Jan 26, 2024

@paolobarbolini thanks, that's a good hint! remind me when I open the follow-up PR that implements this 😉

@paolobarbolini (Contributor)

I'm wondering whether async makes sense here, since I'd expect it to mostly be CPU-bound work.

@Turbo87 (Member, Author) commented on Jan 26, 2024

> I'm wondering whether async makes sense here, since I'd expect it to mostly be CPU-bound work.

I didn't feel comfortable loading the full 80 MB log file into memory. Knowing our traffic graphs, the size of these files will grow quite a bit in the future 😅

@paolobarbolini (Contributor)

> > I'm wondering whether async makes sense here, since I'd expect it to mostly be CPU-bound work.
>
> I didn't feel comfortable loading the full 80 MB log file into memory. Knowing our traffic graphs, the size of these files will grow quite a bit in the future 😅

If the problem is with streaming the file, you can get a pre-signed download URL from the AWS SDK and then use whatever HTTP client to stream the file download.
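A minimal sketch of that idea, assuming a pre-signed URL has already been generated elsewhere and that `reqwest` (with its `stream` feature) and `futures_util` are available:

```rust
use futures_util::StreamExt;

/// Stream a log file from a pre-signed URL chunk by chunk instead of
/// buffering the whole body in memory.
async fn stream_log_file(presigned_url: &str) -> Result<(), reqwest::Error> {
    let response = reqwest::get(presigned_url).await?.error_for_status()?;
    let mut body = response.bytes_stream();
    while let Some(chunk) = body.next().await {
        let chunk = chunk?;
        // feed `chunk` into the decompressor / parser here
        let _ = chunk;
    }
    Ok(())
}
```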

@eth3lbert (Contributor)

I totally agree with the two points @paolobarbolini mentioned:

  • Preprocessing the data into CSV format and using the COPY statement to insert it into a temporary table can significantly improve the insert speed when the data volume is large.

  • The temp_buffers setting is also important. You can either declare a large enough size for it initially or choose a suitable batch size.

@LawnGnome (Contributor) left a review comment

Apart from a question about the public API of this crate, no concerns. Looks good!

Resolved review threads: crates_io_cdn_logs/src/paths.rs, crates_io_cdn_logs/src/cloudfront.rs, crates_io_cdn_logs/src/fastly/mod.rs
```rust
    R: AsyncBufRead + Unpin,
{
    // Read the first byte to determine the file format.
    match reader.read_u8().await? {
```
Contributor commented:

I'm handling my notification backlog in chronological order, so there's probably context in a later PR or issue that explains this, but is sniffing something we actually need here? Is there a scenario where we won't know the source of a log file?

(Personal bias: I'm automatically mistrustful of anything that looks like data sniffing.)

@Turbo87 (Member, Author) replied on Jan 31, 2024

the alternative is looking at the folder structure on the S3 bucket and choosing the file type based on that. or the other alternative: Fastly logs are compressed with zstd and CloudFront currently uses gzip, so the compression format could serve as the signal.

both of those approaches feel a bit brittle though, which is why I implemented the auto-detection.

auto-detection also makes it easier to parse files locally, since they will probably be in a different folder structure and might not be compressed at all.
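For context, auto-detection of the compression format can be done from the magic bytes at the start of the file; the sketch below is illustrative and not the PR's actual implementation:

```rust
#[derive(Debug, PartialEq)]
enum Encoding {
    Gzip,
    Zstd,
    Plain,
}

/// Guess the compression format from the first bytes of the file: gzip
/// streams start with 0x1f 0x8b, zstd frames with 0x28 0xb5 0x2f 0xfd.
/// Anything else is treated as uncompressed.
fn detect_encoding(prefix: &[u8]) -> Encoding {
    match prefix {
        [0x1f, 0x8b, ..] => Encoding::Gzip,
        [0x28, 0xb5, 0x2f, 0xfd, ..] => Encoding::Zstd,
        _ => Encoding::Plain,
    }
}
```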

@Turbo87 (Member, Author) commented on Jan 31, 2024

rebased and feedback addressed. thanks for the review @LawnGnome! :)

@Turbo87 enabled auto-merge on January 31, 2024 at 08:49
@Turbo87 merged commit b5d12aa into rust-lang:main on Jan 31, 2024
6 checks passed
Labels
A-backend ⚙️ · C-internal 🔧 (Category: Nonessential work that would make the codebase more consistent or clear)
4 participants