
Implement CDN log file parsers #8010

Merged · 12 commits merged into rust-lang:main on Jan 31, 2024
Conversation

@Turbo87 (Member) commented on Jan 26, 2024

This PR adds a new crates_io_cdn_logs package to the repository, with all the code necessary to parse and process log files from our two CDN systems: CloudFront and Fastly. Example log files are included in this PR.

When we receive these log files they are compressed with gzip and zstd respectively, so this PR also contains code to decompress the content on the fly.

The code is written in an async, streaming style to avoid buffering the whole log file in memory while processing it. On my local machine an 80 MB log file with 120k records is processed in roughly 1.6 seconds.
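For illustration, here is a minimal sketch of that streaming setup using the `async_compression` and `tokio` crates; the function and its wiring are illustrative, not necessarily what the PR does:

```rust
use async_compression::tokio::bufread::GzipDecoder;
use tokio::io::{AsyncBufRead, AsyncBufReadExt, BufReader};

/// Decompress a gzip stream on the fly and visit it line by line, without
/// ever buffering the whole file. The zstd case works the same way with
/// `ZstdDecoder`.
async fn for_each_line<R>(reader: R, mut f: impl FnMut(&str)) -> std::io::Result<()>
where
    R: AsyncBufRead + Unpin,
{
    let decoder = GzipDecoder::new(reader);
    let mut lines = BufReader::new(decoder).lines();
    while let Some(line) = lines.next_line().await? {
        f(line.as_str());
    }
    Ok(())
}
```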

Note that the initial design used the regex and url crates in a couple of places, but, as the code comments hint at, they proved significantly slower than doing the parsing by hand. Somewhat related: this PR also contains benchmark code for both file formats 😉
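As a rough idea of what hand-rolled parsing looks like compared to a regex, here is an illustrative helper that assumes download paths shaped like `/crates/{name}/{name}-{version}.crate`; the real formats handled by the PR may differ:

```rust
/// Extract (crate name, version) from a CDN request path with plain string
/// slicing instead of a regex or full URL parsing.
fn parse_download_path(path: &str) -> Option<(&str, &str)> {
    let rest = path.strip_prefix("/crates/")?;
    let (name, file) = rest.split_once('/')?;
    let version = file
        .strip_prefix(name)?
        .strip_prefix('-')?
        .strip_suffix(".crate")?;
    Some((name, version))
}

// e.g. parse_download_path("/crates/serde/serde-1.0.195.crate")
//          == Some(("serde", "1.0.195"))
```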

Feel free to DM me if you need real-life log files to play around with.

Finally: best reviewed commit-by-commit :)

@Turbo87 added the C-internal 🔧 (Category: Nonessential work that would make the codebase more consistent or clear) and A-backend ⚙️ labels on Jan 26, 2024
@Turbo87 requested a review from a team on January 26, 2024 at 16:33
@paolobarbolini (Contributor)

If you ever need to use the logs for counting downloads in the database let me know. PostgreSQL is very fast and efficient at ingesting CSVs via the COPY command.

@Turbo87 (Member, Author) commented on Jan 26, 2024

@paolobarbolini the problem with ingesting the download counts into the database is that we don't have the version_id yet.

I'm currently playing with an approach like this:

[Screenshot from 2024-01-26 at 15:53:33 showing the approach]

which appears to work decently well for a typical real-life log file:

Number of crates: 5702
Number of needed inserts: 15654
Total number of downloads: 116192
Time to parse: 1.48536s

Inserting into database…
Time to insert: 2.165449s
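For readers who cannot see the screenshot, here is a speculative sketch of what such an approach could look like: COPY the parsed counts into a temporary table, then resolve the `version_id`s with a join. The table and column names and the use of `tokio_postgres` are assumptions for illustration, not the screenshot's code or the actual crates.io schema:

```rust
use tokio_postgres::{Client, Error};

/// Resolve `version_id`s and bump the download counters in a single statement,
/// assuming the parsed (name, version, downloads) rows were already loaded
/// into a temporary `temp_downloads` table and that `version_downloads` has a
/// unique index on (version_id, date).
async fn insert_downloads(client: &Client) -> Result<(), Error> {
    client
        .batch_execute(
            "INSERT INTO version_downloads (version_id, downloads, date)
             SELECT versions.id, temp_downloads.downloads, CURRENT_DATE
             FROM temp_downloads
             JOIN crates ON crates.name = temp_downloads.name
             JOIN versions ON versions.crate_id = crates.id
                          AND versions.num = temp_downloads.version
             ON CONFLICT (version_id, date)
             DO UPDATE SET downloads = version_downloads.downloads + EXCLUDED.downloads",
        )
        .await
}
```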

codecov bot commented on Jan 26, 2024

Codecov Report

Attention: 68 lines in your changes are missing coverage. Please review.

Comparison is base (66041df) 87.52% compared to head (7527730) 87.51%.
Report is 4 commits behind head on main.

❗ Current head 7527730 differs from pull request most recent head c5fd445. Consider uploading reports for the commit c5fd445 to get more accurate results.

| Files | Patch % | Lines |
|---|---|---|
| crates_io_cdn_logs/examples/count_downloads.rs | 0.00% | 56 Missing ⚠️ |
| crates_io_cdn_logs/src/download_map.rs | 89.39% | 7 Missing ⚠️ |
| crates_io_cdn_logs/src/compression.rs | 90.90% | 2 Missing ⚠️ |
| crates_io_cdn_logs/src/cloudfront.rs | 98.94% | 1 Missing ⚠️ |
| crates_io_cdn_logs/src/fastly/json.rs | 98.11% | 1 Missing ⚠️ |
| crates_io_cdn_logs/src/fastly/mod.rs | 98.71% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8010      +/-   ##
==========================================
- Coverage   87.52%   87.51%   -0.01%     
==========================================
  Files         256      263       +7     
  Lines       25074    25582     +508     
==========================================
+ Hits        21945    22388     +443     
- Misses       3129     3194      +65     


@paolobarbolini (Contributor) commented on Jan 26, 2024

> @paolobarbolini the problem with ingesting the download counts into the database is that we don't have the version_id yet.
>
> Number of crates: 5702
> Number of needed inserts: 15654
> Total number of downloads: 116192
> Time to parse: 1.48536s
>
> Inserting into database…
> Time to insert: 2.165449s

That looks like a good approach. You should make sure temp_buffers is big enough to fit the entire temporary table, if it's reasonable to do so, in order to avoid spilling to disk.
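As a concrete sketch of that hint (sizes and names are placeholders, again using `tokio_postgres` for illustration): raise `temp_buffers` before the temporary table is first used, then create the table that the COPY will fill.

```rust
use tokio_postgres::{Client, Error};

/// Prepare the session for bulk-loading: `temp_buffers` can only be changed
/// before the first temporary table is used in a session, so set it first,
/// then create the staging table. The 64MB value is an arbitrary placeholder.
async fn prepare_temp_table(client: &Client) -> Result<(), Error> {
    client
        .batch_execute(
            "SET temp_buffers = '64MB';
             CREATE TEMPORARY TABLE temp_downloads (
                 name TEXT NOT NULL,
                 version TEXT NOT NULL,
                 downloads BIGINT NOT NULL
             )",
        )
        .await
}
```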

@Turbo87 (Member, Author) commented on Jan 26, 2024

@paolobarbolini thanks, that's a good hint! remind me when I open the follow-up PR that implements this 😉

@paolobarbolini (Contributor)

I'm wondering whether async makes sense here, since I'd expect it to mostly be CPU-bound work.

@Turbo87 (Member, Author) commented on Jan 26, 2024

> I'm wondering whether async makes sense here, since I'd expect it to mostly be CPU-bound work.

I didn't feel comfortable loading the full 80 MB log file into memory. Knowing our traffic graphs, the size of these files will grow quite a bit in the future 😅

@paolobarbolini (Contributor)

> > I'm wondering whether async makes sense here, since I'd expect it to mostly be CPU-bound work.
>
> I didn't feel comfortable loading the full 80 MB log file into memory. Knowing our traffic graphs, the size of these files will grow quite a bit in the future 😅

If the problem is with streaming the file, you can get a pre-signed download URL from the AWS SDK and then use whatever HTTP client to stream the file download.
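A minimal sketch of that idea, assuming a pre-signed URL has already been generated elsewhere and that `reqwest` (with its `stream` feature) and `futures_util` are available:

```rust
use futures_util::StreamExt;

/// Stream a log file from a pre-signed URL chunk by chunk instead of
/// buffering the whole body in memory.
async fn stream_log_file(presigned_url: &str) -> Result<(), reqwest::Error> {
    let response = reqwest::get(presigned_url).await?.error_for_status()?;
    let mut body = response.bytes_stream();
    while let Some(chunk) = body.next().await {
        let chunk = chunk?;
        // feed `chunk` into the decompressor / parser here
        let _ = chunk;
    }
    Ok(())
}
```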

@eth3lbert (Contributor)

I totally agree with the two points @paolobarbolini mentioned:

  • Preprocessing the data into CSV format and using the COPY statement to insert it into a temporary table can significantly improve the insert speed when the data volume is large.

  • The temp_buffers setting is also important. You can either declare a large enough size for it initially or choose a suitable batch size.

@LawnGnome (Contributor) left a review comment

Apart from a question about the public API of this crate, no concerns. Looks good!

Resolved review threads: crates_io_cdn_logs/src/paths.rs, crates_io_cdn_logs/src/cloudfront.rs, crates_io_cdn_logs/src/fastly/mod.rs
```rust
    R: AsyncBufRead + Unpin,
{
    // Read the first byte to determine the file format.
    match reader.read_u8().await? {
```
Contributor commented:

I'm handling my notification backlog in chronological order, so there's probably context in a later PR or issue that explains this, but is sniffing something we actually need here? Is there a scenario where we won't know the source of a log file?

(Personal bias: I'm automatically mistrustful of anything that looks like data sniffing.)

@Turbo87 (Member, Author) replied on Jan 31, 2024

the alternative is looking at the folder structure on the S3 bucket and choosing the file type based on that. or the other alternative: Fastly logs are compressed with zstd and CloudFront currently uses gzip, so the compression format could serve as the signal.

both of those approaches feel a bit brittle though, which is why I implemented the auto-detection.

auto-detection also makes it easier to parse files locally, since they will probably be in a different folder structure and might not be compressed at all.
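For context, auto-detection of the compression format can be done from the magic bytes at the start of the file; the sketch below is illustrative and not the PR's actual implementation:

```rust
#[derive(Debug, PartialEq)]
enum Encoding {
    Gzip,
    Zstd,
    Plain,
}

/// Guess the compression format from the first bytes of the file: gzip
/// streams start with 0x1f 0x8b, zstd frames with 0x28 0xb5 0x2f 0xfd.
/// Anything else is treated as uncompressed.
fn detect_encoding(prefix: &[u8]) -> Encoding {
    match prefix {
        [0x1f, 0x8b, ..] => Encoding::Gzip,
        [0x28, 0xb5, 0x2f, 0xfd, ..] => Encoding::Zstd,
        _ => Encoding::Plain,
    }
}
```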

@Turbo87 (Member, Author) commented on Jan 31, 2024

rebased and feedback addressed. thanks for the review @LawnGnome! :)

@Turbo87 enabled auto-merge on January 31, 2024 at 08:49
@Turbo87 merged commit b5d12aa into rust-lang:main on Jan 31, 2024
6 checks passed
Labels
A-backend ⚙️ · C-internal 🔧 (Category: Nonessential work that would make the codebase more consistent or clear)
4 participants