
Optimizations #70

Merged: 18 commits into develop on Nov 6, 2024

Conversation

@noah-weingarden (Contributor) commented Nov 22, 2023

Closes #34, closes #69. I deliberately chose not to make these two changes:

Do not store both map output and partitioned map output. Instead, pipe map output directly to partition function.
Do not store both sorted and unsorted group output. Instead, read partitioned mapper output into memory (one file). Sort it. Pipe sorted output to reducer input.

I'm worried that, although these are legitimate optimizations, they would make Madoop significantly more similar to a project 4 solution. We would essentially be copying code verbatim from project 4, losing some of Madoop's distinctiveness in the process. Open to debate/suggestions on this, though.
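
For concreteness, here is a rough sketch of what the first of those rejected changes could look like: streaming mapper output straight into a partition function instead of writing it to an unpartitioned intermediate file and re-reading it. Everything here (map_exe, partition_files, num_reducers, keyhash) is a hypothetical placeholder, not Madoop's actual API.

```python
import hashlib
import subprocess


def keyhash(key, num_reducers):
    """Assign a key to a reduce partition with an MD5 hash, MapReduce-style."""
    hexdigest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(hexdigest, base=16) % num_reducers


def map_and_partition(map_exe, input_path, partition_files, num_reducers):
    """Run the mapper on one input file and pipe its stdout directly into
    per-partition files, skipping the unpartitioned intermediate file."""
    with open(input_path, encoding="utf-8") as infile:
        with subprocess.Popen(
            [map_exe], stdin=infile, stdout=subprocess.PIPE, text=True
        ) as proc:
            for line in proc.stdout:
                key = line.partition("\t")[0]
                partition_files[keyhash(key, num_reducers)].write(line)
```

Here partition_files would be a list of num_reducers already-open file objects, one per reduce partition.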

With regard to this:

Measure run time with a large input (e.g., EECS 485 Project 5 Wikipedia input) and optimize input partition size.

I measured the runtime of project 5's job 1 and job 2 and found that, on my machine, the partition size doesn't matter very much. Job 1, which is by far the slowest, is entirely unaffected, since the actual input to the framework is minuscule (just a file with ~3,000 filenames, which I consider a severe flaw of the recent HTML dataset changes--see this comment). Job 2 is minimally affected some of the time, but not all the time, and any chunk size above ~18 MB is pointless given the size of the input files. I went with 10 MB as a compromise, since that value and anything higher does seem to shave off a few seconds some of the time.
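
For reference, a hypothetical sketch of how a maximum chunk size drives input partitioning: input files are grouped into map tasks until the next file would push the group over the limit. MAX_CHUNK_SIZE and group_files are illustrative names only, not the framework's real identifiers.

```python
from pathlib import Path

MAX_CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB, the compromise described above


def group_files(input_dir, max_chunk_size=MAX_CHUNK_SIZE):
    """Yield lists of input files whose combined size stays near max_chunk_size.

    Each yielded list becomes one map task, so a larger chunk size means
    fewer, bigger tasks.
    """
    chunk, chunk_size = [], 0
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        size = path.stat().st_size
        if chunk and chunk_size + size > max_chunk_size:
            yield chunk
            chunk, chunk_size = [], 0
        chunk.append(path)
        chunk_size += size
    if chunk:
        yield chunk
```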

codecov bot commented Nov 22, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.43%. Comparing base (5563c74) to head (6da2dd6).
Report is 19 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop      #70      +/-   ##
===========================================
+ Coverage    96.32%   98.43%   +2.11%     
===========================================
  Files            4        4              
  Lines          245      256      +11     
===========================================
+ Hits           236      252      +16     
+ Misses           9        4       -5     


@noah-weingarden (Contributor, Author) commented Nov 22, 2023

Note that because sorting intermediate files now happens in a subprocess, Coverage isn't figuring out that sort_file() actually runs. pytest-cov has documentation about this, but their suggestion doesn't help. Since Coverage is still letting this pass CI, I lean toward ignoring this.

Never mind, I figured out how to solve this. Coverage needed a configuration file in addition to the change suggested in the docs above. This isn't documented at all within pytest-cov, only in its backend, Coverage.py.
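
For context, the general Coverage.py recipe for measuring code that runs in child Python processes (from the "Measuring sub-processes" section of the Coverage.py docs) is to point the COVERAGE_PROCESS_START environment variable at a coverage configuration file and arrange for coverage.process_startup() to run at interpreter startup, e.g. via sitecustomize.py, with parallel data files enabled in the config. This is background on the mechanism, not necessarily the exact change made in this PR.

```python
# sitecustomize.py (somewhere on the test environment's Python path)
# Starts coverage in every child interpreter, provided COVERAGE_PROCESS_START
# points at a coverage config file with `parallel = True` under [run].
import coverage

coverage.process_startup()
```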

@awdeorio (Contributor) commented Mar 29, 2024

I'm worried that, although these are legitimate optimizations, they would make Madoop significantly more similar to a project 4 solution. We would in fact be verbatim ripping off code from project 4, losing some of Madoop's distinctiveness in the process. Open to debate/suggestions on this though.

Good point. I agree with you.

I measured the runtime of project 5's job 1 and job 2 ... since the actual input to the framework is minuscule (just a file with ~3,000 filenames

This should be resolved soon, once https://github.com/eecs485staff/p5-search-engine/pull/710 is finished and merged.

@awdeorio (Contributor) left a comment

Overall LGTM once conflicts are resolved

@noah-weingarden (Contributor, Author) commented
Was there anything special about the 2 MB that #74 changed to, or is 10 MB still good?

@awdeorio (Contributor) commented Apr 1, 2024

Was there anything special about the 2 MB that #74 changed to, or is 10 MB still good?

Not sure. @melodell ?

@melodell (Member) commented Apr 1, 2024

Was there anything special about the 2 MB that #74 changed to, or is 10 MB still good?

Not sure. @melodell ?

Nope, just needed to bump it above 1 MB, since a handful of our documents were > 1 MB.

@awdeorio (Contributor) left a comment

LGTM once the tests pass! I merged in the latest develop and it looks like there's one failure that has to do with spaces in input paths.

EDIT: I think that the parallelization will be nice with https://github.com/eecs485staff/p5-search-engine/pull/710

@noah-weingarden (Contributor, Author) commented Nov 6, 2024

LGTM once the tests pass! I merged in the latest develop and it looks like there's one failure that has to do with spaces in input paths.

Fixed that. Looks like the only issue left is that we got rate-limited uploading Codecov reports. Are we missing a token, perhaps?

@awdeorio (Contributor) commented Nov 6, 2024

Looks like the only issue left is that we got rate-limited uploading Codecov reports. Are we missing a token, perhaps?

Taking a look in #79

@awdeorio merged commit 5bea37d into develop on Nov 6, 2024 (3 checks passed).
@awdeorio deleted the performance-optimizations branch on Nov 6, 2024.
Successfully merging this pull request may close these issues: "Don't open too many output files" and "Performance improvements".