
Optimizations #70

Merged: 18 commits into develop on Nov 6, 2024

Conversation

@noah-weingarden (Contributor) commented Nov 22, 2023

Closes #34, closes #69. I deliberately chose not to make these two changes:

Do not store both map output and partitioned map output. Instead, pipe map output directly to partition function.
Do not store both sorted and unsorted group output. Instead, read partitioned mapper output into memory (one file). Sort it. Pipe sorted output to reducer input.

I'm worried that, although these are legitimate optimizations, they would make Madoop significantly more similar to a project 4 solution. We would essentially be copying code verbatim from project 4, losing some of Madoop's distinctiveness in the process. Open to debate/suggestions on this, though.
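
For concreteness, here is a rough sketch of what the first of those rejected changes could look like: streaming mapper output straight into a partition function instead of writing it to an unpartitioned intermediate file and re-reading it. Everything here (map_exe, partition_files, num_reducers, keyhash) is a hypothetical placeholder, not Madoop's actual API.

```python
import hashlib
import subprocess


def keyhash(key, num_reducers):
    """Assign a key to a reduce partition with an MD5 hash, MapReduce-style."""
    hexdigest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(hexdigest, base=16) % num_reducers


def map_and_partition(map_exe, input_path, partition_files, num_reducers):
    """Run the mapper on one input file and pipe its stdout directly into
    per-partition files, skipping the unpartitioned intermediate file."""
    with open(input_path, encoding="utf-8") as infile:
        with subprocess.Popen(
            [map_exe], stdin=infile, stdout=subprocess.PIPE, text=True
        ) as proc:
            for line in proc.stdout:
                key = line.partition("\t")[0]
                partition_files[keyhash(key, num_reducers)].write(line)
```

Here partition_files would be a list of num_reducers already-open file objects, one per reduce partition.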

With regard to this:

Measure run time with a large input (e.g., EECS 485 Project 5 Wikipedia input) and optimize input partition size.

I measured the runtime of project 5's job 1 and job 2 and found that, on my machine, the partition size doesn't matter very much. Job 1, which is by far the slowest, is entirely unaffected, since the actual input to the framework is minuscule (just a file with ~3,000 filenames, which I consider a severe flaw of the recent HTML dataset changes--see this comment). Job 2 is minimally affected some of the time, but not all the time, and any chunk size above ~18 MB is pointless given the size of the input files. I went with 10 MB as a compromise, since that value and anything higher does seem to shave off a few seconds some of the time.
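
For reference, a hypothetical sketch of how a maximum chunk size drives input partitioning: input files are grouped into map tasks until the next file would push the group over the limit. MAX_CHUNK_SIZE and group_files are illustrative names only, not the framework's real identifiers.

```python
from pathlib import Path

MAX_CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB, the compromise described above


def group_files(input_dir, max_chunk_size=MAX_CHUNK_SIZE):
    """Yield lists of input files whose combined size stays near max_chunk_size.

    Each yielded list becomes one map task, so a larger chunk size means
    fewer, bigger tasks.
    """
    chunk, chunk_size = [], 0
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        size = path.stat().st_size
        if chunk and chunk_size + size > max_chunk_size:
            yield chunk
            chunk, chunk_size = [], 0
        chunk.append(path)
        chunk_size += size
    if chunk:
        yield chunk
```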

codecov bot commented Nov 22, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.43%. Comparing base (5563c74) to head (6da2dd6).
Report is 19 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop      #70      +/-   ##
===========================================
+ Coverage    96.32%   98.43%   +2.11%     
===========================================
  Files            4        4              
  Lines          245      256      +11     
===========================================
+ Hits           236      252      +16     
+ Misses           9        4       -5     


@noah-weingarden (Contributor, Author) commented Nov 22, 2023

Note that because sorting intermediate files now happens in a subprocess, Coverage isn't figuring out that sort_file() actually runs. pytest-cov has documentation about this, but their suggestion doesn't help. Since Coverage is still letting this pass CI, I lean toward ignoring this.

Never mind, I figured out how to solve this. Coverage needed a configuration file in addition to the change suggested in the docs above. This isn't documented at all within pytest-cov, only in its backend, Coverage.py.
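
For context, the general Coverage.py recipe for measuring code that runs in child Python processes (from the "Measuring sub-processes" section of the Coverage.py docs) is to point the COVERAGE_PROCESS_START environment variable at a coverage configuration file and arrange for coverage.process_startup() to run at interpreter startup, e.g. via sitecustomize.py, with parallel data files enabled in the config. This is background on the mechanism, not necessarily the exact change made in this PR.

```python
# sitecustomize.py (somewhere on the test environment's Python path)
# Starts coverage in every child interpreter, provided COVERAGE_PROCESS_START
# points at a coverage config file with `parallel = True` under [run].
import coverage

coverage.process_startup()
```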

@awdeorio (Contributor) commented Mar 29, 2024

I'm worried that, although these are legitimate optimizations, they would make Madoop significantly more similar to a project 4 solution. We would in fact be verbatim ripping off code from project 4, losing some of Madoop's distinctiveness in the process. Open to debate/suggestions on this though.

Good point. I agree with you.

I measured the runtime of project 5's job 1 and job 2 ... since the actual input to the framework is minuscule (just a file with ~3,000 filenames

This should be resolved soon, once https://github.com/eecs485staff/p5-search-engine/pull/710 is finished and merged.

@awdeorio (Contributor) left a comment

Overall LGTM once conflicts are resolved

@noah-weingarden (Contributor, Author) commented
Was there anything special about the 2 MB that #74 changed to, or is 10 MB still good?

@awdeorio (Contributor) commented Apr 1, 2024

Was there anything special about the 2 MB that #74 changed to, or is 10 MB still good?

Not sure. @melodell ?

@melodell (Member) commented Apr 1, 2024

Was there anything special about the 2 MB that #74 changed to, or is 10 MB still good?

Not sure. @melodell ?

Nope, just needed to bump it above 1 MB, since a handful of our documents were > 1 MB.

@awdeorio (Contributor) left a comment

LGTM once the tests pass! I merged in the latest develop and it looks like there's one failure that has to do with spaces in input paths.

EDIT: I think that the parallelization will be nice with https://github.com/eecs485staff/p5-search-engine/pull/710

@noah-weingarden (Contributor, Author) commented Nov 6, 2024

LGTM once the tests pass! I merged in the latest develop and it looks like there's one failure that has to do with spaces in input paths.

Fixed that. Looks like the only issue left is that we got rate-limited uploading Codecov reports. Are we missing a token, perhaps?

@awdeorio (Contributor) commented Nov 6, 2024

Looks like the only issue left is that we got rate-limited uploading Codecov reports. Are we missing a token, perhaps?

Taking a look in #79

@awdeorio merged commit 5bea37d into develop on Nov 6, 2024 (3 checks passed).
@awdeorio deleted the performance-optimizations branch on Nov 6, 2024.
Successfully merging this pull request may close these issues: "Don't open too many output files" and "Performance improvements".