Optimizations #70
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff           @@
##          develop      #70     +/- ##
=======================================
+ Coverage   96.32%   98.43%   +2.11%
=======================================
  Files           4        4
  Lines         245      256      +11
=======================================
+ Hits          236      252      +16
+ Misses          9        4       -5

☔ View full report in Codecov by Sentry.
Never mind, I figured out how to solve this. Coverage needed a configuration file in addition to the change suggested in the docs above. This isn't documented at all within
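The thread doesn't show which settings were needed, but for reference, coverage.py only measures worker processes when explicitly configured to do so. Below is a minimal sketch of the subprocess-measurement approach from coverage.py's documentation; it assumes a coverage config file with `parallel = True` under `[run]` (plus `concurrency = multiprocessing` if workers come from `multiprocessing`) and the `COVERAGE_PROCESS_START` environment variable pointing at that file. The file names and settings here are illustrative, not necessarily the fix used in this PR.

```python
# sitecustomize.py: a sketch of coverage.py's subprocess measurement, not
# necessarily the configuration used for Madoop.  Assumes the environment
# variable COVERAGE_PROCESS_START points at a coverage config file that
# sets `parallel = True` under [run].
import coverage

# Start coverage in every Python process at interpreter startup, so worker
# processes spawned during the tests are measured as well.
coverage.process_startup()
```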
Good point. I agree with you.
This should be resolved soon, once https://github.com/eecs485staff/p5-search-engine/pull/710 is done being edited and merged.
Overall LGTM once conflicts are resolved
Was there anything special about the 2 MB value that #74 changed to, or is 10 MB still good?
LGTM once the tests pass! I merged in the latest develop, and it looks like there's one failure that has to do with spaces in input paths.
EDIT: I think the parallelization will pair nicely with https://github.com/eecs485staff/p5-search-engine/pull/710
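The PR's parallelization code isn't shown in this thread; for illustration only, a map stage parallelized across worker processes could look roughly like the sketch below. The helper names, output naming scheme, and use of `concurrent.futures` are assumptions, not Madoop's actual internals.

```python
# A sketch of running map tasks across worker processes; run_map_task and
# run_map_stage are hypothetical names, not Madoop's actual implementation.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import subprocess


def run_map_task(executable: Path, chunk_path: Path, output_dir: Path) -> Path:
    """Run one map executable over one input chunk, writing one output file."""
    output_path = output_dir / f"map-output-{chunk_path.name}"
    with chunk_path.open("rb") as infile, output_path.open("wb") as outfile:
        # Passing an argument list (rather than a shell string) keeps paths
        # with spaces intact.
        subprocess.run([str(executable)], stdin=infile, stdout=outfile, check=True)
    return output_path


def run_map_stage(executable: Path, chunks: list[Path], output_dir: Path) -> list[Path]:
    """Run one map task per chunk concurrently across CPU cores."""
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(run_map_task, executable, chunk, output_dir)
                   for chunk in chunks]
        return [future.result() for future in futures]
```

Invoking the executable with an argument list, as in the sketch, is also one way to avoid the spaces-in-paths failure mentioned above.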
Fixed that. Looks like the only issue left is that we got rate limited uploading Codecov reports. Are we missing a token, perhaps?
Taking a look in #79.
Closes #34, closes #69. I deliberately chose not to make these two changes:
I'm worried that, although these are legitimate optimizations, they would make Madoop significantly more similar to a project 4 solution. We would, in effect, be copying code verbatim from project 4, losing some of Madoop's distinctiveness in the process. I'm open to debate/suggestions on this, though.
With regard to this:
I measured the runtime of project 5's job 1 and job 2 and found that, on my machine, it honestly doesn't matter very much. Job 1, which is by far the slowest, is entirely unaffected, since the actual input to the framework is minuscule (just a file with ~3,000 filenames, which I consider a severe flaw of the recent HTML dataset changes; see this comment). Job 2 is minimally affected some of the time, but not all of the time, and any chunk size above ~18 MB is pointless given the size of the input files. I went with 10 MB as a compromise, since 10 MB or more does seem to shave off a few seconds some of the time.
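To make the chunk-size tradeoff concrete, here is a rough sketch of size-based input grouping. The constant and the greedy whole-file packing are assumptions for illustration, not Madoop's actual splitting logic.

```python
# A sketch of size-based input chunking; MAX_CHUNK_SIZE and the greedy
# whole-file grouping are illustrative assumptions, not Madoop's exact code.
from pathlib import Path

MAX_CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB, the compromise described above


def group_into_chunks(input_files: list[Path]) -> list[list[Path]]:
    """Greedily pack input files into chunks of at most MAX_CHUNK_SIZE bytes.

    With only ~18 MB of total input, any limit at or above that produces a
    single chunk (and a single map task), so raising the limit further cannot
    help; lower limits produce more chunks and therefore more parallelism.
    """
    chunks: list[list[Path]] = []
    current: list[Path] = []
    current_size = 0
    for path in input_files:
        size = path.stat().st_size
        if current and current_size + size > MAX_CHUNK_SIZE:
            chunks.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```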