Closes #2590 improve_joined: improve performance of *joined* functions #2599

Merged
merged 10 commits into from
Dec 13, 2024

Conversation

bundfussr
Collaborator

@bundfussr bundfussr commented Dec 9, 2024

Thank you for your Pull Request! We have developed this task checklist from the Development Process Guide to help with the final steps of the process. Completing the below tasks helps to ensure our reviewers can maximize their time on your code as well as making sure the admiral codebase remains robust and consistent.

Please check off each taskbox as an acknowledgment that you completed the task or check off that it is not relevant to your Pull Request. This checklist is part of the GitHub Actions workflows and the Pull Request will not be merged into the main branch until you have checked off each task.

  • Place Closes #<insert_issue_number> into the beginning of your Pull Request Title (Use Edit button in top-right if you need to update)
  • Code is formatted according to the tidyverse style guide. Run styler::style_file() to style R and Rmd files
  • Updated relevant unit tests or have written new unit tests, which should consider realistic data scenarios and edge cases, e.g. empty datasets, errors, boundary cases etc. - See Unit Test Guide
  • If you removed/replaced any function and/or function parameters, did you fully follow the deprecation guidance?
  • Review the Cheat Sheet. Make any required updates to it by editing the file inst/cheatsheet/admiral_cheatsheet.pptx and re-upload a PDF and a PNG version of it to the same folder. (The PNG version can be created by taking a screenshot of the PDF version.)
  • Update all relevant roxygen headers and examples, including keywords and families. Refer to the categorization of functions to tag the appropriate keyword/family.
  • Run devtools::document() so all .Rd files in the man folder and the NAMESPACE file in the project root are updated appropriately
  • Address any updates needed for vignettes and/or templates
  • Update NEWS.md under the header # admiral (development version) if the changes pertain to a user-facing function (i.e. it has an @export tag) or documentation aimed at users (rather than developers). A Developer Notes section is available in NEWS.md for tracking developer-facing issues.
  • Build admiral site pkgdown::build_site() and check that all affected examples are displayed correctly and that all new functions occur on the "Reference" page.
  • Address or fix all lintr warnings and errors - lintr::lint_package()
  • Run R CMD check locally and address all errors and warnings - devtools::check()
  • Link the issue in the Development Section on the right hand side.
  • Address all merge conflicts and resolve appropriately
  • Pat yourself on the back for a job well done! Much love to your accomplishment!

@bundfussr bundfussr linked an issue Dec 9, 2024 that may be closed by this pull request

github-actions bot commented Dec 9, 2024

Code Coverage

Package Line Rate Health
admiral 98%
Summary 98% (5175 / 5288)

@manciniedoardo
Collaborator

Discussed in today's admiral call:

  • This issue/PR addresses the fact that derive_vars_joined() crashes due to out-of-memory issues.
  • @bundfussr patched this by altering derive_vars_joined()'s call so that the full join done within the function is performed in lots of little steps.
  • @bundfussr then suggested parallelising these little steps, which in principle should speed up execution. However, the code as it stands would only work on Unix and Linux, not Windows.
  • So any application of parallelisation would need to be carefully evaluated to ensure users don't observe unexpected behaviour working across systems.
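
For readers following along, the "lots of little steps" idea can be sketched roughly as below. This is a minimal base R illustration, not the code in this PR: `chunked_filtered_join()`, its arguments, and the toy data are hypothetical, and it assumes the memory saving comes from joining and filtering one by-group at a time, so that the full joined dataset never has to exist at once.

```r
# Hypothetical helper (not part of admiral): join `left` to `right` one
# by-group at a time and filter each chunk before the pieces are combined.
chunked_filtered_join <- function(left, right, by, keep) {
  suffixes <- c("", ".join") # columns from `right` get the ".join" suffix
  if (nrow(left) == 0) {
    return(merge(left, right, by = by, suffixes = suffixes))
  }
  # One chunk per combination of by-variable values
  chunks <- split(left, left[by], drop = TRUE)
  pieces <- lapply(chunks, function(chunk) {
    # Only the join for the current chunk is held in memory
    joined <- merge(chunk, right, by = by, suffixes = suffixes)
    # `keep` returns a logical vector selecting the joined rows to retain
    joined[keep(joined), , drop = FALSE]
  })
  result <- do.call(rbind, pieces)
  rownames(result) <- NULL
  result
}

# Toy data (made up for illustration)
left <- data.frame(USUBJID = c("1", "1", "2"), ADY = c(1, 5, 3))
right <- data.frame(
  USUBJID = c("1", "1", "2", "2"),
  ADY = c(2, 6, 1, 4),
  AVAL = c(10, 20, 30, 40)
)

# Keep only joined records that occur after the current record
res <- chunked_filtered_join(
  left, right,
  by = "USUBJID",
  keep = function(d) d$ADY.join > d$ADY
)
print(res)
```

In a sketch like this, the `lapply()` call is the natural place where a parallel map could be swapped in, which is where the cross-platform question raised above would come into play.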

Decision: remove the parallelisation code for now and accept that derive_vars_joined() may run slowly for large data (but at least it won't crash!). Revisit parallelisation in 1.3, when we have a few months to organise a package-wide approach, since in principle we could parallelise many more functions.

@bundfussr bundfussr marked this pull request as ready for review December 12, 2024 11:16
Collaborator

@bms63 bms63 left a comment

Is the performance improved for smaller datasets with this update? Would it be sensible to use something like microbenchmark to compare what is on CRAN with our new function?

@bundfussr
Collaborator Author

> Is the performance improved for smaller datasets with this update? Sensible to look at something like microbenchmark for what is on CRAN and our new function?

I didn't test it, but I wouldn't expect improved performance (with respect to running time) because we still do the same steps plus some additional steps.

@bms63 bms63 self-requested a review December 13, 2024 12:49
Collaborator

@bms63 bms63 left a comment

Let's go!

@bms63
Collaborator

bms63 commented Dec 13, 2024

@manciniedoardo any thoughts or can we merge this in?

@bms63 bms63 merged commit bb40ca5 into main Dec 13, 2024
19 checks passed
@bms63 bms63 deleted the 2590_improve_joined branch December 13, 2024 18:16
Development

Successfully merging this pull request may close these issues.

General Issue: Improve derive_*_joined_*() functions
3 participants