-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up group_by
operations in plyranges
#4
Comments
group_by
operations in plyranges
Hi Mike, I was interested when we talked about that at CSAMA, and would be into trying to improve that as a starter to get into package development. Do you think it's doable for someone who does not have experience in package dev or would it be too hard? |
hi Pierre-Paul, it would be great to have your help. I asked Michael Lawrence about this particular issue at BioC2023, and I think one solution if we just want to operate on metadata (not |
ok, so pull out metadata as a tibble and use out of the box tidyverse functions? I guess then the variables in |
Right. You don't need a tibble dependency, you could use
I think blank repo is a good approach |
@ppaxisa please add your authorship details here https://docs.google.com/spreadsheets/d/19XqhN3xAMekCJ-esAolzoWT6fttruSEermjIsrOFcoo/edit?usp=sharing |
I could use some advice here. Sorry if this is obvious, I'm just trying to understand how I've reused @mikelove's gist
when the object is grouped, So I'm wondering, if I should leave most of that structure in place and just work on If it's the later, it could go like this:
Does that make sense? PS: I'm a bit out of depth on the
Not sure whether how important this is? But I think I can avoid touching this maybe... |
Thanks @ppaxisa this is great.
I think this way, as the
why so? I am also out of my depth with the NSE in plyranges. We should set up a time for @sa-lee and/or @lawremi to walk us through how we can contribute in a way that doesn't break existing code. |
Sounds good, thanks for the feedback! re: need for core columns:
I think some classical use cases would be:
if we go towards this I think maybe one useful thing would be to implement a |
I thought we were going down a path where: 1) grouping hasn't involved a core column, 2) mutate hasn't made use of a core column. And then we are just handling the mutate operation, so then a subsequent |
well yes I guess that was the original idea but I think it does not increase the complexity too much to use core columns as inputs as long as we don't modify them. We could throw an error if the call tries to modify the core columns. This is roughly what I had in mind:
One thing I'm not sure about is how to manage groupings when we convert to data.frame. So here I use Regarding Does that make sense? please tell me if I am going totally off-road ^^ |
This makes sense to me, but I'm interested to hear Stuart's thoughts on Tuesday. |
Met with Stuart who had some useful comments, we recorded the call so I can share that with you. Re:
I think now that it should be a fork, because we need to tool around in the plyranges code base to define when to call |
Yes, thanks for the recording, I just listened to it! Should I fork it to the tidyomics org or to my GitHub? |
Your GH is fine I think. |
Hi Mike, I set up the repo here, with a branch to rewrite the mutate as a start, see mutate code. I have a seemingly weird problem: when I run test, I get no errors:
but when I run the check, I get an error related to my rewriting of the
I don't understand why the tests pass but when running the checks it fails. I realise too that I was using the Rstudio GUI for running test and check operations but I could also do it with devtools. It seems devtools offers more flexibility for debugging? |
Update: I realise
|
Awesome, I will check this out this week. I'm flying tomorrow to a conference, but can pull your repo and poke around during downtime. |
I have solved the 2nd error related to Regarding the first error of
Update: I think the simplest solution is to |
Hi Mike, I've done some reading on the 2 last error checks that pertain to |
Sorry for the silence, past two weeks I was traveling a lot. This makes sense to me, if a user wants to do mutate on a GroupedGenomicRanges, we can offer a big speed-up when operating on non-core columns, so long as these play well with dplyr (not classes defined in S4Vectors). This seems reasonable for now, and will speed up many operations. Our motivating example was working with transcripts/exons and wanting to group by gene, and from TxDb or EnsDb, you get back character columns. |
@ppaxisa sorry for the delay again, been a busy Fall as I'm starting to teach a new module. I'm happy to come through and start testing if that would be useful? |
Yes! But I have not submitted a commit to drop the tests on the |
Hi Mike, sorry for the slow turn around. I've made some commits to solve the last test errors: basically coercing all
PS: I did not manage to regenerate the documentation so I fixed some code in examples Rd files manually. Not sure why exit status is 1 if there are only warnings. Is that normal? |
I think CHECK does exit status 1 for both warnings and errors. So these generics and methods have different signature, did that pre-exist your additions? Remind me, in the |
RE generics and methods, yes all of those existed before and I did not touch it. RE |
re: generics and methods, ok so we can PR regardless of those warnings i think. Let's check in with @sa-lee on that because I think the last time I spoke with him we were thinking of a fork:
...because from what I remember, mutate on ungrouped ranges wasn't so bad, not nearly as bad as mutate on grouped ranges, which was getting passed to lapply. Stuart even optimized the former IIRC. |
Sounds good. Also I wanted to run the coercion back to
I'm just putting a |
Ok I should be able to check today |
November slipped away from me. @ppaxisa what would it take to scale back from modification of all |
Hi Mike, I don't think too much work: I rewrote the 2 internal functions to be "tidyverse-dependent". But I could bring back the original ones and split between original vs new functions when dealing with non-grouped vs grouped objects. I'm a bit busy with faculty applications rn, but I'll try to push a commit over the break if I get a moment. |
Ok let’s loop back later. |
Hi Mike, here I am! Finally found some time to roll back some changes so we use the original Here is the
Still a few old warnings (it was already there before). I've branched this version at https://github.com/ImmuneAxisa/plyranges/tree/mutate_grp_revamp |
awesome! I will check this out right now. |
This is fantastic, I tried out the test example at the top of the issue. With the Bioc version:
And your new code:
20x speedup!
I've got a PR ready for plyranges also, would be great to get these in before April. |
Let’s get this in! Wanna PR? separately, do you think I can solve the S3 generic/method consistency issue by changing .data to .tbl? I’ll work on that in the new devel branch. |
I can submit a PR but it says "This branch is 10 commits ahead of, 22 commits behind tidyomics/plyranges:devel. Should I sync my repo first to the latest devel version? I have mixed feelings about this method consistency: it's confusing to pretend that the input to the function is a tibble when it's not. But if warnings are not allowed to maintain the package on bioconductor, I guess we need to tend to it... |
I think you could sync or not. I don’t think that matters for the PR. I’ll investigate the warnings more this week: |
group_by
followed by summarization in plyranges is not as fast as dplyr on tibbletidyomics/plyranges#30
However, an even more important issue is that
group_by
without summarization is much slowerhttps://gist.github.com/mikelove/d788831af3cf76de642ba03af7a0124b
Discovered in this context
https://github.com/tidyomics/tidy-ranges-tutorial/blob/main/isoform-analysis.Rmd#L86-L90
The text was updated successfully, but these errors were encountered: