State of the project #3653
Comments
@lbeltrame, thank you for your post and for being a long-term contributor/user of bcbio. We appreciate your concern and succinct summary of some of the challenges. Your questions are timely as we (the Harvard Chan Bioinformatics Core) are in the process of writing up a proposal for the CZI EOSS Cycle 5 awards and have been contemplating how best to continue to maintain and develop bcbio.
I agree that bcbio has several advantages over other workflow systems, including the ones you have mentioned. Our challenge is that many of our developers have since moved on to other positions and can no longer support this project. @naumenko-sa has been doing a fantastic job trying to juggle ad hoc support with his research/consulting projects in the core, but he is now the sole contributor in our group and this work has not been funded since the start of the year (hence the application for funds to support this effort for the next two years).
I’ll leave it to @naumenko-sa to comment on the largest low-level issues. In terms of help, we need developers who are familiar with the bcbio ecosystem to become more involved. We have attempted to recruit additional team members but have found it challenging to find the right combination of skills for this project. If we could identify someone who is already familiar with how bcbio works, understands the biological applications, is motivated and has the skills to participate, we’d be interested in speaking to that person (see also https://bioinformatics.sph.harvard.edu/careers; though it does not mention bcbio, if someone comes along who is a good fit for it, we can adjust). In the long term, however, this project does need, as you so aptly put it, “to raise awareness in the community to increase the critical mass needed to go forward.”
The application we’re putting together now shows that bcbio is widely used (367 published papers, 133,000 downloads from Bioconda over the past three years, ~1400 unique visitors to this GitHub repository over the past two weeks). Some questions for the community here:
* What would help to encourage others to contribute? Should we consider a hackathon, an annual meeting, other events that would bring this community together?
* What are the technical roadblocks for other contributors? How can we help make it easier?
I have further thoughts but would love to hear from the community. I encourage anyone interested in contributing more to contact us to help galvanize this effort.
Hi @lbeltrame! Thanks for bringing this issue up and also for your many years of commitment to bcbio as a contributor! If anybody can run another installation test to check the solution to the openssl issue, that would be helpful. The next thing is to bring in the T2T reference, which I am also interested in doing. Other than that, I don't think there are any burning issues; please post links here if I missed anything super important.
The time of other top bcbio contributors who have done fantastic work (Rory, Lorena, Michael, Brad) is not available now. Everyone in the bcbio community is grateful to them: they put in a lot of effort and created a very robust code base. I am trying my best to take on the issues on GitHub, given the time/budget limitations (my major effort goes to bioinformatics consulting at https://bioinformatics.sph.harvard.edu/).
Also, I can't ignore an elephant in the room: https://github.com/vshymanskyy/StandWithUkraine. The last 3 months have been devastating. Please support Ukraine in any way you can!
Overall, we are trying to make sure bcbio is working for major use cases and we are using it to process data in our projects, i.e., bcbio is now in maintenance rather than active development mode. Our projects lately included bulk RNA-seq, SV calling in WGS, WGBS, CHIP calling (T only somatic on germline data). If anybody could contribute more to the maintenance of the PureCN pipeline, T/N, T only, and UMI pipelines to complement the set of projects our group receives, that would be greatly appreciated.
Many groups are running new projects or production on Illumina Dragen or Broad Terra, which is understandable given the speed of Dragen (30 min for a WGS, 10 min for RNA-seq) and the scalability of Terra workspaces. The remaining bcbio niche is small labs and projects, underfunded labs and the specific use cases you have mentioned (some of the downstream analyses and integration), and we continue supporting bcbio for them.
The big topic that needs to be addressed in bcbio is bringing back container support for separate pipelines - help needed. More specifically:
Please feel free to DM me at snaumenko[at]the same domain as our main contact if you have more questions or suggestions. SN
Hi, As for the comments:
I'm not aware of any issues. We recently updated VEP in our bcbio installation without a hitch. I'll look into pushing another update if I find the time.
The PR had been open for a while, but there seemed to be little interest in supporting a new aligner, so I abandoned the effort. We just replaced
We're using dbNSFP v4.3a, with the config in the PR. As you said, the preprocessing to make the dataset usable is immense, although I think a week is a bit of an overstatement. I ran the download and processing overnight, but this may depend largely on individual internet connections and disk speeds. As I see it, the paradigm has shifted from bringing the data to the analysis to the other way around. As it is now, bcbio is a huge monolith, which sometimes makes it difficult to use as a portable analysis platform. In my opinion, the following would pull bcbio back to the front page:
The pipeline offered by bcbio is rock solid, so I think the project should focus on that. There are many workflow languages that can be used to make bcbio more portable. There's no longer a need to maintain a scheduling manager to track and run jobs. I realise this is not a trivial task, but embracing a workflow language could make the codebase far easier to maintain. I also think cloudbiolinux should be pulled or forked into the bcbio organisation and stripped down to only handle the bcbio installation in its current form. There's no need for ansible, homebrew or all the other complexities (in my experience). Anyway, just my 5 cents. Keep up the good work!
Luca, Matthias, Sergey and Shannan;
Thanks a lot for the responses to this thread!
I figured as much: this is an endemic problem with maintaining software in the context of doing research and certainly not anyone's fault.
My opinion, as someone deeply involved in Free Software projects also outside my profession: something that would be truly useful, if possible, would be identifying "junior jobs" or low-hanging fruit for contributors to hack on. This helps in getting familiar with the codebase without getting bogged down in overly large projects. A hackathon or something like that would be useful as well for those wanting to get into bcbio, or at least to see what can be changed / improved. But even there, my suggestion for outsiders is: start small. To be honest, I didn't back in the day, but in retrospect, I should have. ;)
The code is quite large, and in particular the handling of the "outside" dependencies can be daunting. In addition, the "correct" way of doing things in 2013-2015 changed in later years. As I said, the first step would be to identify (bite-sized) areas for improvement.
While in "maintenance mode", there are a few areas where I think that new improvements can land. For example, sWGS pipelines (the gold standard is QDNAseq, although it is fairly old) or specific ctDNA methods like ichorCNA (in active development). Then, as you said, UMI pipelines are something worth looking into (e.g., dual UMI solutions like IDT's). Other stuff, at a fairly lower level, would be benchmarking. There's already quite a lot of stuff for variants (very useful) but little else (that said, I have no idea how much benchmarking is around for, say, CNVs). I also wouldn't mind dropping support for some software if maintainability becomes a problem.
One of the reasons I'm "pushing" for bcbio is because outside the US, even in a G7 country like mine, policies, availability of funding, and concerns ("the cloud") make relying on such platforms untenable. At my previous job there were zero resources to run analyses like that (and the connectivity wouldn't support it) and bcbio allowed us to run many (10+K jobs) analyses on inexpensive, old-generation hardware. This software is IMO invaluable for those who work on-premises.
You mean, kind of like what Nextflow does? Probably this is a worthy long-term goal. Splitting it into chunks of what "needs to be done" would help in prioritizing work.
While perhaps I have a weaker opinion on this, I too honestly don't see the need for cloudbiolinux as a separate project at this stage. Clearly it grew organically, but perhaps a "diet" would help, and I would support folding it into bcbio. Agreed on the scheduling, which is now handled better than when support was introduced, but as far as I can see this is deeply tied into bcbio's internals, so that would probably be a longer-term effort.
This ⏫. I work at a hospital, and absolutely NO data can leave the premises without possible legal issues.
Hi everyone, happy to see renewed interest in bcbio! @naumenko-sa could I also ask for your help regarding a recent issue, where we're unable to install the sacCer3 genome data in a fresh install? #3652 @marianastase0912 is building some wrappers for bcbio as part of her B.Sc. diploma project and we've been using sacCer3 data for building demos. We were able to install the genome data before the latest bcbio release, but this stopped working in the meantime, for some reason.
@amizeranschi please try installing sacCer3 again.
DISCLAIMER: This post is not meant to raise undue criticism and / or fuel anxiety about the project. It is meant as a way to kickstart discussion for those who are interested in improving it.
It is evident to anyone following this project that its development (for many, probably justified, reasons) has de facto stalled (no new commits for a month and a half). Low activity is not per se a problem, but it can be when PRs stay open for a while, and some are closed after a while (for example support for additional BWA methods, or newer VEP), with no one actually reviewing them.
This post is meant, as I wrote above, to enquire about how bcbio stands today, and whether the community can do anything to help. I understand that the landscape in the past years has changed considerably, with WDL, CWL, Snakemake and Nextflow entering the fray, and containerization of analyses (more Apptainer / Singularity than Docker, but that's my personal opinion, and out of scope). However, bcbio gets some things right that other workflows don't, yet:
OTOH, it is clear that some technical debt has accumulated over the years, and this can prevent further improvement. It would be sad to see this project wither away, so my question is: what can be done to help?
I'm aware of a roadmap issue but as far as I can see it's fairly high level. I'm also aware that multiple hands are needed because the codebase is large, and that also involves cloudbiolinux.
So, these questions go to @naumenko-sa and @roryk:
Of course I'd love input from some other contributors here.