can we get champagne run to work without the need for init? #178

Open
kelly-sovacool opened this issue Feb 19, 2024 · 3 comments

Labels
cli Related to the Command Line Interface

Comments

@kelly-sovacool
Member

`nextflow run nf-core/chipseq -profile test,docker --outdir output` just works for nf-core pipelines, with no need to download anything first. In theory we should be able to get the same behavior for our Nextflow pipelines, so that `champagne run -profile test ...` just works without initializing anything.
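
One way this fallback could look, as a minimal Python sketch. The function name `build_run_command` and the `packaged_pipeline` directory are hypothetical, not taken from champagne's code; the sketch assumes the installed tool ships its own copy of `main.nf`.

```python
from pathlib import Path
from typing import List, Sequence


def build_run_command(
    workdir: Path, packaged_pipeline: Path, extra_args: Sequence[str]
) -> List[str]:
    """Build a `nextflow run` command that works with or without a prior init.

    Hypothetical helper: if the working directory has its own main.nf
    (for example, copied there by init), use it; otherwise fall back to
    the copy assumed to be bundled with the installed tool.
    """
    local_main = workdir / "main.nf"
    target = local_main if local_main.is_file() else packaged_pipeline / "main.nf"
    # Pass any remaining CLI arguments (e.g. -profile test) through unchanged.
    return ["nextflow", "run", str(target), *extra_args]
```

With this shape, an initialized directory keeps taking precedence, so existing projects would behave as before while an empty directory would still run.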

@kelly-sovacool
Member Author

init could still be helpful as an optional helper to copy an example samplesheet or params.yml file for users to customize for their needs, but this way it could be optional instead of required.

kelly-sovacool added the cli label on Feb 19, 2024

@slsevilla
Contributor

offering my thoughts on this: i think (biased, see below) init is crucial for reproducibility and tracking, and while it's nice to be able to skip it in dev, i don't think it should be optional for production.

from a historical standpoint, init has evolved from simply copying some configs and assorted files to the output dir into how it is now set up on most pipelines. i've pushed for and added (hence the bias) a lot of the init-related code in our pipelines because of experiences i've had with my own work and with assisting others running our pipelines. a few thoughts on why it should remain a requirement for production:

  1. (for all users) it saves all configuration metrics and input params with the output directory. this ensures that if we go back to the project months down the line, we don't have to dig through log files to know which params were used (and it's clearly labeled, so PIs can look it up without asking us). when i write manuscripts i always go to the tools json/config file to pull all the versions (or containers) for the methods section. no searching through logs.
  2. (power users) related: running more than one project on the same pipeline at the same time makes for dangerous outcomes if you haven't initialized a single source for configs etc.
  3. (for all users) similarly, it provides predefined manifests for users to update, rather than having to create them from scratch. this is the number one headache for new users with new pipelines: the manifest file is formatted incorrectly, even with the best documentation.
  4. (power users) it copies the scripts that were run to create the outputs directly into the output directory. this lets users quickly make project-specific updates to the code, while saving those changes in the output dir for reproducibility.

all that aside, for dev, running without init is super helpful, especially if the pipeline has test profiles. in that case you're just running on what is already set up and defined, and you don't really care about history. I think that's a perfect use case for it, but not so much production.

@kelly-sovacool
Member Author

Currently the init implemented here is substantially smaller and has fewer features than the init of the Snakemake workflows. We can definitely satisfy all of the goals you listed; it's just a matter of deciding how that should be implemented in our Nextflow pipelines.

I think some of the reproducibility goals may be better handled at run time rather than during initialization. We do plan to eventually get all processes to output their software versions just like the nf-core pipelines (see #27) -- in those pipelines that's something handled at run time because the version can change depending on which container is used for a given process.

Config options set at the CLI can also change between runs/reruns in the same directory, so we'll want to make sure we capture those at run time too. I think that may already be handled by Nextflow's built-in execution report, but we should double check. If it isn't, we should make sure we output a time-stamped file with an exhaustive list of the params used.
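
A rough sketch of what writing such a time-stamped params file could look like in Python. The function name `write_params_snapshot`, the `params_<timestamp>.json` naming scheme, and the choice of JSON are all assumptions for illustration, not something already in the pipeline.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict


def write_params_snapshot(params: Dict[str, Any], outdir: Path) -> Path:
    """Write the resolved parameters to a time-stamped JSON file.

    Hypothetical helper: `params` is assumed to be the fully resolved
    set of parameters for this run (CLI options merged with defaults).
    """
    outdir.mkdir(parents=True, exist_ok=True)
    # UTC timestamp so snapshots from reruns sort chronologically by name.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = outdir / f"params_{stamp}.json"
    # sort_keys keeps snapshots easy to diff; default=str covers
    # values such as Path objects that JSON cannot encode directly.
    text = json.dumps(params, indent=2, sort_keys=True, default=str)
    path.write_text(text + "\n")
    return path
```

Because the timestamp has one-second resolution, two runs started within the same second would write to the same file; a run ID could be added to the name if that matters.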

I do think copying boilerplate config/params files, example sample sheets, etc. are good examples of tasks best handled by init.
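
As an illustration only, an optional init along those lines might be sketched like this in Python. `init_templates` and the template directory layout are hypothetical names, and skipping existing files is a design assumption rather than current behavior.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def init_templates(
    template_dir: Path, project_dir: Path, names: Iterable[str]
) -> List[Path]:
    """Copy template files into project_dir, skipping any that already exist.

    Hypothetical helper: returns the list of files that were actually copied.
    """
    project_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        dest = project_dir / name
        if dest.exists():
            # Never clobber a file the user may already have customized.
            continue
        shutil.copy2(template_dir / name, dest)
        copied.append(dest)
    return copied
```

Called with names such as `params.yml` and a hypothetical `samplesheet.csv`, rerunning init would then be safe: files already edited by the user are left untouched.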


2 participants