-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MRG: support prefix csv input for manysketch
#243
Conversation
random thought - maybe a good check would be to make sure that files aren't being used in multiple sketches? |
that's definitely doable! A challenge I'm having right now is whether to just enable What would be your desired level of flexibility? I felt like I needed an |
65ec686
to
b0ebe30
Compare
Unless you have a specific use case for it, maybe go with simpler for now? I feel like the additional complexity of "unimagined future use case" rarely ends up working out. Learned that lesson on the last big software project 😆 . |
to do this, I could store an extra HashMap of input FASTA and make sure we never see the same file twice. Think it's worth it? |
I think so? But, maybe better than erroring out - just report the number of files that were consumed more than once to stderr, and exit -1 unless -f is used (so that workflows don't pass this step automatically). |
@ctb ready for review |
manysketch
manysketch
hmm - I'm not sure i understand the different between erroring out and just exiting -1? I currently only Err out if force is not set. |
You could complete building the sketches, output them, and still exit with -1 at that point. That would prevent workflow steps from succeeding unless -1 was set, but still allow users to run steps successfully, albeit with a warning. On reflection, I think this is a not that smart idea tho, and I'm fine with the code as you've implemented it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The functionality works really well, thank you!
I'm not sure we should be adding *
to the end of the prefix. I guess it makes sense (given that it's "prefix") but it catches everything in a directory, so if you have non-FASTA derivative files, those are picked up. For example, in my trial code, I had to exclude *.zip
so that it didn't pick up the individual signatures I'd already constructed.
I think for now, merge it as is, and if it causes trouble in real use cases we can revisit!
Nice work!
I didn't think about it before, but we could restrict to standard fasta/fastq extensions? i.e. fa, fq, fna, faa, fasta, fastq? |
And gz, bz2, etc. I feel like it makes things complicated (and then you also might want to let people specify their own which makes it even more complicated). Dunno ;). |
good point, keeping as-is for now :) |
Prefix-based sketching
#184 introduces a new input type that better supports metagenome reads, but doesn't really make things that much simpler for the power user. We can probably support prefix-style naming, as suggested in dib-lab/sourmash-slainte#11.
Here we introduce a 'prefix' CSV type with the following columns:
name,input_moltype,prefix,exclude
.Here we:
exclude
filesThis just uses
glob
, noregex
, so*
are fine inprefix
andexclude
, but not full regex patterns.