Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: support prefix csv input for manysketch #243

Merged
merged 26 commits into from
Mar 1, 2024
Merged

MRG: support prefix csv input for manysketch #243

merged 26 commits into from
Mar 1, 2024

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Feb 27, 2024

Prefix-based sketching

#184 introduces a new input type that better supports metagenome reads, but doesn't really make things that much simpler for the power user. We can probably support prefix-style naming, as suggested in dib-lab/sourmash-slainte#11.

Here we introduce a 'prefix' CSV type with the following columns:
name,input_moltype,prefix,exclude.

Here we:

  1. glob to find all files that match prefix
  2. glob to find all files that match exclude
  3. filter prefix files to exclude exclude files

This just uses glob, no regex, so * are fine in prefix and exclude, but not full regex patterns.

@ctb
Copy link
Collaborator

ctb commented Feb 27, 2024

random thought - maybe a good check would be to make sure that files aren't being used in multiple sketches?

@bluegenes
Copy link
Contributor Author

bluegenes commented Feb 27, 2024

random thought - maybe a good check would be to make sure that files aren't being used in multiple sketches?

that's definitely doable!

A challenge I'm having right now is whether to just enable glob or (better and more complicated) to do full regex pattern matching.

What would be your desired level of flexibility? I felt like I needed an exclude because I was trying to match test data, which are short.fa, short2.fa, short3.fa, and short-protein.fa and I wasn't sure how to just match the dna files using glob alone...

@bluegenes bluegenes changed the base branch from main to ms-mgx February 27, 2024 02:05
@ctb
Copy link
Collaborator

ctb commented Feb 27, 2024

A challenge I'm having right now is whether to just enable glob or (better and more complicated) to do full regex pattern matching.

Unless you have a specific use case for it, maybe go with simpler for now?

I feel like the additional complexity of "unimagined future use case" rarely ends up working out. Learned that lesson on the last big software project 😆 .

Base automatically changed from ms-mgx to main February 27, 2024 17:57
@bluegenes
Copy link
Contributor Author

random thought - maybe a good check would be to make sure that files aren't being used in multiple sketches?

to do this, I could store an extra HashMap of input FASTA and make sure we never see the same file twice. Think it's worth it?

@ctb
Copy link
Collaborator

ctb commented Feb 29, 2024

to do this, I could store an extra HashMap of input FASTA and make sure we never see the same file twice. Think it's worth it?

I think so?

But, maybe better than erroring out - just report the number of files that were consumed more than once to stderr, and exit -1 unless -f is used (so that workflows don't pass this step automatically).

@bluegenes
Copy link
Contributor Author

@ctb ready for review

@bluegenes bluegenes changed the title WIP: support prefix csv input for manysketch MRG: support prefix csv input for manysketch Feb 29, 2024
@bluegenes
Copy link
Contributor Author

But, maybe better than erroring out - just report the number of files that were consumed more than once to stderr, and exit -1 unless -f is used (so that workflows don't pass this step automatically).

hmm - I'm not sure i understand the different between erroring out and just exiting -1? I currently only Err out if force is not set.

@ctb
Copy link
Collaborator

ctb commented Feb 29, 2024

But, maybe better than erroring out - just report the number of files that were consumed more than once to stderr, and exit -1 unless -f is used (so that workflows don't pass this step automatically).

hmm - I'm not sure i understand the different between erroring out and just exiting -1? I currently only Err out if force is not set.

You could complete building the sketches, output them, and still exit with -1 at that point. That would prevent workflow steps from succeeding unless -1 was set, but still allow users to run steps successfully, albeit with a warning.

On reflection, I think this is a not that smart idea tho, and I'm fine with the code as you've implemented it!

Copy link
Collaborator

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functionality works really well, thank you!

I'm not sure we should be adding * to the end of the prefix. I guess it makes sense (given that it's "prefix") but it catches everything in a directory, so if you have non-FASTA derivative files, those are picked up. For example, in my trial code, I had to exclude *.zip so that it didn't pick up the individual signatures I'd already constructed.

I think for now, merge it as is, and if it causes trouble in real use cases we can revisit!

Nice work!

@bluegenes
Copy link
Contributor Author

I'm not sure we should be adding * to the end of the prefix. I guess it makes sense (given that it's "prefix") but it catches everything in a directory, so if you have non-FASTA derivative files, those are picked up. For example, in my trial code, I had to exclude *.zip so that it didn't pick up the individual signatures I'd already constructed.

I didn't think about it before, but we could restrict to standard fasta/fastq extensions? i.e. fa, fq, fna, faa, fasta, fastq?

@ctb
Copy link
Collaborator

ctb commented Mar 1, 2024

I'm not sure we should be adding * to the end of the prefix. I guess it makes sense (given that it's "prefix") but it catches everything in a directory, so if you have non-FASTA derivative files, those are picked up. For example, in my trial code, I had to exclude *.zip so that it didn't pick up the individual signatures I'd already constructed.

I didn't think about it before, but we could restrict to standard fasta/fastq extensions? i.e. fa, fq, fna, faa, fasta, fastq?

And gz, bz2, etc. I feel like it makes things complicated (and then you also might want to let people specify their own which makes it even more complicated).

Dunno ;).

@bluegenes
Copy link
Contributor Author

I'm not sure we should be adding * to the end of the prefix. I guess it makes sense (given that it's "prefix") but it catches everything in a directory, so if you have non-FASTA derivative files, those are picked up. For example, in my trial code, I had to exclude *.zip so that it didn't pick up the individual signatures I'd already constructed.

I didn't think about it before, but we could restrict to standard fasta/fastq extensions? i.e. fa, fq, fna, faa, fasta, fastq?

And gz, bz2, etc. I feel like it makes things complicated (and then you also might want to let people specify their own which makes it even more complicated).

Dunno ;).

good point, keeping as-is for now :)

@bluegenes bluegenes merged commit ef1d999 into main Mar 1, 2024
1 check passed
@bluegenes bluegenes deleted the prefix-csv branch March 1, 2024 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants