
Dp24 exclude file #48

Merged 4 commits on Sep 12, 2024
132 changes: 65 additions & 67 deletions README.md
@@ -15,113 +15,111 @@ Collaborators and contributors:

This is a re-write of the current fasta manipulation scripts I've written whilst at ToL, as well as adding some functionality needed for future projects.

Currently, this program has the following arguments.

## yaml_validator (v2)

THIS FUNCTION IS SPECIFIC TO THE TREEVAL.yaml FILE

Updated for the new yaml style and now uses struct methods.

This validates a given yaml against the TreeVal yaml standard. This is specific to the TreeVal pipeline.
This command will go through the yaml and validate file and directory paths, as well as check that files are in the expected format.

This has been tested by downloading the TreeValTinyTest data set:

```bash
curl https://tolit.cog.sanger.ac.uk/test-data/resources/treeval/TreeValTinyData.tar.gz | tar xzf -
```

`validateyaml ${PATH TO YAML}`

### TODO:

- Add CRAM validator to the module
  - Find an equivalent to `samtools quickcheck -vvv` for a report on the completeness of the CRAM.
  - If not, then it will be a secondary process (external to FasMan).
- Better report
  - The report should run to completion and, if there are failures, panic! or `std::process::exit` with "FAILED DUE TO: ...". This is so that it can be added to the Nextflow pipelines and cause them to error out at the right place, i.e. not rely on scanning the report.log through functions in NF.
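The exit behaviour described in the last TODO item can be sketched as below. This is a hypothetical illustration, not the tool's implementation; note that `std::process::exit` takes an integer status, so the failure text would go to stderr.

```rust
// Hypothetical sketch: collect failures during validation, print a full
// report, then hand back a non-zero status so a Nextflow pipeline can
// fail at this step instead of scanning report.log.
fn report_outcome(failures: &[String]) -> i32 {
    if failures.is_empty() {
        println!("VALIDATION PASSED");
        0
    } else {
        eprintln!("FAILED DUE TO: {}", failures.join(", "));
        1
    }
}

fn main() {
    let failures: Vec<String> = vec!["input.fasta: missing".to_string()];
    let status = report_outcome(&failures);
    // In the real tool this status would be passed to std::process::exit.
    println!("exit status would be {status}");
}
```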

## map_headers

This command generates a mapping file of a given fasta file's headers to new names, standardising headers to a small form factor with no special characters (by default this is 'FMM'). The fasta file is then copied with the new mapped headers in place. The output directory must already exist.

`mapheaders --fasta-file ${PATH TO FASTA} --output-directory ${OUTPUT LOCATION} --replace-with ${STRING FOR NEW HEADER}`

## remap_headers

This complements the above function by using the generated map file to restore the original headers.

## split_by_count

This command will generate a directory of files, each made up of a user-given number of sequences from the input fasta. This is useful when generating geneset data for TreeVal use or sub-setting data in a non-random manner.
The count is an upper limit; the final file will contain whatever records are left over.

This will generate files in `{outdir}/{fasta-file.prefix}/{data_type}/{input_file_prefix}_f{file_count}_c{requested_chunk_count}-a{actual_chunk_count}.fa`

`splitbycount --fasta-file ${PATH TO FASTA} --output-directory ${OUTPUT LOCATION} --count {NUMBER OF FASTA RECORDS PER FILE} --data_type ['pep','cdna', 'cds', 'rna', 'other']`
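The output-name template above can be illustrated with a small sketch; the field values and the helper function here are hypothetical, since the real program derives them from the input file.

```rust
// Hypothetical illustration of the split_by_count naming template:
// {outdir}/{prefix}/{data_type}/{prefix}_f{file_count}_c{requested}-a{actual}.fa
fn chunk_path(
    outdir: &str,
    prefix: &str,
    data_type: &str,
    file_count: u32,
    requested: u32,
    actual: u32,
) -> String {
    format!("{outdir}/{prefix}/{data_type}/{prefix}_f{file_count}_c{requested}-a{actual}.fa")
}

fn main() {
    // e.g. first chunk of a peptide set split into batches of 100 records
    println!("{}", chunk_path("out", "AMel1", "pep", 1, 100, 100));
}
```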

## split_by_size

This command will generate a directory of files of a user-given size (in MB) from the input fasta. This is useful for consistent file sizes in geneset alignments.
The mem-size is approximate: some records may exceed the chosen size, and, inversely, there will be a final file collecting the small sequences which do not reach the limit.

`splitbysize --fasta-file ${PATH TO FASTA} --output-directory ${OUTPUT LOCATION} --mem-size ${SIZE OF OUTPUT FILES IN Mega Bytes}`
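The packing behaviour described above can be sketched as a greedy pass over record sizes. This is an assumed illustration of the behaviour, not the program's actual algorithm.

```rust
// Greedy sketch: pack record sizes (in bytes) into chunks of at most
// `limit` bytes. An oversize record still gets a chunk of its own, and
// the final chunk collects whatever small records are left over.
fn plan_chunks(sizes: &[usize], limit: usize) -> Vec<Vec<usize>> {
    let mut chunks: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut total = 0;
    for &size in sizes {
        // Start a new chunk when this record would overflow the limit.
        if total + size > limit && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
            total = 0;
        }
        current.push(size);
        total += size;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    println!("{:?}", plan_chunks(&[7, 2, 2, 9, 1], 8));
}
```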

## generate_csv

THIS IS SPECIFIC TO TREEVAL AND THE STRUCTURE OF THE GENESET DATA IN USE FOR IT

This function generates CSV files summarising the contents of a directory structure like that shown below, and saves them in the csv_data directory:

```
geneset_data_dir
|
insect
    |
    csv_data
    |    |
    |    ApisMellifera.AMel1-data.csv
    |
    ApisMellifera
        |
        ApisMellifera.AMel1
            |
            {pep, cdna, cds, rna}
                |
                split.fasta files
```

## curate

Use a tpf and a fasta file to generate a curated fasta file.

`curate --fasta input.fasta --tpf input.tpf --output curated.fasta`

## filterfasta

Given a comma-separated list, create a new fasta file with the named sequences removed.

`filterfasta -f input.fasta { -l "SUPER_1,SUPER_2" | -c names.lst } -o output.fasta`

## mergehaps (NOT YET WRITTEN)

Given two fasta files, generate one merged fasta file and rename the scaffolds.

`mergehaps -p primary.fasta -s secondary.fasta -n PRI/HAP -o merged.fasta`

## profile (IN PROGRESS)

Profile a given fasta file:

- GC percentage per scaffold + counts
- GC percentage whole genome
- N50 and N90
- L50
- GAP count and length (summary with average length)

`profile -f input.fasta -o outdir`

# Notes
If there are other options that would be useful to any other teams, leave a message or issue.
37 changes: 30 additions & 7 deletions src/exclude_seq.rs → src/filter_fasta.rs
@@ -1,8 +1,13 @@
pub mod filter_fasta_mod {
    use clap::ArgMatches;
    use noodles::fasta;
    use std::error::Error;
    use std::fs::File;
    use std::{
        fs,
        io::{BufRead, BufReader},
        path::Path,
    };

fn open_fasta<'a>(
exclusions: Vec<&str>,
@@ -27,11 +32,11 @@ pub mod exclude_seq_mod {
        let mut binding = fasta;
        for result in binding.records() {
            let record = result?;
            let record_name = std::str::from_utf8(record.name())?;
            if !exclusions.contains(&record_name) {
                writer.write_record(&record)?;
            } else {
                println!("Found record to exclude: {:?}", &record_name);
            }
        }
        Ok("Removed Exclusionary List")
@@ -40,11 +45,29 @@ pub mod exclude_seq_mod {
}
}

    fn lines_from_file(filename: impl AsRef<Path>) -> Vec<String> {
        let file = File::open(filename).expect("no such file");
        let buf = BufReader::new(file);
        buf.lines()
            .map(|l| l.expect("Could not parse line"))
            .collect()
    }

    pub fn filter_fasta(arguments: std::option::Option<&ArgMatches>) {
        let fasta = arguments.unwrap().get_one::<String>("fasta").unwrap();
        let outfile = arguments.unwrap().get_one::<String>("output").unwrap();
        let exclude_list = arguments.unwrap().get_one::<String>("filter_list").unwrap();
        let exclude_file = arguments.unwrap().get_one::<String>("filter_file").unwrap();

        if exclude_file != "None" {
            let exclusion_list = lines_from_file(exclude_file);
            let exclusion_list_parsed = exclusion_list.iter().map(|p| p.as_str()).collect();
            let _x = open_fasta(exclusion_list_parsed, fasta, outfile);
        }

        if exclude_list != "None" {
            let list_to_exclude = exclude_list.split(',').collect::<Vec<&str>>();
            let _y = open_fasta(list_to_exclude, fasta, outfile);
        }
    }
}
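The core exclusion check inside `open_fasta` can be exercised in isolation with a small sketch, written against plain strings rather than the noodles record type (the helper name here is hypothetical).

```rust
// Sketch of the record-name check used by filterfasta, decoupled from
// the noodles fasta reader: keep a record unless its name is excluded.
fn keep_record(name: &str, exclusions: &[&str]) -> bool {
    !exclusions.contains(&name)
}

fn main() {
    // Mirrors the comma-separated -l argument parsing.
    let exclusions: Vec<&str> = "SUPER_1,SUPER_2".split(',').collect();
    println!("keep SUPER_3? {}", keep_record("SUPER_3", &exclusions));
    println!("keep SUPER_1? {}", keep_record("SUPER_1", &exclusions));
}
```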
14 changes: 11 additions & 3 deletions src/main.rs
@@ -26,8 +26,8 @@ mod generics;
mod tpf_fasta;
use crate::tpf_fasta::tpf_fasta_mod::curate_fasta;

mod filter_fasta;
**Contributor comment:** Just wondering why we need to do this? Is this standard in Rust? We have had to do it in a few places when adding tests and not entirely sure why.

use crate::filter_fasta::filter_fasta_mod::filter_fasta;

fn main() -> Result<(), Error> {
let split_options = ["pep", "cds", "cdna", "rna", "other"];
@@ -272,7 +272,6 @@ fn main() -> Result<(), Error> {
            .about("Filter a given list of sequences from fasta file")
            .arg(
                Arg::new("fasta")
                    .required(true)
                    .help("A fasta file for processing")
            )
@@ -285,8 +284,17 @@ fn main() -> Result<(), Error> {
            .arg(
                Arg::new("filter_list")
                    .short('l')
                    .required(false)
                    .default_value("None")
                    .help("A string comma-separated list of sequence names to exclude from the final fasta")
            )
            .arg(
                Arg::new("filter_file")
                    .short('f')
                    .required(false)
                    .default_value("None")
                    .help("A txt file (such as names.lst) with a sequence header per line to exclude from a final fasta file")
            )
    )
.subcommand(
Command::new("mergehaps")
6 changes: 6 additions & 0 deletions test_data/filterfasta/FiilteredFasta.fa
@@ -0,0 +1,6 @@
>SCAFFOLD_3
AGTGTATTTTTATGCA
>SCAFFOLD_3
AGTGTATTTTTATGCA
>SCAFFOLD_3
AGTGTATTTTTATGCA
1 change: 1 addition & 0 deletions test_data/filterfasta/names.lst
@@ -0,0 +1 @@
SCAFFOLD_1
File renamed without changes.