version 0.2.0
fmaccha committed Jan 30, 2024
1 parent 07c1e0e commit ea6c768
Showing 14 changed files with 435 additions and 295 deletions.
4 changes: 3 additions & 1 deletion Cargo.toml
@@ -1,8 +1,10 @@
[package]
name = "tataki"
authors = ["Tazro Ohta ([email protected])"]
version = "0.2.0"
edition = "2021"
license = "Apache-2.0"
repository = "https://github.com/sapporo-wes/tataki"
license = "apache-2.0"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

152 changes: 128 additions & 24 deletions README.md
@@ -1,44 +1,148 @@
# Tataki

**This repository is currently under development.**
Tataki is a command-line tool designed primarily for detecting file formats in the bio-science field with the following features:

Tataki is a command line tool for detecting life science data types.

Currently supports the following file types.

- bam
- fasta
- fastq
- fastq.gz
- bed
- Supports various **file formats mainly used in bio-science**
- bam
- bcf
- bed
- cram
- fasta
- fastq
- gff3
- gtf
- sam
- vcf
- more formats will be added in the future
- Allows for the invocation of a [**CWL document**](https://www.commonwl.org/) and enables users to define their own complex criteria for detection.
- Can target both local files and remote URLs
- Compatible with [EDAM ontology](https://edamontology.org/page)

## Installation

A single binary is available (supports Linux only):
A single binary is available for Linux x86_64.

```shell
curl -fsSL -O https://github.com/suecharo/tataki/releases/download/0.1.0/tataki
curl -fsSL -O https://github.com/sapporo-wes/tataki/releases/latest/download/tataki
chmod +x ./tataki
./tataki -h
./tataki -V
```

Or, you could clone the repository, then run `cargo build`.

## Example
## Usage

Specify the paths of the files as arguments to `tataki`. Both local file paths and remote URLs are supported.

```shell
tataki <FILE|URL>...
```

For more details:

```shell
$ tataki --help
Usage: tataki [OPTIONS] [FILE|URL]...

Arguments:
[FILE|URL]... Path to the file

Options:
-o, --output <FILE> Path to the output file [default: stdout]
-f <OUTPUT_FORMAT> [default: csv] [possible values: yaml, tsv, csv, json]
--cache-dir <DIR> Specify the directory in which to create a temporary directory. If this option is not provided, a temporary directory will be created in the default system temporary directory (/tmp)
-c, --conf <FILE> Specify the tataki configuration file. If this option is not provided, the default configuration will be used. The option `--dry-run` shows the default configuration file
--dry-run Output the configuration file in yaml format and exit the program. If `--conf` option is not provided, the default configuration file will be shown
-v, --verbose Sets the level of verbosity
-q, --quiet Suppress all log messages
-h, --help Print help
-V, --version Print version

```txt
$ tataki bed12.bed
bed12.bed: 12 column BED file
$ tataki fastq01.fq.gz
fastq01.fq.gz: gzip compressed fastq file
Version: 0.2.0
```
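
The `-f` option in the help text accepts `yaml`, `tsv`, `csv`, or `json` and defaults to `csv`. The following is a minimal, std-only sketch of that behavior. The actual code derives this from clap's `value_enum`; the `FromStr` impl and the `resolve_format` helper here are hypothetical stand-ins, not tataki's implementation.

```rust
use std::str::FromStr;

/// Output formats listed in the help text.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum OutputFormat {
    Yaml,
    Tsv,
    /// The documented default for `-f`.
    #[default]
    Csv,
    Json,
}

impl FromStr for OutputFormat {
    type Err = String;

    // Assumption: values are matched exactly as spelled in the help text.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "yaml" => Ok(OutputFormat::Yaml),
            "tsv" => Ok(OutputFormat::Tsv),
            "csv" => Ok(OutputFormat::Csv),
            "json" => Ok(OutputFormat::Json),
            other => Err(format!(
                "invalid value '{other}' [possible values: yaml, tsv, csv, json]"
            )),
        }
    }
}

/// Hypothetical helper: resolve an optional `-f` value, falling back to the default.
pub fn resolve_format(arg: Option<&str>) -> Result<OutputFormat, String> {
    match arg {
        Some(value) => value.parse(),
        None => Ok(OutputFormat::default()),
    }
}

fn main() {
    println!("{:?}", resolve_format(None));
    println!("{:?}", resolve_format(Some("json")));
    // `edam` was removed as an output format in this commit, so it is rejected here.
    println!("{:?}", resolve_format(Some("edam")));
}
```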

## Todo
### Determining Formats in Your Preferred Order

The `-c|--conf=<FILE>` option lets you change the order in which formats are tried, or restrict detection to a subset of formats.

The configuration file is in YAML format. Please refer to the default configuration file for the schema.

```yaml
order:
- bam
- bcf
- bed
- cram
- fasta
- fastq
- gff3
- gtf
- sam
- vcf
```
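
To see why the order can matter, here is a rough sketch in which the first matching format wins. The "first match wins" rule, the `detect` function, and the toy checks are all assumptions made for illustration; tataki's real detectors parse the files properly.

```rust
/// A hypothetical detector: returns true when the content looks like its format.
type Detector = fn(&[u8]) -> bool;

/// Toy check (assumption): FASTA records start with '>'.
fn looks_like_fasta(content: &[u8]) -> bool {
    content.first() == Some(&b'>')
}

/// Toy check (assumption): FASTQ records start with '@'.
fn looks_like_fastq(content: &[u8]) -> bool {
    content.first() == Some(&b'@')
}

/// Toy check (assumption): a SAM file starts with an `@HD` or `@SQ` header line.
fn looks_like_sam(content: &[u8]) -> bool {
    content.starts_with(b"@HD\t") || content.starts_with(b"@SQ\t")
}

/// Map a name from the `order` list to its toy detector.
fn builtin_detector(name: &str) -> Option<Detector> {
    match name {
        "fasta" => Some(looks_like_fasta as Detector),
        "fastq" => Some(looks_like_fastq as Detector),
        "sam" => Some(looks_like_sam as Detector),
        _ => None,
    }
}

/// Try each format in the configured order and return the first that matches.
fn detect<'a>(order: &[&'a str], content: &[u8]) -> Option<&'a str> {
    order
        .iter()
        .copied()
        .find(|name| builtin_detector(name).map_or(false, |check| check(content)))
}

fn main() {
    let sam_header = b"@HD\tVN:1.6\tSO:coordinate\n";
    // Both toy checks accept this input, so the configured order decides.
    println!("{:?}", detect(&["sam", "fastq"], sam_header));
    println!("{:?}", detect(&["fastq", "sam"], sam_header));
}
```

With loose checks like these, a SAM header is reported as `fastq` when `fastq` is listed first, which is the kind of ambiguity a custom order lets you resolve.
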
### Executing a CWL Document with External Extension Mode
Tataki can also be used to execute a CWL document with external extension mode. This is useful when determining file formats that are not supported in pre-built mode or when you want to perform complex detections.
This mode depends on Docker, so make sure `docker` is in your PATH.
- add support for more file types, such as .sam, .vcf, .gtf, etc.
- add support for EDAM ontology.
- implement fast mode with which the tool could perform well on larger files.
Here are the steps to execute a CWL document with external extension mode.
1. Prepare a CWL document
2. Specify the CWL document in the configuration file
3. Execute `tataki`.
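
Before running these steps, it can help to confirm that `docker` is actually reachable. The sketch below searches a PATH-style value for a program using only the standard library. `find_in_path` is a hypothetical helper written for this example, not part of tataki, and it only checks that a file exists, not that it is executable.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Hypothetical helper: search the directories of a PATH-style value for `program`.
/// Only checks that a regular file exists; it does not inspect the executable bit.
fn find_in_path(program: &str, path_var: &OsStr) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(program))
        .find(|candidate| candidate.is_file())
}

fn main() {
    let path_var = env::var_os("PATH").unwrap_or_default();
    match find_in_path("docker", &path_var) {
        Some(path) => println!("docker found at {}", path.display()),
        None => eprintln!("docker was not found in PATH; external extension mode needs it"),
    }
}
```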

#### Preparation of CWL Document

The CWL document must be prepared in advance. The following is an example of a CWL document that executes `samtools head`.

`edam_id` and `label` are the two required fields for the CWL document. Both must carry the `tataki` prefix declared in the `$namespaces` section of the document.

```cwl
cwlVersion: v1.2
class: CommandLineTool
requirements:
DockerRequirement:
dockerPull: quay.io/biocontainers/samtools:1.18--h50ea8bc_1
InlineJavascriptRequirement: {}
baseCommand: [samtools, head]
successCodes: [0, 139]
inputs:
input_file:
type: File
inputBinding:
position: 1
outputs: {}
$namespaces:
tataki: https://github.com/sapporo-wes/tataki
tataki:edam_id: http://edamontology.org/format_2573
tataki:label: SAM
```

#### Configuration File

Insert a path to the CWL document in [the configuration file](#determining-formats-in-your-preferred-order). The example below runs the CWL document first, followed by SAM and BAM format detection.

```yaml
order:
- ./path/to/cwl_document.cwl
- sam
- bam
```
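
The `order` list above mixes a CWL path with built-in format names. How tataki tells them apart is not shown in this diff. The sketch below assumes, purely for illustration, that any entry with a `.cwl` extension is a CWL document and everything else is a built-in name. `OrderEntry` and `classify` are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical classification of one entry in the `order` list.
#[derive(Debug, PartialEq, Eq)]
enum OrderEntry {
    /// A built-in format name such as `sam` or `bam`.
    Builtin(String),
    /// A path to a CWL document to run in external extension mode.
    CwlDocument(PathBuf),
}

/// Assumption: an entry ending in `.cwl` is a CWL document, anything else is built in.
fn classify(entry: &str) -> OrderEntry {
    let is_cwl = Path::new(entry)
        .extension()
        .map_or(false, |ext| ext == "cwl");
    if is_cwl {
        OrderEntry::CwlDocument(PathBuf::from(entry))
    } else {
        OrderEntry::Builtin(entry.to_string())
    }
}

fn main() {
    for entry in ["./path/to/cwl_document.cwl", "sam", "bam"] {
        println!("{entry} -> {:?}", classify(entry));
    }
}
```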

## License

[Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). See the [LICENSE](https://github.com/suecharo/tataki/blob/main/LICENSE).
The contents of this deposit are licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). See the [LICENSE](https://github.com/sapporo-wes/tataki/blob/main/LICENSE).
However, the following files are licensed under Creative Commons Attribution Share Alike 4.0 International (<https://spdx.org/licenses/CC-BY-SA-4.0.html>).

- ./src/EDAM_1.25.id_label.csv
- Source: <https://github.com/edamontology/edamontology/releases/download/1.25/EDAM_1.25.csv>
- Removed the lines not related to 'format' and the columns other than 'Preferred Label' and 'Class ID'
25 changes: 25 additions & 0 deletions cwl/sam_head.cwl
@@ -0,0 +1,25 @@
cwlVersion: v1.2
class: CommandLineTool

requirements:
DockerRequirement:
dockerPull: quay.io/biocontainers/samtools:1.18--h50ea8bc_1
InlineJavascriptRequirement: {}

baseCommand: [samtools, head]

successCodes: [0, 139]

inputs:
input_file:
type: File
inputBinding:
position: 1

outputs: {}

$namespaces:
tataki: https://github.com/sapporo-wes/tataki

tataki:edam_id: http://edamontology.org/format_2573
tataki:label: SAM
17 changes: 7 additions & 10 deletions src/args.rs
@@ -6,43 +6,40 @@ pub enum OutputFormat {
Yaml,
Tsv,
Csv,
// Output only Edam label.
Edam,
Json,
}

#[derive(Parser, Debug)]
#[clap(
name = env!("CARGO_PKG_NAME"),
about = env!("CARGO_PKG_DESCRIPTION"),
version = env!("CARGO_PKG_VERSION"),
after_help = "",
after_help = concat!("Version: ", env!("CARGO_PKG_VERSION")),
arg_required_else_help = true,
)]

pub struct Args {
/// Path to the file
#[clap(name = "FILE", required_unless_present = "dry_run")]
// pub input: Option<String>,
#[clap(name = "FILE|URL", required_unless_present = "dry_run")]
pub input: Vec<String>,

/// Path to the output file [default: stdout]
#[clap(short, long, value_name = "FILE")]
pub output: Option<PathBuf>,

#[clap(short = 'f', value_enum, default_value = "edam",conflicts_with_all = ["yaml"])]
#[clap(short = 'f', value_enum, default_value = "csv",conflicts_with_all = ["yaml"])]
output_format: OutputFormat,

#[clap(long, hide = true)]
yaml: bool,

// TODO: this is not implemented yet.
/// Specify the directory in which to create a temporary directory. If this option is not provided, a temporary directory will be created in the default system temporary directory (/tmp).
#[clap(long, value_name = "DIR")]
pub cache_dir: Option<PathBuf>,

#[clap(long)]
pub full_fetch: bool,

// TODO
// #[clap(long, hide = true)]
// pub full_fetch: bool,
/// Specify the tataki configuration file. If this option is not provided, the default configuration will be used.
/// The option `--dry-run` shows the default configuration file.
#[clap(short, long, value_name = "FILE")]
15 changes: 4 additions & 11 deletions src/edam.rs
@@ -1,11 +1,7 @@
use anyhow::Result;
use bimap::BiMap;
use lazy_static::lazy_static;
use log::warn;
use serde::{Deserialize, Serialize};
use std::{collections::HashMap, f32::consts::E};

use crate::OutputFormat;
use serde::Deserialize;

lazy_static! {
#[derive(Debug)]
@@ -15,9 +11,7 @@ lazy_static! {
#[derive(Debug)]
// A struct to validate user specified EDAM information.
pub struct EdamMap {
// TODO: this was replaced with a BiMap, so remove it later
// Map of EDAM ID and Edam struct instance whose id is the key.
// label_to_edam: HashMap<String, Edam>,
// A bimap of EDAM ID and EDAM label.
bimap_id_label: BiMap<String, String>,
}

@@ -28,10 +22,9 @@ impl EdamMap {
.has_headers(true)
.from_reader(&edam_str[..]);

let mut edam_map: HashMap<String, Edam> = HashMap::new();
let mut bimap = BiMap::new();
for result in rdr.deserialize::<Edam>() {
// panic when result is Err
// panic if this fails to read EDAM table.
match result {
Ok(record) => {
// edam_map.insert(record.label.clone(), record.clone());
@@ -54,7 +47,7 @@ impl EdamMap {
}

// check if the given pair of id and label exists in the EDAM table.
pub fn check_id_and_label(&self, id: &str, label: &str) -> Result<bool> {
pub fn correspondence_check_id_and_label(&self, id: &str, label: &str) -> Result<bool> {
let label_from_bimap = self.bimap_id_label.get_by_left(id);

match label_from_bimap {
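
The `src/edam.rs` hunks replace the `HashMap` with a `BiMap<String, String>` and check that an ID and a label belong together. The sketch below shows that idea with two `HashMap`s as a std-only stand-in for `bimap::BiMap`. The `from_pairs` and `id_for_label` names are hypothetical, and unlike the real `correspondence_check_id_and_label`, which returns `Result<bool>`, this version returns a plain `bool`. The SAM pair comes from the CWL example in this commit; the BAM pair is the corresponding EDAM entry.

```rust
use std::collections::HashMap;

/// Std-only stand-in for the `BiMap<String, String>` of EDAM ID and label.
struct EdamMap {
    id_to_label: HashMap<String, String>,
    label_to_id: HashMap<String, String>,
}

impl EdamMap {
    /// Hypothetical constructor: build both directions from (id, label) pairs.
    fn from_pairs(pairs: &[(&str, &str)]) -> Self {
        let mut id_to_label = HashMap::new();
        let mut label_to_id = HashMap::new();
        for (id, label) in pairs {
            id_to_label.insert(id.to_string(), label.to_string());
            label_to_id.insert(label.to_string(), id.to_string());
        }
        EdamMap { id_to_label, label_to_id }
    }

    /// Check whether the given pair of id and label exists in the table.
    fn correspondence_check_id_and_label(&self, id: &str, label: &str) -> bool {
        self.id_to_label
            .get(id)
            .map_or(false, |known_label| known_label == label)
    }

    /// Reverse lookup, the direction a one-way map would not give for free.
    fn id_for_label(&self, label: &str) -> Option<&str> {
        self.label_to_id.get(label).map(String::as_str)
    }
}

fn main() {
    let map = EdamMap::from_pairs(&[
        ("http://edamontology.org/format_2573", "SAM"),
        ("http://edamontology.org/format_2572", "BAM"),
    ]);
    let sam_id = "http://edamontology.org/format_2573";
    println!("{}", map.correspondence_check_id_and_label(sam_id, "SAM"));
    println!("{}", map.correspondence_check_id_and_label(sam_id, "BAM"));
    println!("{:?}", map.id_for_label("BAM"));
}
```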