
Commit

Adding the reproducibility scripts for RawHash2 and some more optimizations in default parameters
canfirtina committed Sep 11, 2023
1 parent a09971c commit 6bb4dda
Showing 66 changed files with 1,183 additions and 1,088 deletions.
4 changes: 2 additions & 2 deletions .gitignore
@@ -43,7 +43,7 @@ test/evaluation/read_mapping/*/comparison/comparison.out
test/evaluation/read_mapping/*/comparison/comparison.err

test/evaluation/*/*/slurm*
test/evaluation/*/slurm*
test/evaluation/*/*slurm*
test/evaluation/*/*/test_*
test/evaluation/*/*/comparison/*/
test/scripts/run_rawhash_test.sh
@@ -52,7 +52,7 @@ test/scripts/run_rawhash_test*.sh

test/evaluation/read_mapping/*/parameters.txt
test/evaluation/read_mapping/*/results.txt
test/evaluation/read_mapping/parameters*.txt
test/evaluation/read_mapping/*parameters*.txt

test/*.paf
test/eval/*.paf
61 changes: 31 additions & 30 deletions README.md
@@ -4,7 +4,7 @@

# Overview

RawHash is a hash-based mechanism to map raw nanopore signals to a reference genome in real-time. To achieve this, it 1) generates an index from the reference genome and 2) efficiently and accurately maps the raw signals to the reference genome such that it can match the throughput of nanopore sequencing even when analyzing large genomes (e.g., human genome.
RawHash (and RawHash2) is a hash-based mechanism to map raw nanopore signals to a reference genome in real-time. To achieve this, it 1) generates an index from the reference genome and 2) efficiently and accurately maps the raw signals to the reference genome such that it can match the throughput of nanopore sequencing even when analyzing large genomes (e.g., the human genome).

The figure below shows an overview of the steps that RawHash takes to find matching regions between a reference genome and a raw nanopore signal.

@@ -16,11 +16,11 @@ To efficiently identify similarities between a reference genome and reads, RawHa

RawHash can be used to map reads from **FAST5, POD5, SLOW5, or BLOW5** files to a reference genome in sequence format.

RawHash performs real-time mapping of nanopore raw signals. When the prefix of reads in FAST5 or POD5 file can be mapped to a reference genome, RawHash will stop mapping and provide the mapping information in PAF format. We follow the similar PAF template used in [UNCALLED](https://github.com/skovaka/UNCALLED) and [Sigmap](https://github.com/haowenz/sigmap) to report the mapping information.
RawHash performs real-time mapping of raw nanopore signals. As soon as the prefix of a read can be mapped to a reference genome, RawHash stops mapping that read and reports the mapping information in PAF format. We follow a PAF template similar to the one used in [UNCALLED](https://github.com/skovaka/UNCALLED) and [Sigmap](https://github.com/haowenz/sigmap) to report the mapping information.

# Recent changes

* RawHash now supports **POD5** files. RawHash will automatically detect the POD5 files from the file prefix (i.e., ".pod5"). Note: This feature is tested only on the Linux systems.
* We have released RawHash2, a more sensitive and faster raw signal mapping mechanism with substantial improvements over RawHash. RawHash2 is available within this repository. You can still use the earlier version, RawHash v1, from [this release](https://github.com/CMU-SAFARI/RawHash/releases/tag/v1.0).

* It is now possible to disable compiling HDF5, SLOW5, and POD5. Please check the `Compiling with HDF5, SLOW5, and POD5` section below for details.

@@ -29,22 +29,22 @@ RawHash performs real-time mapping of nanopore raw signals. When the prefix of r
* Clone the code from its GitHub repository (`--recursive` must be used):

```bash
git clone --recursive https://github.com/CMU-SAFARI/RawHash.git rawhash
git clone --recursive https://github.com/CMU-SAFARI/RawHash.git rawhash2
```

* Compile (Make sure you have a C++ compiler and GNU make):

```bash
cd rawhash && make
cd rawhash2 && make
```

If the compilation is successful, the binary will be in `bin/rawhash`.
If the compilation is successful, the path to the binary will be `bin/rawhash2`.

## Compiling with HDF5, SLOW5, and POD5

We are aware that some of the pre-compiled libraries (e.g., POD5) may not work in your system and you may need to compile these libraries from scratch. Additionally, it may be possible that you may not want to compile any of the HDF5, SLOW5, or POD5 libraries if you are not going to use them. RawHash provides a flexible Makefile to enable custom compilation of these libraries.
We are aware that some of the pre-compiled libraries (e.g., POD5) may not work on your system, so you may need to compile these libraries from scratch. Additionally, you may not want to compile any of the HDF5, SLOW5, or POD5 libraries at all if you are not going to use them. RawHash2 provides a flexible Makefile to enable custom compilation of these libraries.

* It is possible to provide your own include and lib directories for *any* of the HDF5, SLOW5, and POD5 libraries, if you do not want to use the source code or the pre-compiled binaries that come with RawHash. To use your own include and lib directories you should pass them to `make` when compiling as follows:
* It is possible to provide your own include and lib directories for *any* of the HDF5, SLOW5, and POD5 libraries if you do not want to use the source code or the pre-compiled binaries that come with RawHash2. To use your own include and lib directories, pass them to `make` when compiling as follows:

```bash
#Provide the path to all of the HDF5/SLOW5/POD5 include and lib directories during compilation
@@ -70,10 +70,10 @@ make NOSLOW5=1 NOPOD5=1
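The exact `make` invocation for custom include/lib paths is collapsed in the hunk above. As a hedged sketch of the two options described in this section: the HDF5 path variable names below are hypothetical placeholders (check the Makefile for the names it actually expects), while `NOSLOW5=1 NOPOD5=1` comes from the README example visible in the hunk header.

```bash
# Hypothetical variable names -- illustrative only; consult the Makefile for
# the exact include/lib path variables it expects.
make HDF5_INCLUDE_DIR=/opt/hdf5/include HDF5_LIB_DIR=/opt/hdf5/lib

# Skip compiling the SLOW5 and POD5 libraries if you will not use those inputs
# (these flags appear in the README example referenced above).
make NOSLOW5=1 NOPOD5=1
```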

## Getting help

You can print the help message to learn how to use `rawhash`:
You can print the help message to learn how to use `rawhash2`:

```bash
rawhash
rawhash2
```

## Indexing
@@ -82,65 +82,66 @@ Indexing is similar to minimap2's usage. We additionally include the pore models
Below is an example that generates an index file `ref.ind` for the reference genome `ref.fasta` using a certain k-mer model located under `extern` and `32` threads.

```bash
rawhash -d ref.ind -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model -t 32 ref.fasta
rawhash2 -d ref.ind -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model -t 32 ref.fasta
```

Note that you can directly jump to mapping without creating the index because RawHash is able to generate the index relatively quickly on-the-fly within the mapping step. However, a real-time genome analysis application may still prefer generating the indexing before the mapping step. Thus, we suggest creating the index before the mapping step.
Note that you can jump directly to mapping without creating the index because RawHash2 can generate the index relatively quickly on the fly within the mapping step. However, a real-time genome analysis application may still prefer generating the index before the mapping step. Thus, we suggest creating the index beforehand.
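If you do skip the indexing step, the mapping command takes the reference FASTA directly. A minimal sketch, assuming the pore model then has to be supplied at mapping time with `-p` (as in the indexing command above; the exact requirement may differ):

```bash
# Hedged sketch: build the index on the fly by passing the reference FASTA
# instead of a prebuilt .ind file (paths and the need for -p are assumptions).
rawhash2 -t 32 -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model \
  ref.fasta test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
```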

## Mapping

It is possible to provide inputs as FAST5 files from multiple directories. It is also possible to provide a list of files matching a certain pattern such as `test/data/contamination/fast5_files/Min*.fast5`.

* Example usage where multiple files matching a certain the pattern `test/data/contamination/fast5_files/Min*.fast5` and fast5 files inside the `test/data/d1_sars-cov-2_r94/fast5_files` directory are inputted to rawhash using `32` threads and the previously generated `ref.ind` index:
* Example usage where multiple files matching the pattern `test/data/contamination/fast5_files/Min*.fast5` and fast5 files inside the `test/data/d1_sars-cov-2_r94/fast5_files` directory are provided to rawhash2 as input, using `32` threads and the previously generated `ref.ind` index:

```bash
rawhash -t 32 ref.ind test/data/contamination/fast5_files/Min*.fast5 test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
rawhash2 -t 32 ref.ind test/data/contamination/fast5_files/Min*.fast5 test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
```

* Another example usage where 1) we only input a directory including FAST5 files as a set of raw signals and 2) the output is directly saved to a file.

```bash
rawhash -t 32 -o mapping.paf ref.ind test/data/d1_sars-cov-2_r94/fast5_files
rawhash2 -t 32 -o mapping.paf ref.ind test/data/d1_sars-cov-2_r94/fast5_files
```

**IMPORTANT** if there are many fast5 files that rawhash needs to process (e.g., thousands of them), we suggest that you specify **only** the directories that contain these fast5 files
**IMPORTANT:** If there are many fast5 files that rawhash2 needs to process (e.g., thousands of them), we suggest that you specify **only** the directories that contain these fast5 files.

RawHash also provides a set of default parameters that can be preset automatically.
RawHash2 also provides a set of default parameters that can be preset automatically.

* Mapping reads to a viral reference genome using its corresponding preset:
* Mapping reads to a viral reference genome using its corresponding preset with a high-precision goal (as set by `--depletion`):

```
rawhash -t 32 -x viral ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
rawhash2 -t 32 -x viral --depletion ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
```

* Mapping reads to small reference genomes (<50M bases) using its corresponding preset:
* Mapping reads to small reference genomes (<500M bases) using its corresponding preset:

```
rawhash -t 32 -x sensitive ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
rawhash2 -t 32 -x sensitive ref.ind test/data/d4_green_algae_r94/fast5_files > mapping.paf
```

* Mapping reads to large reference genomes (>50M bases) using its corresponding preset:
* Mapping reads to large reference genomes (>500M bases) using its corresponding preset:

```
rawhash -t 32 -x fast ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
rawhash2 -t 32 -x fast ref.ind test/data/d5_human_na12878_r94/fast5_files > mapping.paf
```

* Although we have not thoroguhly evaluated, RawHash also provides another set of default parameters that can be used for very large metagenomic samples (>10G). To achieve efficient search, it uses the minimizer seeding in this parameter setting. This setting is not evaluated in our manuscript.
RawHash2 provides another set of default parameters that can be used for very large metagenomic samples (>10G bases). To achieve an efficient search, this parameter setting uses minimizer seeding, which is slightly less accurate than the non-minimizer mode but much faster (around 3X).

```
rawhash -t 32 -x faster ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
rawhash2 -t 32 -x faster ref.ind test/data/d5_human_na12878_r94/fast5_files > mapping.paf
```

The output will be saved to `mapping.paf` in a modified PAF format used by [Uncalled](https://github.com/skovaka/UNCALLED).
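Because the output follows the usual PAF column layout, standard text tools can be used to get a quick summary of a run. A small illustrative check, assuming the standard 12 mandatory PAF columns with `*` in the target-name column for reads that could not be mapped (as in UNCALLED's output):

```bash
# Hedged sketch: count mapped vs. unmapped reads in the PAF output.
# Column 6 is the target sequence name; '*' is assumed to mark unmapped reads.
awk '{ if ($6 == "*") un++; else m++ } END { print "mapped:", m+0, "unmapped:", un+0 }' mapping.paf
```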

## Potential issues you may encounter during mapping

It is possible that the reads in your fast5 files are compressed with the [VBZ compression](https://github.com/nanoporetech/vbz_compression) from Nanopore. In that case, you have to download the proper HDF5 plugin from [here](https://github.com/nanoporetech/vbz_compression/releases) and make sure it can be found by your HDF5 library:

```bash
export HDF5_PLUGIN_PATH=/path/to/hdf5/plugins/lib
```
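A quick, illustrative sanity check before launching rawhash2 is to confirm that the plugin shared library is actually present at that path (the exact file name varies by platform and release):

```bash
# Illustrative check: the plugin directory should contain the VBZ shared library.
ls "$HDF5_PLUGIN_PATH" | grep -i vbz
```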

If you have conda you can simply install the following package (`ont_vbz_hdf_plugin`) in your environment and use rawhash while the environment is active:
If you have conda, you can simply install the following package (`ont_vbz_hdf_plugin`) in your environment and use rawhash2 while the environment is active:

```bash
conda install ont_vbz_hdf_plugin
@@ -153,11 +154,11 @@ Please follow the instructions in the [README](test/README.md) file in [test](./

* Direct integration with MinKNOW.
* Ability to specify even/odd channels to eject the reads only from these specified channels.
* Please create issues if you want to see more features that can make RawHash easily integratable with nanopore sequencers for any use case.
* Please create issues if you want to see more features that can make RawHash2 easy to integrate with nanopore sequencers for any use case.

# Citing RawHash
# Citing RawHash and RawHash2

To cite RawHash, you can use the following BibTeX:
To cite RawHash and RawHash2, you can use the following BibTeX:

```bibtex
@article{firtina_rawhash_2023,
@@ -177,6 +178,6 @@ To cite RawHash, you can use the following BibTeX:

# Acknowledgement

RawHash uses [klib](https://github.com/attractivechaos/klib), some code snippets from [Minimap2](https://github.com/lh3/minimap2) (e.g., pipelining, hash table usage, DP and RMQ-based chaining) and [Sigmap](https://github.com/haowenz/sigmap) (e.g., R9.4 segmentation parameters).
RawHash2 uses [klib](https://github.com/attractivechaos/klib), some code snippets from [Minimap2](https://github.com/lh3/minimap2) (e.g., pipelining, hash table usage, DP and RMQ-based chaining) and the R9.4 segmentation parameters from [Sigmap](https://github.com/haowenz/sigmap).

We thank [Melina Soysal](https://github.com/melina2200) and [Marie-Louise Dugua](https://github.com/MarieSarahLouise) for their feedback to improve the RawHash implementation and test scripts.
2 changes: 1 addition & 1 deletion src/Makefile
@@ -103,7 +103,7 @@ endif

LIBS+=-lm -lz -ldl

PROG=rawhash
PROG=rawhash2

ifneq ($(aarch64),)
arm_neon=1
12 changes: 6 additions & 6 deletions src/hit.c
@@ -10,22 +10,22 @@
static inline void mm_cal_fuzzy_len(mm_reg1_t *r,
const mm128_t *a)
{
// uint32_t span_mask = (1U<<RI_HASH_SHIFT)-1;
uint32_t span_mask = (1U<<RI_HASH_SHIFT)-1;

int i;
r->mlen = r->blen = 0;
if (r->cnt <= 0) return;
// r->mlen = r->blen = (a[r->as].y>>RI_ID_SHIFT)&span_mask;
r->mlen = r->blen = (a[r->as].y>>RI_ID_SHIFT)&span_mask;
for (i = r->as + 1; i < r->as + r->cnt; ++i) {
// int span = (a[i].y>>RI_ID_SHIFT)&span_mask;
int span = (a[i].y>>RI_ID_SHIFT)&span_mask;
int tl = (int32_t)a[i].x - (int32_t)a[i-1].x;
int ql = (int32_t)a[i].y - (int32_t)a[i-1].y;
r->blen += tl > ql? tl : ql;
// r->mlen += tl > span && ql > span? span : tl < ql? tl : ql;
// r->mlen += tl < ql? tl : ql;
r->mlen += tl > span && ql > span? span : tl < ql? tl : ql;
r->mlen += tl < ql? tl : ql;
}

r->mlen = r->blen;
// r->mlen = r->blen;
}

/*
9 changes: 5 additions & 4 deletions src/lchain.c
@@ -110,6 +110,7 @@ uint64_t *mg_chain_backtrack(void *km,
// n_z = # of anchors with acceptable scores.
for (i = 0, n_z = 0; i < n; ++i) // precompute n_z
if(f[i] >= min_sc) ++n_z;

if(n_z == 0) return 0;

KMALLOC(km, z, n_z);
@@ -314,9 +315,9 @@ static inline int32_t compute_score(const mm128_t *ai,
// TODO: currently the span is only determined by "e" (number of events concatanated in a seed)
q_span = (aj->y>>RI_ID_SHIFT)&span_mask;

// Calculate the chaining score. Consider the the gap (dg) if it is larger than the span (q_span).
// Matching bases. Consider the the gap (dg) if it is smaller than the span (q_span).
sc = q_span < dg? q_span : dg;

// Integrating penalties to the score
if(dd || dg > q_span){
float lin_pen, log_pen;
@@ -542,7 +543,7 @@ static inline int32_t comput_sc_simple(const mm128_t *ai,
float lin_pen, log_pen;
lin_pen = chn_pen_gap * (float)dd + chn_pen_skip * (float)dg;
log_pen = dd >= 1? mg_log2(dd + 1) : 0.0f; // mg_log2() only works for dd>=2
// sc -= (int)(lin_pen + .5f * log_pen);
sc -= (int)(lin_pen + .5f * log_pen);
}
return sc;
}
@@ -692,7 +693,7 @@ mm128_t *mg_lchain_rmq(int max_dist,
}
}
// set max
assert(max_j < 0 || (a[max_j].x < a[i].x && (int32_t)a[max_j].y < (int32_t)a[i].y));
// assert(max_j < 0 || (a[max_j].x <= a[i].x && (int32_t)a[max_j].y <= (int32_t)a[i].y));

// Update the score and the predecessor of anchor i based on the found best predecessor max_j
f[i] = max_f, p[i] = max_j;