add RNAstarlign & ArchiveII datasets
Signed-off-by: Zhiyuan Chen <[email protected]>
1 parent 0b55920, commit 1141aab
Showing 11 changed files with 472 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
--- | ||
authors: | ||
- Zhiyuan Chen | ||
date: 2024-05-04 | ||
--- | ||
|
||
# ArchiveII | ||
|
||
--8<-- "multimolecule/datasets/archiveii/README.md:24:" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
--- | ||
authors: | ||
- Zhiyuan Chen | ||
date: 2024-05-04 | ||
--- | ||
|
||
# RNAStrAlign | ||
|
||
--8<-- "multimolecule/datasets/rnastralign/README.md:24:" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,108 @@ | ||
--- | ||
language: rna | ||
tags: | ||
- Biology | ||
- RNA | ||
license: | ||
- agpl-3.0 | ||
size_categories: | ||
- 1K<n<10K | ||
source_datasets: | ||
- multimolecule/bprna | ||
- multimolecule/pdb | ||
task_categories: | ||
- text-generation | ||
- fill-mask | ||
task_ids: | ||
- language-modeling | ||
- masked-language-modeling | ||
pretty_name: ArchiveII | ||
library_name: multimolecule | ||
--- | ||
|
||
# ArchiveII | ||
|
||
ArchiveII is a dataset of RNA sequences and their secondary structures, widely used in RNA secondary structure prediction benchmarks. | ||
|
||
ArchiveII contains 2975 RNA samples across 10 RNA families, with sequence lengths ranging from 28 to 2968 nucleotides. | ||
This dataset is frequently used to evaluate RNA secondary structure prediction methods, including those that handle both pseudoknotted and non-pseudoknotted structures. | ||
|
||
It is considered complementary to the [RNAStrAlign](./rnastralign) dataset. | ||
|
||
## Disclaimer | ||
|
||
This is an UNOFFICIAL release of the ArchiveII dataset by Mehdi Saman Booy et al. | ||
|
||
**The team releasing ArchiveII did not write this dataset card, so it has been written by the MultiMolecule team.** | ||
|
||
## Dataset Description | ||
|
||
- **Homepage**: https://multimolecule.danling.org/datasets/archiveii | ||
- **datasets**: https://huggingface.co/datasets/multimolecule/archiveii | ||
- **Point of Contact**: [Mehdi Saman Booy](mailto:[email protected]) | ||
|
||
## Example Entry | ||
|
||
| id | sequence | secondary_structure | family | | ||
| ------------------- | ----------------------------------- | ------------------------------------ | ---------- | | ||
| 16S_rRNA-A.fulgidus | AUUCUGGUUGAUCCUGCCAGAGGCCGCUGCUA... | ...(((((...(((.))))).((((((((((.... | 16S_rRNA | | ||
|
||
## Column Description | ||
|
||
- **id**: | ||
  A unique identifier for each RNA entry. This ID is derived from the family and the original `.ct` file name, and serves as a reference to the specific RNA structure within the dataset. | ||
|
||
- **sequence**: | ||
  The nucleotide sequence of the RNA molecule, represented using the standard RNA bases: | ||
|
||
  - **A**: Adenine | ||
  - **C**: Cytosine | ||
  - **G**: Guanine | ||
  - **U**: Uracil | ||
|
||
- **secondary_structure**: | ||
  The secondary structure of the RNA represented in dot-bracket notation, using two types of symbols to indicate base pairing and unpaired regions, as per bpRNA's standard: | ||
|
||
  - **Dots (`.`)**: Represent unpaired nucleotides. | ||
  - **Parentheses (`(` and `)`)**: Represent base pairs in standard stems (page 1). | ||
|
||
- **family**: | ||
  The RNA family to which the sequence belongs, such as 16S rRNA, 5S rRNA, etc. | ||
|
||
## Variations | ||
|
||
This dataset is available in three variants: | ||
|
||
- [archiveii](https://huggingface.co/datasets/multimolecule/archiveii): The main ArchiveII dataset. | ||
- [archiveii.512](https://huggingface.co/datasets/multimolecule/archiveii.512): ArchiveII dataset with sequences no longer than 512 nucleotides. | ||
- [archiveii.1024](https://huggingface.co/datasets/multimolecule/archiveii.1024): ArchiveII dataset with sequences no longer than 1024 nucleotides. | ||
|
||
## Related Datasets | ||
|
||
- [RNAStrAlign](https://huggingface.co/datasets/multimolecule/rnastralign): A database of RNA secondary structures with the same families as ArchiveII, usually used for training. | ||
- [bpRNA-spot](https://huggingface.co/datasets/multimolecule/bprna-spot): Another database commonly used in RNA secondary structure prediction. | ||
|
||
## License | ||
|
||
This dataset is licensed under the [AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html). | ||
|
||
```spdx | ||
SPDX-License-Identifier: AGPL-3.0-or-later | ||
``` | ||
|
||
## Citation | ||
|
||
```bibtex | ||
@article{samanbooy2022rna, | ||
author = {Saman Booy, Mehdi and Ilin, Alexander and Orponen, Pekka}, | ||
journal = {BMC Bioinformatics}, | ||
keywords = {Deep learning; Pseudoknotted structures; RNA structure prediction}, | ||
month = feb, | ||
number = 1, | ||
pages = {58}, | ||
publisher = {Springer Science and Business Media LLC}, | ||
title = {{RNA} secondary structure prediction with convolutional neural networks}, | ||
volume = 23, | ||
year = 2022 | ||
} | ||
``` |
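Reader's sketch (not part of this commit): the `secondary_structure` column described in the ArchiveII card above can be turned into explicit base pairs with a stack. The function name `dot_bracket_to_pairs` is hypothetical, and the sketch assumes the structure is nested (no crossing pairs), which may not hold for every record.

```python
from __future__ import annotations


def dot_bracket_to_pairs(structure: str) -> list[tuple[int, int]]:
    """Return sorted 0-based (i, j) pairs for a string of '.', '(' and ')'."""
    stack: list[int] = []  # positions of '(' that are still waiting for a partner
    pairs: list[tuple[int, int]] = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"Unmatched ')' at position {index}")
            pairs.append((stack.pop(), index))
        elif symbol != ".":
            raise ValueError(f"Unexpected symbol {symbol!r} at position {index}")
    if stack:
        raise ValueError(f"Unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Made-up toy structure, not a record from the dataset.
print(dot_bracket_to_pairs("((..))."))  # [(0, 5), (1, 4)]
```

A stack is sufficient here only because each `)` closes the most recent open `(`; structures with crossing pairs would need a different pair representation.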
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,105 @@ | ||
# MultiMolecule | ||
# Copyright (C) 2024-Present MultiMolecule | ||
|
||
# This program is free software: you can redistribute it and/or modify | ||
# it under the terms of the GNU Affero General Public License as published by | ||
# the Free Software Foundation, either version 3 of the License, or | ||
# any later version. | ||
|
||
# This program is distributed in the hope that it will be useful, | ||
# but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
# GNU Affero General Public License for more details. | ||
|
||
# You should have received a copy of the GNU Affero General Public License | ||
# along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
|
||
from __future__ import annotations | ||
|
||
import os | ||
from collections.abc import Mapping | ||
from pathlib import Path | ||
|
||
import torch | ||
from tqdm import tqdm | ||
|
||
from multimolecule.datasets.conversion_utils import ConvertConfig as ConvertConfig_ | ||
from multimolecule.datasets.conversion_utils import save_dataset | ||
|
||
torch.manual_seed(1016) | ||
|
||
|
||
def convert_ct(file) -> Mapping: | ||
    """Convert a connectivity table (.ct) file into a dataset record.""" | ||
    if not isinstance(file, Path): | ||
        file = Path(file) | ||
    with open(file) as f: | ||
        lines = f.readlines() | ||
|
||
    # The first header field is the number of bases; base i is described on line i. | ||
    first_line = lines[0].strip().split() | ||
    num_bases = int(first_line[0]) | ||
|
||
    sequence = [] | ||
    dot_bracket = ["."] * num_bases | ||
|
||
    for i in range(1, num_bases + 1): | ||
        line = lines[i].strip().split() | ||
        sequence.append(line[1])  # column 2: nucleotide | ||
        pair_index = int(line[4])  # column 5: 1-based index of the paired base, 0 if unpaired | ||
|
||
        if pair_index > 0: | ||
            # The partner must point back to this base. | ||
            if int(lines[pair_index].strip().split()[4]) != i: | ||
                raise ValueError( | ||
                    f"Invalid pairing at position {i}: pair_index {pair_index} does not point back correctly." | ||
                ) | ||
            # Write each pair once, when visiting its lower-indexed base. | ||
            if pair_index > i: | ||
                dot_bracket[i - 1] = "(" | ||
                dot_bracket[pair_index - 1] = ")" | ||
|
||
    # File stems look like "<family>_<name>"; normalize the family labels. | ||
    family, name = file.stem.split("_", 1) | ||
    if family in ("5s", "16s", "23s"): | ||
        family = family.upper() + "_rRNA" | ||
    elif family == "srp": | ||
        family = family.upper() | ||
    elif family == "grp1": | ||
        family = "group_I_intron" | ||
    elif family == "grp2": | ||
        family = "group_II_intron" | ||
    id = family + "-" + name | ||
|
||
    return { | ||
        "id": id, | ||
        "sequence": "".join(sequence), | ||
        "secondary_structure": "".join(dot_bracket), | ||
        "family": family, | ||
    } | ||
|
||
|
||
def convert_dataset(convert_config): | ||
    max_seq_len = convert_config.max_seq_len | ||
    files = [ | ||
        os.path.join(convert_config.dataset_path, f) | ||
        for f in os.listdir(convert_config.dataset_path) | ||
        if f.endswith(".ct") | ||
    ] | ||
    files.sort() | ||
    data = [convert_ct(file) for file in tqdm(files, total=len(files))] | ||
    if max_seq_len is not None: | ||
        data = [d for d in data if len(d["sequence"]) <= max_seq_len] | ||
    save_dataset(convert_config, data, filename="test.parquet") | ||
|
||
|
||
class ConvertConfig(ConvertConfig_): | ||
    max_seq_len: int | None = None | ||
    root: str = os.path.dirname(__file__) | ||
    output_path: str = os.path.basename(os.path.dirname(__file__)) | ||
|
||
    def post(self): | ||
        if self.max_seq_len is not None: | ||
            self.output_path = f"{self.output_path}.{self.max_seq_len}" | ||
        super().post() | ||
|
||
|
||
if __name__ == "__main__": | ||
    config = ConvertConfig() | ||
    config.parse()  # type: ignore[attr-defined] | ||
    convert_dataset(config) |
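As an illustration of the input that `convert_ct` expects, the following minimal sketch applies the same column logic (nucleotide in column 2, pair index in column 5) to an in-memory string. The 7-nucleotide hairpin, `CT_TEXT`, and `ct_to_record` are made up for this example and are not part of the commit; the sketch skips the back-pointer validation and the family/id handling.

```python
from __future__ import annotations

# Hypothetical connectivity table for a tiny hairpin.
# Header: number of bases, then a free-text title.
# Row columns: index, base, index - 1, index + 1, pair index (0 = unpaired), original index.
CT_TEXT = """\
7 toy_hairpin
1 G 0 2 7 1
2 C 1 3 6 2
3 A 2 4 0 3
4 A 3 5 0 4
5 A 4 6 0 5
6 G 5 7 2 6
7 C 6 0 1 7
"""


def ct_to_record(text: str) -> dict[str, str]:
    """Build the sequence and dot-bracket string from CT-formatted text."""
    lines = text.splitlines()
    num_bases = int(lines[0].split()[0])
    sequence = []
    dot_bracket = ["."] * num_bases
    for i in range(1, num_bases + 1):
        fields = lines[i].split()
        sequence.append(fields[1])
        pair_index = int(fields[4])
        # Write each pair once, from the lower-indexed partner.
        if pair_index > i:
            dot_bracket[i - 1] = "("
            dot_bracket[pair_index - 1] = ")"
    return {
        "sequence": "".join(sequence),
        "secondary_structure": "".join(dot_bracket),
    }


record = ct_to_record(CT_TEXT)
print(record["sequence"])             # GCAAAGC
print(record["secondary_structure"])  # ((...))
```

Because every pair is written with parentheses regardless of nesting, crossing pairs in a source file would not be distinguishable in the resulting string.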
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
--- | ||
language: rna | ||
tags: | ||
- Biology | ||
- RNA | ||
license: | ||
- agpl-3.0 | ||
size_categories: | ||
- 10K<n<100K | ||
source_datasets: | ||
- multimolecule/bprna | ||
- multimolecule/pdb | ||
task_categories: | ||
- text-generation | ||
- fill-mask | ||
task_ids: | ||
- language-modeling | ||
- masked-language-modeling | ||
pretty_name: RNAStrAlign | ||
library_name: multimolecule | ||
--- | ||
|
||
# RNAStrAlign | ||
|
||
RNAStrAlign is a comprehensive dataset of RNA sequences and their secondary structures. | ||
|
||
RNAStrAlign aggregates data from multiple established RNA structure repositories, covering diverse RNA families such as 5S ribosomal RNA, tRNA, and group I introns. | ||
|
||
It is considered complementary to the [ArchiveII](./archiveii) dataset. | ||
|
||
## Disclaimer | ||
|
||
This is an UNOFFICIAL release of the RNAStrAlign dataset by Zhen Tan et al. | ||
|
||
**The team releasing RNAStrAlign did not write this dataset card, so it has been written by the MultiMolecule team.** | ||
|
||
## Dataset Description | ||
|
||
- **Homepage**: https://multimolecule.danling.org/datasets/rnastralign | ||
- **datasets**: https://huggingface.co/datasets/multimolecule/rnastralign | ||
- **Point of Contact**: [David H. Mathews](mailto:[email protected]) and [Gaurav Sharma](mailto:[email protected]) | ||
|
||
## Example Entry | ||
|
||
| id | sequence | secondary_structure | family | subfamily | | ||
| -------------------------------- | ----------------------------------- | ------------------------------------ | ---------- | -------------- | | ||
| 16S_rRNA-Actinobacteria-AB002635 | ACACAUGCAAGCGAACGUGAUCUCCAGCUUGC... | .(((.(((..((..((((.(((((.((....)... | 16S_rRNA | Actinobacteria | | ||
|
||
## Column Description | ||
|
||
- **id**: | ||
  A unique identifier for each RNA entry. This ID is derived from the family and the original `.sta` file name, and serves as a reference to the specific RNA structure within the dataset. | ||
|
||
- **sequence**: | ||
  The nucleotide sequence of the RNA molecule, represented using the standard RNA bases: | ||
|
||
  - **A**: Adenine | ||
  - **C**: Cytosine | ||
  - **G**: Guanine | ||
  - **U**: Uracil | ||
|
||
- **secondary_structure**: | ||
  The secondary structure of the RNA represented in dot-bracket notation, using two types of symbols to indicate base pairing and unpaired regions, as per bpRNA's standard: | ||
|
||
  - **Dots (`.`)**: Represent unpaired nucleotides. | ||
  - **Parentheses (`(` and `)`)**: Represent base pairs in standard stems (page 1). | ||
|
||
- **family**: | ||
  The RNA family to which the sequence belongs, such as 16S rRNA, 5S rRNA, etc. | ||
|
||
- **subfamily**: | ||
  A more specific subfamily within the family, such as Actinobacteria for 16S rRNA. | ||
|
||
  For families without subfamilies, this field is `None`. | ||
|
||
## Variations | ||
|
||
This dataset is available in three variants: | ||
|
||
- [rnastralign](https://huggingface.co/datasets/multimolecule/rnastralign): The main RNAStrAlign dataset. | ||
- [rnastralign.512](https://huggingface.co/datasets/multimolecule/rnastralign.512): RNAStrAlign dataset with sequences no longer than 512 nucleotides. | ||
- [rnastralign.1024](https://huggingface.co/datasets/multimolecule/rnastralign.1024): RNAStrAlign dataset with sequences no longer than 1024 nucleotides. | ||
|
||
## Related Datasets | ||
|
||
- [ArchiveII](https://huggingface.co/datasets/multimolecule/archiveii): A database of RNA secondary structures with the same families as RNAStrAlign, usually used for testing. | ||
- [bpRNA-spot](https://huggingface.co/datasets/multimolecule/bprna-spot): Another database commonly used in RNA secondary structure prediction. | ||
|
||
## License | ||
|
||
This dataset is licensed under the [AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html). | ||
|
||
```spdx | ||
SPDX-License-Identifier: AGPL-3.0-or-later | ||
``` | ||
|
||
## Citation | ||
|
||
```bibtex | ||
@article{tan2017turbofold, | ||
author = {Tan, Zhen and Fu, Yinghan and Sharma, Gaurav and Mathews, David H}, | ||
journal = {Nucleic Acids Research}, | ||
month = nov, | ||
number = 20, | ||
pages = {11570--11581}, | ||
title = {{TurboFold} {II}: {RNA} structural alignment and secondary structure prediction informed by multiple homologs}, | ||
volume = 45, | ||
year = 2017 | ||
} | ||
``` |
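Reader's sketch (not part of this commit): the `.512` and `.1024` variants listed under Variations can be thought of as a length filter plus a suffix on the dataset name, which is what the ArchiveII converter in this commit does. The RNAStrAlign converter is not shown on this page, so applying the same logic here is an assumption; the records, `filter_by_length`, and `variant_name` below are made up for illustration.

```python
from __future__ import annotations

# Made-up records shaped like the example entry; sequences are placeholders.
records = [
    {"id": "16S_rRNA-Actinobacteria-demo", "sequence": "A" * 1500, "subfamily": "Actinobacteria"},
    {"id": "5S_rRNA-demo", "sequence": "A" * 120, "subfamily": None},
    {"id": "tmRNA-demo", "sequence": "A" * 600, "subfamily": None},
]


def filter_by_length(records: list[dict], max_seq_len: int | None) -> list[dict]:
    """Keep records whose sequence has at most max_seq_len nucleotides."""
    if max_seq_len is None:
        return list(records)
    return [r for r in records if len(r["sequence"]) <= max_seq_len]


def variant_name(base: str, max_seq_len: int | None) -> str:
    """Append the length limit to the dataset name, e.g. 'rnastralign.512'."""
    return base if max_seq_len is None else f"{base}.{max_seq_len}"


for limit in (None, 512, 1024):
    kept = filter_by_length(records, limit)
    print(variant_name("rnastralign", limit), [r["id"] for r in kept])
```

The comparison is inclusive, so a sequence of exactly 512 nucleotides would remain in the `.512` variant under this assumption.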