Signed-off-by: Zhiyuan Chen <[email protected]>
1 parent 0a0e9dc, commit 9298c7c
Showing 4 changed files with 166 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
--- | ||
authors: | ||
- Zhiyuan Chen | ||
date: 2024-05-04 | ||
--- | ||
|
||
# EternaBench | ||
|
||
--8<-- "multimolecule/datasets/eternabench/README.md:21:" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
--- | ||
language: rna | ||
tags: | ||
- Biology | ||
- RNA | ||
license: | ||
- agpl-3.0 | ||
size_categories: | ||
- 1K<n<10K | ||
task_categories: | ||
- text-generation | ||
- fill-mask | ||
task_ids: | ||
- language-modeling | ||
- masked-language-modeling | ||
pretty_name: EternaBench | ||
library_name: multimolecule | ||
--- | ||
|
||
# EternaBench | ||
|
||
![EternaBench](https://eternagame.org/sites/default/files/thumb_eternabench_paper.png) | ||
|
||
EternaBench is a database of diverse high-throughput structural data gathered through the crowdsourced RNA design project Eterna, assembled to evaluate the performance of a wide range of RNA secondary structure prediction algorithms. | ||
|
||
## Disclaimer | ||
|
||
This is an UNOFFICIAL release of [EternaBench](https://github.com/eternagame/EternaBench) by Hannah K. Wayment-Steele et al. | ||
|
||
**The team releasing EternaBench did not write a dataset card for this dataset, so this card has been written by the MultiMolecule team.** | ||
|
||
## Dataset Description | ||
|
||
- **Homepage**: https://multimolecule.danling.org/datasets/eternabench | ||
- **Point of Contact**: [Rhiju Das](https://biochemistry.stanford.edu/people/rhiju-das/) | ||
|
||
## Example Entry | ||
|
||
| ID | design_name | sequence | structure | reactivity | errors | signal_to_noise | | ||
| --- | --- | --- | --- | --- | --- | --- | | ||
| 769337-1 | d+m plots weaker again | GGAAAAAAAAAAA... | ................ | [0.642,1.4853,0.1629, ...] | [0.3181,0.4221,0.1823, ...] | 3.227 | | ||
|
||
## Column Description | ||
|
||
The EternaBench dataset consists of the following columns, which together describe each RNA design and its experimental measurements: | ||
|
||
- **ID**: | ||
A unique identifier for each RNA sequence entry. | ||
|
||
- **design_name**: | ||
The name given to each RNA design by contributors, used for easy reference. | ||
|
||
- **sequence**: | ||
The nucleotide sequence of the RNA, using standard bases. | ||
|
||
- **structure**: | ||
The predicted secondary structure of the RNA, represented using dot-bracket notation. | ||
The structure helps determine the likely secondary interactions within each RNA molecule. | ||
|
||
- **reactivity**: | ||
A list of floating-point values giving the measured chemical reactivity at each nucleotide position. | ||
Higher values generally indicate nucleotides that are more flexible or unpaired, so the profile serves as an experimental readout of the RNA structure. | ||
|
||
- **errors**: | ||
A list of floating-point values giving the experimental error for each measurement in **reactivity**. | ||
These values help quantify the uncertainty in the reactivity measurements. | ||
|
||
- **signal_to_noise**: | ||
The signal-to-noise ratio of the reactivity measurements for the entry, stored as a single floating-point value. | ||
|
||
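The **structure** and **reactivity** columns can be read together. The following is a minimal sketch, not part of the MultiMolecule library: it parses a dot-bracket string into base pairs and compares the mean reactivity of paired and unpaired positions. The function names and the toy entry are hypothetical, and the sketch assumes that the reactivity list is aligned with the structure string position by position (any extra positions on either side are ignored).

```python
from __future__ import annotations

from statistics import mean


def parse_dot_bracket(structure: str) -> list[tuple[int, int]]:
    """Return sorted (i, j) index pairs for matched parentheses in a dot-bracket string."""
    stack: list[int] = []
    pairs: list[tuple[int, int]] = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        # Any other symbol (for example '.') is treated as unpaired.
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def mean_reactivity_by_pairing(
    structure: str, reactivity: list[float]
) -> tuple[float | None, float | None]:
    """Return (mean over paired positions, mean over unpaired positions)."""
    paired = {i for pair in parse_dot_bracket(structure) for i in pair}
    # Assumption: values are aligned with the structure; ignore unmatched tails.
    n = min(len(structure), len(reactivity))
    paired_vals = [reactivity[i] for i in range(n) if i in paired]
    unpaired_vals = [reactivity[i] for i in range(n) if i not in paired]
    return (
        mean(paired_vals) if paired_vals else None,
        mean(unpaired_vals) if unpaired_vals else None,
    )


# Hypothetical toy entry, not taken from the dataset.
toy_structure = "((...))."
toy_reactivity = [0.1, 0.2, 0.9, 1.1, 1.0, 0.2, 0.1, 0.8]

print(parse_dot_bracket(toy_structure))
print(mean_reactivity_by_pairing(toy_structure, toy_reactivity))
```

Only plain parentheses are treated as pairs here; extended notations for pseudoknots (such as `[` and `]`) would need additional bracket types.
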
## Variations | ||
|
||
This dataset is available in two subsets: | ||
|
||
- [EternaBench-CM](https://huggingface.co/datasets/multimolecule/eternabench-cm) | ||
- [EternaBench-Switch](https://huggingface.co/datasets/multimolecule/eternabench-switch) | ||
|
||
## License | ||
|
||
This dataset is licensed under the [AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html). | ||
|
||
```spdx | ||
SPDX-License-Identifier: AGPL-3.0-or-later | ||
``` | ||
|
||
## Citation | ||
|
||
```bibtex | ||
@article{waymentsteele2022rna, | ||
author = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju}, | ||
journal = {Nature Methods}, | ||
month = oct, | ||
number = 10, | ||
pages = {1234--1242}, | ||
publisher = {Springer Science and Business Media LLC}, | ||
title = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments}, | ||
volume = 19, | ||
year = 2022 | ||
} | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
# MultiMolecule | ||
# Copyright (C) 2024-Present MultiMolecule | ||
|
||
# This program is free software: you can redistribute it and/or modify | ||
# it under the terms of the GNU Affero General Public License as published by | ||
# the Free Software Foundation, either version 3 of the License, or | ||
# any later version. | ||
|
||
# This program is distributed in the hope that it will be useful, | ||
# but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
# GNU Affero General Public License for more details. | ||
|
||
# You should have received a copy of the GNU Affero General Public License | ||
# along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
|
||
from __future__ import annotations | ||
|
||
import os | ||
|
||
import danling as dl | ||
import pandas as pd | ||
import torch | ||
|
||
from multimolecule.datasets.conversion_utils import ConvertConfig as ConvertConfig_ | ||
from multimolecule.datasets.conversion_utils import save_dataset | ||
|
||
torch.manual_seed(1016) | ||
|
||
cols = [ | ||
    "ID", | ||
    "design_name", | ||
    "sequence", | ||
    "structure", | ||
    "reactivity", | ||
    "errors", | ||
    "signal_to_noise", | ||
] | ||
|
||
|
||
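def convert_dataset_(df: pd.DataFrame) -> pd.DataFrame: | ||
    # Copy the selected columns so the assignment below does not write into a slice of the input. | ||
    df = df[cols].copy() | ||
    # Keep only the text after the last ":" in each raw value and convert it to float. | ||
    df["signal_to_noise"] = df["signal_to_noise"].str.split(":").str[-1].astype(float) | ||
    return df | ||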
|
||
|
||
def convert_dataset(convert_config): | ||
    train = dl.load_pandas(convert_config.train_path) | ||
    test = dl.load_pandas(convert_config.test_path) | ||
    save_dataset(convert_config, {"train": convert_dataset_(train), "test": convert_dataset_(test)}) | ||
|
||
|
||
class ConvertConfig(ConvertConfig_): | ||
    root: str = os.path.dirname(__file__) | ||
|
||
|
||
if __name__ == "__main__": | ||
    config = ConvertConfig() | ||
    config.parse()  # type: ignore[attr-defined] | ||
    convert_dataset(config) |
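
The `signal_to_noise` conversion in `convert_dataset_` keeps the text after the last `:` in each value and casts it to a float. The snippet below is a plain-Python sketch of that same rule, written without pandas so it can be run on its own. The helper name and the raw strings are hypothetical; the actual format of the values in the source files is not shown in this commit.

```python
def parse_signal_to_noise(raw: str) -> float:
    """Keep the text after the last ':' and convert it to a float."""
    return float(raw.split(":")[-1])


# Hypothetical raw values, chosen only to illustrate the rule.
raw_values = ["weak:3.227", "strong:12.5", "0.75"]
parsed = [parse_signal_to_noise(value) for value in raw_values]
print(parsed)  # [3.227, 12.5, 0.75]
```

A value with no colon passes through unchanged before the cast, because splitting returns the whole string as its only element.
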