add EternaBench dataset

Signed-off-by: Zhiyuan Chen <[email protected]>
DLS5-Omics · Oct 8, 2024 · 50f4bc4
1 parent 0a0e9dc
commit 50f4bc4

Showing 4 changed files with 167 additions and 0 deletions.
diff --git a/docs/docs/datasets/eternabench.md b/docs/docs/datasets/eternabench.md
@@ -0,0 +1,9 @@
+---
+authors:
+  - Zhiyuan Chen
+date: 2024-05-04
+---
+
+# EternaBench
+
+--8<-- "multimolecule/datasets/eternabench/README.md:21:"
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
@@ -23,6 +23,7 @@ nav:
           - bpRNA-spot: datasets/bprna-spot.md
           - bpRNA-new: datasets/bprna-new.md
           - RYOS: datasets/ryos.md
+          - EternaBench: datasets/eternabench.md
   - module:
       - module/index.md
       - heads: module/heads.md

diff --git a/multimolecule/datasets/eternabench/README.md b/multimolecule/datasets/eternabench/README.md
@@ -0,0 +1,97 @@
+---
+language: rna
+tags:
+  - Biology
+  - RNA
+license:
+  - agpl-3.0
+size_categories:
+  - 1K<n<10K
+task_categories:
+  - text-generation
+  - fill-mask
+task_ids:
+  - language-modeling
+  - masked-language-modeling
+pretty_name: EternaBench
+library_name: multimolecule
+---
+
+# EternaBench
+
+![EternaBench](https://eternagame.org/sites/default/files/thumb_eternabench_paper.png)
+
+EternaBench is a database of diverse high-throughput structural data gathered through the crowdsourced RNA design project Eterna. It is used to evaluate the performance of a wide range of RNA secondary structure prediction algorithms.
+
+## Disclaimer
+
+This is an UNOFFICIAL release of [EternaBench](https://github.com/eternagame/EternaBench) by Hannah K. Wayment-Steele et al.
+
+**The team releasing EternaBench did not write a dataset card for this dataset, so this dataset card has been written by the MultiMolecule team.**
+
+## Dataset Description
+
+- **Homepage**: https://multimolecule.danling.org/datasets/eternabench
+- **Point of Contact**: [Rhiju Das](https://biochemistry.stanford.edu/people/rhiju-das/)
+
+## Example Entry
+
+| ID       | design_name            | sequence         | structure        | reactivity                 | errors                      | signal_to_noise |
+| -------- | ---------------------- | ---------------- | ---------------- | -------------------------- | --------------------------- | --------------- |
+| 769337-1 | d+m plots weaker again | GGAAAAAAAAAAA... | ................ | [0.642,1.4853,0.1629, ...] | [0.3181,0.4221,0.1823, ...] | 3.227           |
+
+## Column Description
+
+The EternaBench dataset consists of the following columns, which support the evaluation of RNA structure prediction algorithms:
+
+- **ID**:
+  A unique identifier for each RNA sequence entry.
+
+- **design_name**:
+  The name given to each RNA design by contributors, used for easy reference.
+
+- **sequence**:
+  The nucleotide sequence of the RNA, using standard bases.
+
+- **structure**:
+  The predicted secondary structure of the RNA, represented using dot-bracket notation.
+  The structure helps determine the likely secondary interactions within each RNA molecule.
+
+- **reactivity**:
+  A list of floating-point values giving the measured chemical reactivity at each nucleotide position.
+  Higher reactivity generally indicates a more flexible, likely unpaired nucleotide.
+
+- **errors**:
+  A list of floating-point values giving the experimental error for each measurement in **reactivity**.
+  These values quantify the uncertainty in the reactivity measurements.
+
+## Variations
+
+This dataset is available in two subsets:
+
+- [EternaBench-CM](https://huggingface.co/datasets/multimolecule/eternabench-cm)
+- [EternaBench-Switch](https://huggingface.co/datasets/multimolecule/eternabench-switch)
+
+## License
+
+This dataset is licensed under the [AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html).
+
+```spdx
+SPDX-License-Identifier: AGPL-3.0-or-later
+```
+
+## Citation
+
+```bibtex
+@article{waymentsteele2022rna,
+  author    = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},
+  journal   = {Nature Methods},
+  month     = oct,
+  number    = 10,
+  pages     = {1234--1242},
+  publisher = {Springer Science and Business Media LLC},
+  title     = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},
+  volume    = 19,
+  year      = 2022
+}
+```
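The `structure` column described in the README uses dot-bracket notation, where matching parentheses mark paired positions and dots mark unpaired ones. As a minimal sketch of how such a string can be turned into base-pair indices (the helper name `dot_bracket_to_pairs` is hypothetical and not part of MultiMolecule, and pseudoknot brackets such as `[` and `]` are treated as unpaired here):

```python
from __future__ import annotations


def dot_bracket_to_pairs(structure: str) -> list[tuple[int, int]]:
    """Return 0-based (open, close) index pairs for a dot-bracket string."""
    stack: list[int] = []  # positions of "(" still waiting for a partner
    pairs: list[tuple[int, int]] = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {index}")
            pairs.append((stack.pop(), index))
        # any other symbol (such as ".") is treated as unpaired
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(dot_bracket_to_pairs("((..))."))  # [(0, 5), (1, 4)]
```

A stack is used because dot-bracket strings without pseudoknots are properly nested, so each closing parenthesis pairs with the most recent unmatched opening one.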
diff --git a/multimolecule/datasets/eternabench/eternabench.py b/multimolecule/datasets/eternabench/eternabench.py
@@ -0,0 +1,60 @@
+# MultiMolecule
+# Copyright (C) 2024-Present  MultiMolecule
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Affero General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU Affero General Public License for more details.
+
+# You should have received a copy of the GNU Affero General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+from __future__ import annotations
+
+import os
+
+import danling as dl
+import pandas as pd
+import torch
+
+from multimolecule.datasets.conversion_utils import ConvertConfig as ConvertConfig_
+from multimolecule.datasets.conversion_utils import save_dataset
+
+torch.manual_seed(1016)
+
+cols = [
+    "ID",
+    "design_name",
+    "sequence",
+    "structure",
+    "reactivity",
+    "errors",
+    "signal_to_noise",
+]
+
+
+def convert_dataset_(df: pd.DataFrame):
+    df = df[cols].copy()
+    df["signal_to_noise"] = df["signal_to_noise"].str.split(":").str[-1].astype(float)
+    return df
+
+
+def convert_dataset(convert_config):
+    train = dl.load_pandas(convert_config.train_path)
+    test = dl.load_pandas(convert_config.test_path)
+    save_dataset(convert_config, {"train": convert_dataset_(train), "test": convert_dataset_(test)})
+
+
+class ConvertConfig(ConvertConfig_):
+    root: str = os.path.dirname(__file__)
+
+
+if __name__ == "__main__":
+    config = ConvertConfig()
+    config.parse()  # type: ignore[attr-defined]
+    convert_dataset(config)
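
The conversion step in `convert_dataset_` keeps a fixed list of columns and parses `signal_to_noise` by taking the text after the last `:` and casting it to a float. The sketch below reproduces only that parsing on a toy frame. It assumes the raw values are strings of the form `label:value`; the sample strings and the helper name `parse_signal_to_noise` are made up for illustration and are not taken from the EternaBench source files.

```python
import pandas as pd

# Hypothetical raw rows; the real EternaBench files may format this column differently.
raw = pd.DataFrame(
    {
        "ID": ["a", "b"],
        "signal_to_noise": ["strong:3.227", "weak:0.5"],
    }
)


def parse_signal_to_noise(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with signal_to_noise converted to float."""
    out = df.copy()  # copy first so the caller's frame is left unchanged
    out["signal_to_noise"] = out["signal_to_noise"].str.split(":").str[-1].astype(float)
    return out


parsed = parse_signal_to_noise(raw)
print(parsed["signal_to_noise"].tolist())
```

Working on an explicit copy avoids assigning into a slice of the input frame, which pandas may flag with a `SettingWithCopyWarning`.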