From 02755664286b4965a342d72b525bf24dd04d531c Mon Sep 17 00:00:00 2001 From: ZhiyuanChen Date: Wed, 11 Dec 2024 18:20:47 +0000 Subject: [PATCH] deploy: e7b3da6c1856c357dfdfb2b9a44261e04ada9a23 --- feed_rss_created.xml | 2 +- feed_rss_updated.xml | 2 +- index.html | 13 ------------- search/search_index.json | 2 +- zh/index.html | 13 ------------- 5 files changed, 3 insertions(+), 29 deletions(-) diff --git a/feed_rss_created.xml b/feed_rss_created.xml index ad494009..e37e7579 100644 --- a/feed_rss_created.xml +++ b/feed_rss_created.xml @@ -1 +1 @@ - MultiMoleculeNeural Networks for RNA, DNA, and Proteinhttps://multimolecule.danling.org/MultiMoleculehttps://github.com/DLS5-Omics/multimoleculeen Wed, 11 Dec 2024 17:18:37 -0000 Wed, 11 Dec 2024 17:18:37 -0000 1440 MkDocs RSS plugin - v1.17.0 None MultiMoleculehttps://multimolecule.danling.org/ about <h1>About</h1><p style="text-align: center;">Developed by DanLing on Earth</p><p>We are a community of developers, designers, and others from around the world who ...</p>https://multimolecule.danling.org/about/ Wed, 10 Jul 2024 01:11:47 +0000MultiMoleculehttps://multimolecule.danling.org/about/ License <p>--8&lt;-- "LICENSE.md"</p>https://multimolecule.danling.org/about/license/ Wed, 10 Jul 2024 01:11:47 +0000MultiMoleculehttps://multimolecule.danling.org/about/license/ Privacy Notice <p>--8&lt;-- "privacy.md"</p>https://multimolecule.danling.org/about/privacy/ Wed, 10 Jul 2024 01:11:47 +0000MultiMoleculehttps://multimolecule.danling.org/about/privacy/ about <h1>关于</h1><p style="text-align: center;">由丹灵在地球开发</p><p>我们是一个由开发者、设计人员和其他人员组成的社区,致力于让深度学习技术更加开放。</p><p>我们是一个由个体组成的社区,致力于推动深度学习的可能性边界。</p><p>我们对深度学习及其用户充满激情。</p><p>我们是丹灵。</p>https://multimolecule.danling.org/zh/about/ Wed, 10 Jul 2024 01:11:47 +0000MultiMoleculehttps://multimolecule.danling.org/zh/about/ License <p>--8&lt;-- "LICENSE.zh.md"</p>https://multimolecule.danling.org/zh/about/license/ Wed, 10 Jul 2024 01:11:47 
+0000MultiMoleculehttps://multimolecule.danling.org/zh/about/license/ Privacy Notice <p>--8&lt;-- "privacy.zh.md"</p>https://multimolecule.danling.org/zh/about/privacy/ Wed, 10 Jul 2024 01:11:47 +0000MultiMoleculehttps://multimolecule.danling.org/zh/about/privacy/ License FAQ <p>--8&lt;-- "about/license-faq.md"</p>https://multimolecule.danling.org/about/license-faq/ Tue, 25 Jun 2024 15:24:27 +0000MultiMoleculehttps://multimolecule.danling.org/about/license-faq/ License FAQ <p>--8&lt;-- "about/license-faq.zh.md"</p>https://multimolecule.danling.org/zh/about/license-faq/ Tue, 25 Jun 2024 15:24:27 +0000MultiMoleculehttps://multimolecule.danling.org/zh/about/license-faq/ data Zhiyuan Chen <h1>data</h1><p>--8&lt;-- "multimolecule/data/README.md:8:"</p>https://multimolecule.danling.org/data/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/data/ Dataset Zhiyuan Chen <h1>Dataset</h1><p>::: multimolecule.data.Dataset</p>https://multimolecule.danling.org/data/dataset/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/data/dataset/ ArchiveII Zhiyuan Chen <h1>ArchiveII</h1><p>--8&lt;-- "multimolecule/datasets/archiveii/README.md:24:"</p>https://multimolecule.danling.org/datasets/archiveii/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/archiveii/ bpRNA-new Zhiyuan Chen <h1>bpRNA-1m</h1><p>--8&lt;-- "multimolecule/datasets/bprna_new/README.md:23:"</p>https://multimolecule.danling.org/datasets/bprna-new/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/bprna-new/ bpRNA-spot Zhiyuan Chen <h1>bpRNA-1m</h1><p>--8&lt;-- "multimolecule/datasets/bprna_spot/README.md:24:"</p>https://multimolecule.danling.org/datasets/bprna-spot/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/bprna-spot/ bpRNA-1m Zhiyuan Chen <h1>bpRNA-1m</h1><p>--8&lt;-- 
"multimolecule/datasets/bprna/README.md:29:"</p>https://multimolecule.danling.org/datasets/bprna/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/bprna/ EternaBench-CM Zhiyuan Chen <h1>EternaBench-CM</h1><p>--8&lt;-- "multimolecule/datasets/eternabench_cm/README.md:21:"</p>https://multimolecule.danling.org/datasets/eternabench-cm/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/eternabench-cm/ EternaBench-External Zhiyuan Chen <h1>EternaBench-External</h1><p>--8&lt;-- "multimolecule/datasets/eternabench_external/README.md:21:"</p>https://multimolecule.danling.org/datasets/eternabench-external/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/eternabench-external/ EternaBench-Switch Zhiyuan Chen <h1>EternaBench-Switch</h1><p>--8&lt;-- "multimolecule/datasets/eternabench_switch/README.md:21:"</p>https://multimolecule.danling.org/datasets/eternabench-switch/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/eternabench-switch/ GENCODE Zhiyuan Chen <h1>GENCODE</h1><p>--8&lt;-- "multimolecule/datasets/gencode/README.md:21:"</p>https://multimolecule.danling.org/datasets/gencode/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/gencode/ Rfam Zhiyuan Chen <h1>Rfam</h1><p>--8&lt;-- "multimolecule/datasets/rfam/README.md:21:"</p>https://multimolecule.danling.org/datasets/rfam/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/rfam/ RIVAS Zhiyuan Chen <h1>RIVAS</h1><p>--8&lt;-- "multimolecule/datasets/rivas/README.md:21:"</p>https://multimolecule.danling.org/datasets/rivas/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/rivas/ \ No newline at end of file + MultiMoleculeNeural Networks for RNA, DNA, and Proteinhttps://multimolecule.danling.org/MultiMoleculehttps://github.com/DLS5-Omics/multimoleculeen Wed, 11 Dec 
2024 18:20:38 -0000 Wed, 11 Dec 2024 18:20:38 -0000 1440 MkDocs RSS plugin - v1.17.0 None MultiMoleculehttps://multimolecule.danling.org/ about <h1>About</h1><p style="text-align: center;">Developed by DanLing on Earth</p><p>We are a community of developers, designers, and others from around the world who ...</p>https://multimolecule.danling.org/about/ Wed, 10 Jul 2024 01:11:47 +0000MultiMoleculehttps://multimolecule.danling.org/about/ License <p>--8&lt;-- "LICENSE.md"</p>https://multimolecule.danling.org/about/license/ Wed, 10 Jul 2024 01:11:47 +0000MultiMoleculehttps://multimolecule.danling.org/about/license/ Privacy Notice <p>--8&lt;-- "privacy.md"</p>https://multimolecule.danling.org/about/privacy/ Wed, 10 Jul 2024 01:11:47 +0000MultiMoleculehttps://multimolecule.danling.org/about/privacy/ about <h1>关于</h1><p style="text-align: center;">由丹灵在地球开发</p><p>我们是一个由开发者、设计人员和其他人员组成的社区,致力于让深度学习技术更加开放。</p><p>我们是一个由个体组成的社区,致力于推动深度学习的可能性边界。</p><p>我们对深度学习及其用户充满激情。</p><p>我们是丹灵。</p>https://multimolecule.danling.org/zh/about/ Wed, 10 Jul 2024 01:11:47 +0000MultiMoleculehttps://multimolecule.danling.org/zh/about/ License <p>--8&lt;-- "LICENSE.zh.md"</p>https://multimolecule.danling.org/zh/about/license/ Wed, 10 Jul 2024 01:11:47 +0000MultiMoleculehttps://multimolecule.danling.org/zh/about/license/ Privacy Notice <p>--8&lt;-- "privacy.zh.md"</p>https://multimolecule.danling.org/zh/about/privacy/ Wed, 10 Jul 2024 01:11:47 +0000MultiMoleculehttps://multimolecule.danling.org/zh/about/privacy/ License FAQ <p>--8&lt;-- "about/license-faq.md"</p>https://multimolecule.danling.org/about/license-faq/ Tue, 25 Jun 2024 15:24:27 +0000MultiMoleculehttps://multimolecule.danling.org/about/license-faq/ License FAQ <p>--8&lt;-- "about/license-faq.zh.md"</p>https://multimolecule.danling.org/zh/about/license-faq/ Tue, 25 Jun 2024 15:24:27 +0000MultiMoleculehttps://multimolecule.danling.org/zh/about/license-faq/ data Zhiyuan Chen <h1>data</h1><p>--8&lt;-- 
"multimolecule/data/README.md:8:"</p>https://multimolecule.danling.org/data/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/data/ Dataset Zhiyuan Chen <h1>Dataset</h1><p>::: multimolecule.data.Dataset</p>https://multimolecule.danling.org/data/dataset/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/data/dataset/ ArchiveII Zhiyuan Chen <h1>ArchiveII</h1><p>--8&lt;-- "multimolecule/datasets/archiveii/README.md:24:"</p>https://multimolecule.danling.org/datasets/archiveii/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/archiveii/ bpRNA-new Zhiyuan Chen <h1>bpRNA-1m</h1><p>--8&lt;-- "multimolecule/datasets/bprna_new/README.md:23:"</p>https://multimolecule.danling.org/datasets/bprna-new/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/bprna-new/ bpRNA-spot Zhiyuan Chen <h1>bpRNA-1m</h1><p>--8&lt;-- "multimolecule/datasets/bprna_spot/README.md:24:"</p>https://multimolecule.danling.org/datasets/bprna-spot/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/bprna-spot/ bpRNA-1m Zhiyuan Chen <h1>bpRNA-1m</h1><p>--8&lt;-- "multimolecule/datasets/bprna/README.md:29:"</p>https://multimolecule.danling.org/datasets/bprna/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/bprna/ EternaBench-CM Zhiyuan Chen <h1>EternaBench-CM</h1><p>--8&lt;-- "multimolecule/datasets/eternabench_cm/README.md:21:"</p>https://multimolecule.danling.org/datasets/eternabench-cm/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/eternabench-cm/ EternaBench-External Zhiyuan Chen <h1>EternaBench-External</h1><p>--8&lt;-- "multimolecule/datasets/eternabench_external/README.md:21:"</p>https://multimolecule.danling.org/datasets/eternabench-external/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/eternabench-external/ 
EternaBench-Switch Zhiyuan Chen <h1>EternaBench-Switch</h1><p>--8&lt;-- "multimolecule/datasets/eternabench_switch/README.md:21:"</p>https://multimolecule.danling.org/datasets/eternabench-switch/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/eternabench-switch/ GENCODE Zhiyuan Chen <h1>GENCODE</h1><p>--8&lt;-- "multimolecule/datasets/gencode/README.md:21:"</p>https://multimolecule.danling.org/datasets/gencode/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/gencode/ Rfam Zhiyuan Chen <h1>Rfam</h1><p>--8&lt;-- "multimolecule/datasets/rfam/README.md:21:"</p>https://multimolecule.danling.org/datasets/rfam/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/rfam/ RIVAS Zhiyuan Chen <h1>RIVAS</h1><p>--8&lt;-- "multimolecule/datasets/rivas/README.md:21:"</p>https://multimolecule.danling.org/datasets/rivas/ Sat, 04 May 2024 00:00:00 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/rivas/ \ No newline at end of file diff --git a/feed_rss_updated.xml b/feed_rss_updated.xml index 5d82fffc..91361154 100644 --- a/feed_rss_updated.xml +++ b/feed_rss_updated.xml @@ -1 +1 @@ - MultiMoleculeNeural Networks for RNA, DNA, and Proteinhttps://multimolecule.danling.org/MultiMoleculehttps://github.com/DLS5-Omics/multimoleculeen Wed, 11 Dec 2024 17:18:37 -0000 Wed, 11 Dec 2024 17:18:37 -0000 1440 MkDocs RSS plugin - v1.17.0 None MultiMoleculehttps://multimolecule.danling.org/ License FAQ <p>--8&lt;-- "about/license-faq.md"</p>https://multimolecule.danling.org/about/license-faq/ Wed, 11 Dec 2024 15:43:20 +0000MultiMoleculehttps://multimolecule.danling.org/about/license-faq/ License <p>--8&lt;-- "LICENSE.md"</p>https://multimolecule.danling.org/about/license/ Wed, 11 Dec 2024 15:43:20 +0000MultiMoleculehttps://multimolecule.danling.org/about/license/ Privacy Notice <p>--8&lt;-- "privacy.md"</p>https://multimolecule.danling.org/about/privacy/ Wed, 11 Dec 2024 
15:43:20 +0000MultiMoleculehttps://multimolecule.danling.org/about/privacy/ License FAQ <p>--8&lt;-- "about/license-faq.zh.md"</p>https://multimolecule.danling.org/zh/about/license-faq/ Wed, 11 Dec 2024 15:43:20 +0000MultiMoleculehttps://multimolecule.danling.org/zh/about/license-faq/ License <p>--8&lt;-- "LICENSE.zh.md"</p>https://multimolecule.danling.org/zh/about/license/ Wed, 11 Dec 2024 15:43:20 +0000MultiMoleculehttps://multimolecule.danling.org/zh/about/license/ Privacy Notice <p>--8&lt;-- "privacy.zh.md"</p>https://multimolecule.danling.org/zh/about/privacy/ Wed, 11 Dec 2024 15:43:20 +0000MultiMoleculehttps://multimolecule.danling.org/zh/about/privacy/ RIVAS Zhiyuan Chen <h1>RIVAS</h1><p>--8&lt;-- "multimolecule/datasets/rivas/README.md:21:"</p>https://multimolecule.danling.org/datasets/rivas/ Sun, 27 Oct 2024 13:07:13 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/rivas/ RIVAS Zhiyuan Chen <h1>RIVAS</h1><p>--8&lt;-- "multimolecule/datasets/rivas/README.md:21:"</p>https://multimolecule.danling.org/zh/datasets/rivas/ Sun, 27 Oct 2024 13:07:13 +0000MultiMoleculehttps://multimolecule.danling.org/zh/datasets/rivas/ ArchiveII Zhiyuan Chen <h1>ArchiveII</h1><p>--8&lt;-- "multimolecule/datasets/archiveii/README.md:24:"</p>https://multimolecule.danling.org/datasets/archiveii/ Thu, 24 Oct 2024 10:31:39 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/archiveii/ RNAStrAlign Zhiyuan Chen <h1>RNAStrAlign</h1><p>--8&lt;-- "multimolecule/datasets/rnastralign/README.md:24:"</p>https://multimolecule.danling.org/datasets/rnastralign/ Thu, 24 Oct 2024 10:31:39 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/rnastralign/ ArchiveII Zhiyuan Chen <h1>ArchiveII</h1><p>--8&lt;-- "multimolecule/datasets/archiveii/README.md:24:"</p>https://multimolecule.danling.org/zh/datasets/archiveii/ Thu, 24 Oct 2024 10:31:39 +0000MultiMoleculehttps://multimolecule.danling.org/zh/datasets/archiveii/ RNAStrAlign Zhiyuan Chen <h1>RNAStrAlign</h1><p>--8&lt;-- 
"multimolecule/datasets/rnastralign/README.md:24:"</p>https://multimolecule.danling.org/zh/datasets/rnastralign/ Thu, 24 Oct 2024 10:31:39 +0000MultiMoleculehttps://multimolecule.danling.org/zh/datasets/rnastralign/ ERNIE-RNA Zhiyuan Chen <h1>ERNIE-RNA</h1><p>--8&lt;-- "multimolecule/models/ernierna/README.md:42:"</p><p>::: multimolecule.models.ernierna</p>https://multimolecule.danling.org/models/ernierna/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/ernierna/ RiNALMo Zhiyuan Chen <h1>RiNALMo</h1><p>--8&lt;-- "multimolecule/models/rinalmo/README.md:45:"</p><p>::: multimolecule.models.rinalmo</p>https://multimolecule.danling.org/models/rinalmo/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/rinalmo/ RNABERT Zhiyuan Chen <h1>RNABERT</h1><p>--8&lt;-- "multimolecule/models/rnabert/README.md:42:"</p><p>::: multimolecule.models.rnabert</p>https://multimolecule.danling.org/models/rnabert/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/rnabert/ RNAErnie Zhiyuan Chen <h1>RNAErnie</h1><p>--8&lt;-- "multimolecule/models/rnaernie/README.md:42:"</p><p>::: multimolecule.models.rnaernie</p>https://multimolecule.danling.org/models/rnaernie/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/rnaernie/ RNA-FM Zhiyuan Chen <h1>RNA-FM</h1><p>--8&lt;-- "multimolecule/models/rnafm/README.md:42:"</p><p>::: multimolecule.models.rnafm</p>https://multimolecule.danling.org/models/rnafm/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/rnafm/ RNA-MSM Zhiyuan Chen <h1>RNA-MSM</h1><p>--8&lt;-- "multimolecule/models/rnamsm/README.md:42:"</p><p>::: multimolecule.models.rnamsm</p>https://multimolecule.danling.org/models/rnamsm/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/rnamsm/ SpliceBERT Zhiyuan Chen <h1>SpliceBERT</h1><p>--8&lt;-- 
"multimolecule/models/splicebert/README.md:42:"</p><p>::: multimolecule.models.splicebert</p>https://multimolecule.danling.org/models/splicebert/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/splicebert/ 3UTRBERT Zhiyuan Chen <h1>3UTRBERT</h1><p>--8&lt;-- "multimolecule/models/utrbert/README.md:42:"</p><p>::: multimolecule.models.utrbert</p>https://multimolecule.danling.org/models/utrbert/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/utrbert/ \ No newline at end of file + MultiMoleculeNeural Networks for RNA, DNA, and Proteinhttps://multimolecule.danling.org/MultiMoleculehttps://github.com/DLS5-Omics/multimoleculeen Wed, 11 Dec 2024 18:20:38 -0000 Wed, 11 Dec 2024 18:20:38 -0000 1440 MkDocs RSS plugin - v1.17.0 None MultiMoleculehttps://multimolecule.danling.org/ License FAQ <p>--8&lt;-- "about/license-faq.md"</p>https://multimolecule.danling.org/about/license-faq/ Wed, 11 Dec 2024 15:43:20 +0000MultiMoleculehttps://multimolecule.danling.org/about/license-faq/ License <p>--8&lt;-- "LICENSE.md"</p>https://multimolecule.danling.org/about/license/ Wed, 11 Dec 2024 15:43:20 +0000MultiMoleculehttps://multimolecule.danling.org/about/license/ Privacy Notice <p>--8&lt;-- "privacy.md"</p>https://multimolecule.danling.org/about/privacy/ Wed, 11 Dec 2024 15:43:20 +0000MultiMoleculehttps://multimolecule.danling.org/about/privacy/ License FAQ <p>--8&lt;-- "about/license-faq.zh.md"</p>https://multimolecule.danling.org/zh/about/license-faq/ Wed, 11 Dec 2024 15:43:20 +0000MultiMoleculehttps://multimolecule.danling.org/zh/about/license-faq/ License <p>--8&lt;-- "LICENSE.zh.md"</p>https://multimolecule.danling.org/zh/about/license/ Wed, 11 Dec 2024 15:43:20 +0000MultiMoleculehttps://multimolecule.danling.org/zh/about/license/ Privacy Notice <p>--8&lt;-- "privacy.zh.md"</p>https://multimolecule.danling.org/zh/about/privacy/ Wed, 11 Dec 2024 15:43:20 
+0000MultiMoleculehttps://multimolecule.danling.org/zh/about/privacy/ RIVAS Zhiyuan Chen <h1>RIVAS</h1><p>--8&lt;-- "multimolecule/datasets/rivas/README.md:21:"</p>https://multimolecule.danling.org/datasets/rivas/ Sun, 27 Oct 2024 13:07:13 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/rivas/ RIVAS Zhiyuan Chen <h1>RIVAS</h1><p>--8&lt;-- "multimolecule/datasets/rivas/README.md:21:"</p>https://multimolecule.danling.org/zh/datasets/rivas/ Sun, 27 Oct 2024 13:07:13 +0000MultiMoleculehttps://multimolecule.danling.org/zh/datasets/rivas/ ArchiveII Zhiyuan Chen <h1>ArchiveII</h1><p>--8&lt;-- "multimolecule/datasets/archiveii/README.md:24:"</p>https://multimolecule.danling.org/datasets/archiveii/ Thu, 24 Oct 2024 10:31:39 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/archiveii/ RNAStrAlign Zhiyuan Chen <h1>RNAStrAlign</h1><p>--8&lt;-- "multimolecule/datasets/rnastralign/README.md:24:"</p>https://multimolecule.danling.org/datasets/rnastralign/ Thu, 24 Oct 2024 10:31:39 +0000MultiMoleculehttps://multimolecule.danling.org/datasets/rnastralign/ ArchiveII Zhiyuan Chen <h1>ArchiveII</h1><p>--8&lt;-- "multimolecule/datasets/archiveii/README.md:24:"</p>https://multimolecule.danling.org/zh/datasets/archiveii/ Thu, 24 Oct 2024 10:31:39 +0000MultiMoleculehttps://multimolecule.danling.org/zh/datasets/archiveii/ RNAStrAlign Zhiyuan Chen <h1>RNAStrAlign</h1><p>--8&lt;-- "multimolecule/datasets/rnastralign/README.md:24:"</p>https://multimolecule.danling.org/zh/datasets/rnastralign/ Thu, 24 Oct 2024 10:31:39 +0000MultiMoleculehttps://multimolecule.danling.org/zh/datasets/rnastralign/ ERNIE-RNA Zhiyuan Chen <h1>ERNIE-RNA</h1><p>--8&lt;-- "multimolecule/models/ernierna/README.md:42:"</p><p>::: multimolecule.models.ernierna</p>https://multimolecule.danling.org/models/ernierna/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/ernierna/ RiNALMo Zhiyuan Chen <h1>RiNALMo</h1><p>--8&lt;-- 
"multimolecule/models/rinalmo/README.md:45:"</p><p>::: multimolecule.models.rinalmo</p>https://multimolecule.danling.org/models/rinalmo/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/rinalmo/ RNABERT Zhiyuan Chen <h1>RNABERT</h1><p>--8&lt;-- "multimolecule/models/rnabert/README.md:42:"</p><p>::: multimolecule.models.rnabert</p>https://multimolecule.danling.org/models/rnabert/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/rnabert/ RNAErnie Zhiyuan Chen <h1>RNAErnie</h1><p>--8&lt;-- "multimolecule/models/rnaernie/README.md:42:"</p><p>::: multimolecule.models.rnaernie</p>https://multimolecule.danling.org/models/rnaernie/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/rnaernie/ RNA-FM Zhiyuan Chen <h1>RNA-FM</h1><p>--8&lt;-- "multimolecule/models/rnafm/README.md:42:"</p><p>::: multimolecule.models.rnafm</p>https://multimolecule.danling.org/models/rnafm/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/rnafm/ RNA-MSM Zhiyuan Chen <h1>RNA-MSM</h1><p>--8&lt;-- "multimolecule/models/rnamsm/README.md:42:"</p><p>::: multimolecule.models.rnamsm</p>https://multimolecule.danling.org/models/rnamsm/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/rnamsm/ SpliceBERT Zhiyuan Chen <h1>SpliceBERT</h1><p>--8&lt;-- "multimolecule/models/splicebert/README.md:42:"</p><p>::: multimolecule.models.splicebert</p>https://multimolecule.danling.org/models/splicebert/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/splicebert/ 3UTRBERT Zhiyuan Chen <h1>3UTRBERT</h1><p>--8&lt;-- "multimolecule/models/utrbert/README.md:42:"</p><p>::: multimolecule.models.utrbert</p>https://multimolecule.danling.org/models/utrbert/ Wed, 16 Oct 2024 07:45:39 +0000MultiMoleculehttps://multimolecule.danling.org/models/utrbert/ \ No newline at end of file diff --git a/index.html b/index.html index 
35627e1f..8fdba69a 100644 --- a/index.html +++ b/index.html @@ -1004,19 +1004,6 @@

Citation day = 4 } -
-

Caution

-

The MultiMolecule project uses a GNU Affero General Public License. -Research papers are considered derivative works, and therefore must be licensed under the same terms.

-

You can only publish your research papers in fully open-access journals, conference, or pre-print servers that do not charge fees to publish or read. -You must obtain a waiver from the authors to publish in a closed-access / author-fee journal, conference, or pre-print server.

-

You may receive auto-approval for waivers if you are submitting to the following non-profit journals:

- -

Read more in the license faq.

-

License

We believe openness is the Foundation of Research.

MultiMolecule is licensed under the GNU Affero General Public License.

diff --git a/search/search_index.json b/search/search_index.json index bb724a4e..ae52a6fd 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en","zh"],"separator":"[\\s\\u200b\\-]","pipeline":["stemmer"]},"docs":[{"location":"","title":"MultiMolecule","text":"

Accelerate Molecular Biology Research with Machine Learning

"},{"location":"#introduction","title":"Introduction","text":"

Welcome to MultiMolecule (\u200b\u6d66\u539f\u200b), a foundational library designed to accelerate scientific research in molecular biology through machine learning. MultiMolecule provides a comprehensive yet flexible set of tools for researchers aiming to leverage AI with ease, focusing on biomolecular data (RNA, DNA, and protein).

"},{"location":"#overview","title":"Overview","text":"

MultiMolecule is built with flexibility and ease of use in mind. Its modular design allows you to utilize only the components you need, integrating seamlessly into your existing workflows without adding unnecessary complexity.

"},{"location":"#installation","title":"Installation","text":"

Install the most recent stable version on PyPI:

Bash
pip install multimolecule\n

Install the latest version from the source:

Bash
pip install git+https://github.com/DLS5-Omics/MultiMolecule\n
"},{"location":"#citation","title":"Citation","text":"

If you use MultiMolecule in your research, please cite us as follows:

BibTeX
@software{chen_2024_12638419,\n  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},\n  title     = {MultiMolecule},\n  doi       = {10.5281/zenodo.12638419},\n  publisher = {Zenodo},\n  url       = {https://doi.org/10.5281/zenodo.12638419},\n  year      = 2024,\n  month     = may,\n  day       = 4\n}\n

Caution

The MultiMolecule project uses a GNU Affero General Public License. Research papers are considered derivative works, and therefore must be licensed under the same terms.

You can only publish your research papers in fully open-access journals, conference, or pre-print servers that do not charge fees to publish or read. You must obtain a waiver from the authors to publish in a closed-access / author-fee journal, conference, or pre-print server.

You may receive auto-approval for waivers if you are submitting to the following non-profit journals:

Read more in the license faq.

"},{"location":"#license","title":"License","text":"

We believe openness is the Foundation of Research.

MultiMolecule is licensed under the GNU Affero General Public License.

Please join us in building an open research community.

SPDX-License-Identifier: AGPL-3.0-or-later

"},{"location":"about/","title":"About","text":"

Developed by DanLing on Earth

We are a community of developers, designers, and others from around the world who are working together to make deep learning more accessible.

We are a community of individuals who seek to push the boundaries of what is possible with deep learning.

We are passionate about Deep Learning and the people who use it.

We are DanLing.

"},{"location":"about/license-faq/","title":"License FAQ","text":"

This License FAQ explains the terms and conditions under which you may use the data, models, code, configuration, documentation, and weights provided by the DanLing Team (also known as DanLing) (\u2018we\u2019, \u2018us\u2019, or \u2018our\u2019). It serves as an addendum to our License.

"},{"location":"about/license-faq/#0-summary-of-key-points","title":"0. Summary of Key Points","text":"

This summary provides key points from our license, but you can find out more details about any of these topics by clicking the link following each key point and by reading the full license.

What constitutes the \u2018source code\u2019 in MultiMolecule?

We consider everything in our repositories to be source code, including data, models, code, configuration, and documentation.

What constitutes the \u2018source code\u2019 in MultiMolecule?

Can I publish research papers using MultiMolecule?

It depends.

You can publish research papers in fully open-access journals, conferences, or preprint servers under the terms of the License.

You must obtain a separate license from us to publish research papers in closed-access journals and conferences.

Can I publish research papers using MultiMolecule?

Can I use MultiMolecule for commercial purposes?

Yes, you can use MultiMolecule for commercial purposes under the terms of the License.

Can I use MultiMolecule for commercial purposes?

Do people affiliated with certain organizations have specific license terms?

Yes, people affiliated with certain organizations have specific license terms.

Do people affiliated with certain organizations have specific license terms?

"},{"location":"about/license-faq/#1-what-constitutes-the-source-code-in-multimolecule","title":"1. What constitutes the \u201csource code\u201d in MultiMolecule?","text":"

We consider everything in our repositories to be source code.

The training process of machine learning models is viewed similarly to the compilation process of traditional software. As such, the model, code, configuration, documentation, and data used for training are all part of the source code, while the trained model weights are part of the object code.

We also consider research papers and manuscripts a special form of documentation, which are also part of the source code.

"},{"location":"about/license-faq/#2-can-i-publish-research-papers-using-multimolecule","title":"2. Can I publish research papers using MultiMolecule?","text":"

Since research papers are considered a form of source code, publishers are legally required to open-source all materials on their server to comply with the License if they publish papers using MultiMolecule. This is generally impractical for most publishers.

As a special exemption under section 7 of the License, we grant permission to publish research papers using MultiMolecule in fully open-access journals, conferences, or preprint servers that do not charge authors any fee, provided all published manuscripts are made available under the GNU Free Documentation License (GFDL), a Creative Commons license, or an OSI-approved license that permits the sharing of manuscripts.

As a special exemption under section 7 of the License, we grant permission to publish research papers using MultiMolecule in certain non-profit journals, conferences, or preprint servers. Currently, the non-profit journals, conferences, or preprint servers we allow include:

For publishing in closed access journals or conferences, you must obtain a separate license from us. This typically involves co-authorship, a fee to support the project, or both. Contact us at multimolecule@zyc.ai for more information.

While not mandatory, we recommend citing the MultiMolecule project in your research papers.

"},{"location":"about/license-faq/#3-can-i-use-multimolecule-for-commercial-purposes","title":"3. Can I use MultiMolecule for commercial purposes?","text":"

Yes, MultiMolecule can be used for commercial purposes under the License. However, you must open-source any modifications to the source code and make them available under the License.

If you prefer to use MultiMolecule for commercial purposes without open-sourcing your modifications, you must obtain a separate license from us. This typically involves a fee to support the project. Contact us at multimolecule@zyc.ai for further details.

"},{"location":"about/license-faq/#4-do-people-affiliated-with-certain-organizations-have-specific-license-terms","title":"4. Do people affiliated with certain organizations have specific license terms?","text":"

YES!

If you are affiliated with an organization that has a separate license agreement with us, you may be subject to different license terms. Please consult your organization\u2019s legal department to determine if you are subject to a separate license agreement.

Members of the following organizations automatically receive a non-transferable, non-sublicensable, and non-distributable MIT License to use MultiMolecule:

This special license is considered an additional term under section 7 of the License. It is not redistributable, and you are prohibited from creating any independent derivative works. Any modifications or derivative works based on this license are automatically considered derivative works of MultiMolecule and must comply with all the terms of the License. This ensures that third parties cannot bypass the license terms or create separate licenses from derivative works.

"},{"location":"about/license-faq/#5-how-can-i-use-multimolecule-if-my-organization-forbids-the-use-of-code-under-the-agpl-license","title":"5. How can I use MultiMolecule if my organization forbids the use of code under the AGPL License?","text":"

Some organizations, such as Google, have policies that prohibit the use of code under the AGPL License.

If you are affiliated with an organization that forbids the use of AGPL-licensed code, you must obtain a separate license from us. Contact us at multimolecule@zyc.ai for more information.

"},{"location":"about/license-faq/#6-can-i-use-multimolecule-if-i-am-a-federal-employee-of-the-united-states-government","title":"6. Can I use MultiMolecule if I am a federal employee of the United States Government?","text":"

No.

Code written by federal employees of the United States Government is not protected by copyright under 17 U.S. Code \u00a7 105.

As a result, federal employees of the United States Government cannot comply with the terms of the License.

"},{"location":"about/license-faq/#7-do-we-make-updates-to-this-faq","title":"7. Do we make updates to this FAQ?","text":"

In Short

Yes, we will update this FAQ as necessary to stay compliant with relevant laws.

We may update this license FAQ from time to time. The updated version will be indicated by an updated \u2018Last Revised Time\u2019 at the bottom of this license FAQ. If we make any material changes, we will notify you by posting the new license FAQ on this page. We are unable to notify you directly as we do not collect any contact information from you. We encourage you to review this license FAQ frequently to stay informed of how you can use our data, models, code, configuration, documentation, and weights.

"},{"location":"about/license/","title":"GNU AFFERO GENERAL PUBLIC LICENSE","text":"

Version 3, 19 November 2007

Copyright (C) 2007 Free Software Foundation, Inc. https://fsf.org/

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

"},{"location":"about/license/#preamble","title":"Preamble","text":"

The GNU Affero General Public License is a free, copyleft license for software and other kinds of works, specifically designed to ensure cooperation with the community in the case of network server software.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, our General Public Licenses are intended to guarantee your freedom to share and change all versions of a program\u2013to make sure it remains free software for all its users.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

Developers that use our General Public Licenses protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License which gives you legal permission to copy, distribute and/or modify the software.

A secondary benefit of defending all users\u2019 freedom is that improvements made in alternate versions of the program, if they receive widespread use, become available for other developers to incorporate. Many developers of free software are heartened and encouraged by the resulting cooperation. However, in the case of software used on network servers, this result may fail to come about. The GNU General Public License permits making a modified version and letting the public access it on a server without ever releasing its source code to the public.

The GNU Affero General Public License is designed specifically to ensure that, in such cases, the modified source code becomes available to the community. It requires the operator of a network server to provide the source code of the modified version running there to the users of that server. Therefore, public use of a modified version, on a publicly accessible server, gives the public access to the source code of the modified version.

An older license, called the Affero General Public License and published by Affero, was designed to accomplish similar goals. This is a different license, not a version of the Affero GPL, but Affero has released a new version of the Affero GPL which permits relicensing under this license.

The precise terms and conditions for copying, distribution and modification follow.

"},{"location":"about/license/#terms-and-conditions","title":"TERMS AND CONDITIONS","text":""},{"location":"about/license/#0-definitions","title":"0. Definitions.","text":"

\u201cThis License\u201d refers to version 3 of the GNU Affero General Public License.

\u201cCopyright\u201d also means copyright-like laws that apply to other kinds of works, such as semiconductor masks.

\u201cThe Program\u201d refers to any copyrightable work licensed under this License. Each licensee is addressed as \u201cyou\u201d. \u201cLicensees\u201d and \u201crecipients\u201d may be individuals or organizations.

To \u201cmodify\u201d a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a \u201cmodified version\u201d of the earlier work or a work \u201cbased on\u201d the earlier work.

A \u201ccovered work\u201d means either the unmodified Program or a work based on the Program.

To \u201cpropagate\u201d a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well.

To \u201cconvey\u201d a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying.

An interactive user interface displays \u201cAppropriate Legal Notices\u201d to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion.

"},{"location":"about/license/#1-source-code","title":"1. Source Code.","text":"

The \u201csource code\u201d for a work means the preferred form of the work for making modifications to it. \u201cObject code\u201d means any non-source form of a work.

A \u201cStandard Interface\u201d means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language.

The \u201cSystem Libraries\u201d of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A \u201cMajor Component\u201d, in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it.

The \u201cCorresponding Source\u201d for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work\u2019s System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work.

The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source.

The Corresponding Source for a work in source code form is that same work.

"},{"location":"about/license/#2-basic-permissions","title":"2. Basic Permissions.","text":"

All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law.

You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you.

Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary.

"},{"location":"about/license/#3-protecting-users-legal-rights-from-anti-circumvention-law","title":"3. Protecting Users\u2019 Legal Rights From Anti-Circumvention Law.","text":"

No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures.

When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work\u2019s users, your or third parties\u2019 legal rights to forbid circumvention of technological measures.

"},{"location":"about/license/#4-conveying-verbatim-copies","title":"4. Conveying Verbatim Copies.","text":"

You may convey verbatim copies of the Program\u2019s source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program.

You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee.

"},{"location":"about/license/#5-conveying-modified-source-versions","title":"5. Conveying Modified Source Versions.","text":"

You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:

A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an \u201caggregate\u201d if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation\u2019s users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate.

"},{"location":"about/license/#6-conveying-non-source-forms","title":"6. Conveying Non-Source Forms.","text":"

You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways:

A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work.

A \u201cUser Product\u201d is either (1) a \u201cconsumer product\u201d, which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, \u201cnormally used\u201d refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product.

\u201cInstallation Information\u201d for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made.

If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM).

The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network.

Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying.

"},{"location":"about/license/#7-additional-terms","title":"7. Additional Terms.","text":"

\u201cAdditional permissions\u201d are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions.

When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission.

Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms:

All other non-permissive additional terms are considered \u201cfurther restrictions\u201d within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying.

If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms.

Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way.

"},{"location":"about/license/#8-termination","title":"8. Termination.","text":"

You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11).

However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation.

Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice.

Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10.

"},{"location":"about/license/#9-acceptance-not-required-for-having-copies","title":"9. Acceptance Not Required for Having Copies.","text":"

You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so.

"},{"location":"about/license/#10-automatic-licensing-of-downstream-recipients","title":"10. Automatic Licensing of Downstream Recipients.","text":"

Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License.

An \u201centity transaction\u201d is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party\u2019s predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts.

You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it.

"},{"location":"about/license/#11-patents","title":"11. Patents.","text":"

A \u201ccontributor\u201d is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor\u2019s \u201ccontributor version\u201d.

A contributor\u2019s \u201cessential patent claims\u201d are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, \u201ccontrol\u201d includes the right to grant patent sublicenses in a manner consistent with the requirements of this License.

Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor\u2019s essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version.

In the following three paragraphs, a \u201cpatent license\u201d is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To \u201cgrant\u201d such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party.

If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. \u201cKnowingly relying\u201d means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient\u2019s use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid.

If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it.

A patent license is \u201cdiscriminatory\u201d if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007.

Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law.

"},{"location":"about/license/#12-no-surrender-of-others-freedom","title":"12. No Surrender of Others\u2019 Freedom.","text":"

If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program.

"},{"location":"about/license/#13-remote-network-interaction-use-with-the-gnu-general-public-license","title":"13. Remote Network Interaction; Use with the GNU General Public License.","text":"

Notwithstanding any other provision of this License, if you modify the Program, your modified version must prominently offer all users interacting with it remotely through a computer network (if your version supports such interaction) an opportunity to receive the Corresponding Source of your version by providing access to the Corresponding Source from a network server at no charge, through some standard or customary means of facilitating copying of software. This Corresponding Source shall include the Corresponding Source for any work covered by version 3 of the GNU General Public License that is incorporated pursuant to the following paragraph.

Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the work with which it is combined will remain governed by version 3 of the GNU General Public License.

"},{"location":"about/license/#14-revised-versions-of-this-license","title":"14. Revised Versions of this License.","text":"

The Free Software Foundation may publish revised and/or new versions of the GNU Affero General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU Affero General Public License \u201cor any later version\u201d applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU Affero General Public License, you may choose any version ever published by the Free Software Foundation.

If the Program specifies that a proxy can decide which future versions of the GNU Affero General Public License can be used, that proxy\u2019s public statement of acceptance of a version permanently authorizes you to choose that version for the Program.

Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version.

"},{"location":"about/license/#15-disclaimer-of-warranty","title":"15. Disclaimer of Warranty.","text":"

THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM \u201cAS IS\u201d WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

"},{"location":"about/license/#16-limitation-of-liability","title":"16. Limitation of Liability.","text":"

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

"},{"location":"about/license/#17-interpretation-of-sections-15-and-16","title":"17. Interpretation of Sections 15 and 16.","text":"

If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee.

END OF TERMS AND CONDITIONS

"},{"location":"about/license/#how-to-apply-these-terms-to-your-new-programs","title":"How to Apply These Terms to Your New Programs","text":"

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the \u201ccopyright\u201d line and a pointer to where the full notice is found.

Text Only
    <one line to give the program's name and a brief idea of what it does.>\n    Copyright (C) <year>  <name of author>\n\n    This program is free software: you can redistribute it and/or modify\n    it under the terms of the GNU Affero General Public License as\n    published by the Free Software Foundation, either version 3 of the\n    License, or (at your option) any later version.\n\n    This program is distributed in the hope that it will be useful,\n    but WITHOUT ANY WARRANTY; without even the implied warranty of\n    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n    GNU Affero General Public License for more details.\n\n    You should have received a copy of the GNU Affero General Public License\n    along with this program.  If not, see <https://www.gnu.org/licenses/>.\n

Also add information on how to contact you by electronic and paper mail.

If your software can interact with users remotely through a computer network, you should also make sure that it provides a way for users to get its source. For example, if your program is a web application, its interface could display a \u201cSource\u201d link that leads users to an archive of the code. There are many ways you could offer source, and different solutions will be better for different programs; see section 13 for the specific requirements.

You should also get your employer (if you work as a programmer) or school, if any, to sign a \u201ccopyright disclaimer\u201d for the program, if necessary. For more information on this, and how to apply and follow the GNU AGPL, see https://www.gnu.org/licenses/.

"},{"location":"data/","title":"data","text":"

data provides a collection of utilities for loading and processing data.

While datasets is a powerful library for managing datasets, it is a general-purpose tool that may not cover all the specific functionalities of scientific applications.

The data package is designed to complement datasets by offering additional data processing utilities that are commonly used in scientific tasks.

"},{"location":"data/#usage","title":"Usage","text":""},{"location":"data/#load-from-local-data-file","title":"Load from local data file","text":"Python
from multimolecule.data import Dataset\n\ndata = Dataset(\"data/rna/5utr.csv\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"data/#load-from-datasets","title":"Load from datasets","text":"Python
from multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"data/dataset/","title":"Dataset","text":""},{"location":"data/dataset/#multimolecule.data.Dataset","title":"multimolecule.data.Dataset","text":"

Bases: Dataset

The base class for all datasets.

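Dataset is a subclass of datasets.Dataset that provides additional functionality for handling structured data. It has three main features: column identification, which identifies the special columns (sequence and structure columns) in the dataset; tokenization, which tokenizes the sequence columns using a pretrained tokenizer; and task inference, which infers the task type and level of each label column.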

Attributes:

Name Type Description

tasks NestedDict

A nested dictionary of the inferred tasks for each label column in the dataset.

tokenizer PreTrainedTokenizerBase

The pretrained tokenizer to use for tokenization.

truncation bool

Whether to truncate sequences that exceed the maximum length of the tokenizer.

max_seq_length int

The maximum length of the input sequences.

data_cols List

The names of all columns in the dataset.

feature_cols List

The names of the feature columns in the dataset.

label_cols List

The names of the label columns in the dataset.

sequence_cols List

The names of the sequence columns in the dataset.

column_names_map Mapping[str, str] | None

A mapping of column names to new column names.

preprocess bool

Whether to preprocess the dataset.

Parameters:

Name Type Description Default

data Table | DataFrame | dict | list | str

The dataset. This can be a path to a file, a tag on the Hugging Face Hub, a pyarrow.Table, a dict, a list, or a pandas.DataFrame.

required

split NamedSplit

The split of the dataset.

required

tokenizer PreTrainedTokenizerBase | None

A pretrained tokenizer to use for tokenization. Either tokenizer or pretrained must be specified.

None

pretrained str | None

The name of a pretrained tokenizer to use for tokenization. Either tokenizer or pretrained must be specified.

None

feature_cols List | None

The names of the feature columns in the dataset. Will be inferred automatically if not specified.

None

label_cols List | None

The names of the label columns in the dataset. Will be inferred automatically if not specified.

None

id_cols List | None

The names of the ID columns in the dataset. Will be inferred automatically if not specified.

None

preprocess bool | None

Whether to preprocess the dataset. Preprocessing involves pre-tokenizing the sequences using the tokenizer. Defaults to True.

None

auto_rename_sequence_col bool | None

Whether to automatically rename the sequence column to the standard name. Only works when there is exactly one sequence column. You can control the naming through multimolecule.defaults.SEQUENCE_COL_NAME. For more refined control, use column_names_map.

None

auto_rename_label_col bool | None

Whether to automatically rename the label column to the standard name. Only works when there is exactly one label column. You can control the naming through multimolecule.defaults.LABEL_COL_NAME. For more refined control, use column_names_map.

None

column_names_map Mapping[str, str] | None

A mapping of column names to new column names. This is useful for renaming columns to inputs that are expected by a model. Defaults to None.

None

truncation bool | None

Whether to truncate sequences that exceed the maximum length of the tokenizer. Defaults to False.

None

max_seq_length int | None

The maximum length of the input sequences. Defaults to the model_max_length of the tokenizer.

None

tasks Mapping[str, Task] | None

A mapping of column names to tasks. Will be inferred automatically if not specified.

None

discrete_map Mapping[str, int] | None

A mapping of column names to discrete mappings. This is useful for mapping raw values to nominal values in classification tasks. Will be inferred automatically if not specified.

None

nan_process str

How to handle NaN and inf values in the dataset. Can be \u201cignore\u201d, \u201cerror\u201d, \u201cdrop\u201d, or \u201cfill\u201d. Defaults to \u201cignore\u201d.

'ignore'

fill_value str | int | float

The value to fill NaN and inf values with. Defaults to 0.

0

info DatasetInfo | None

The dataset info.

None

indices_table Table | None

The indices table.

None

fingerprint str | None

The fingerprint of the dataset.

None

Source code in multimolecule/data/dataset.py Python
class Dataset(datasets.Dataset):\n    r\"\"\"\n    The base class for all datasets.\n\n    Dataset is a subclass of [`datasets.Dataset`][] that provides additional functionality for handling structured data.\n    It has three main features:\n\n    - column identification: identify the special columns (sequence and structure columns) in the dataset.\n    - tokenization: tokenize the sequence columns in the dataset using a pretrained tokenizer.\n    - task inference: infer the task type and level of each label column in the dataset.\n\n    Attributes:\n        tasks: A nested dictionary of the inferred tasks for each label column in the dataset.\n        tokenizer: The pretrained tokenizer to use for tokenization.\n        truncation: Whether to truncate sequences that exceed the maximum length of the tokenizer.\n        max_seq_length: The maximum length of the input sequences.\n        data_cols: The names of all columns in the dataset.\n        feature_cols: The names of the feature columns in the dataset.\n        label_cols: The names of the label columns in the dataset.\n        sequence_cols: The names of the sequence columns in the dataset.\n        column_names_map: A mapping of column names to new column names.\n        preprocess: Whether to preprocess the dataset.\n\n    Args:\n        data: The dataset. 
This can be a path to a file, a tag on the Hugging Face Hub, a pyarrow.Table,\n            a [dict][], a [list][], or a [pandas.DataFrame][].\n        split: The split of the dataset.\n        tokenizer: A pretrained tokenizer to use for tokenization.\n            Either `tokenizer` or `pretrained` must be specified.\n        pretrained: The name of a pretrained tokenizer to use for tokenization.\n            Either `tokenizer` or `pretrained` must be specified.\n        feature_cols: The names of the feature columns in the dataset.\n            Will be inferred automatically if not specified.\n        label_cols: The names of the label columns in the dataset.\n            Will be inferred automatically if not specified.\n        id_cols: The names of the ID columns in the dataset.\n            Will be inferred automatically if not specified.\n        preprocess: Whether to preprocess the dataset.\n            Preprocessing involves pre-tokenizing the sequences using the tokenizer.\n            Defaults to `True`.\n        auto_rename_sequence_col: Whether to automatically rename sequence columns to standard name.\n            Only works when there is exactly one sequence column\n            You can control the naming through `multimolecule.defaults.SEQUENCE_COL_NAME`.\n            For more refined control, use `column_names_map`.\n        auto_rename_label_cols: Whether to automatically rename label column to standard name.\n            Only works when there is exactly one label column.\n            You can control the naming through `multimolecule.defaults.LABEL_COL_NAME`.\n            For more refined control, use `column_names_map`.\n        column_names_map: A mapping of column names to new column names.\n            This is useful for renaming columns to inputs that are expected by a model.\n            Defaults to `None`.\n        truncation: Whether to truncate sequences that exceed the maximum length of the tokenizer.\n            Defaults to `False`.\n    
    max_seq_length: The maximum length of the input sequences.\n            Defaults to the `model_max_length` of the tokenizer.\n        tasks: A mapping of column names to tasks.\n            Will be inferred automatically if not specified.\n        discrete_map: A mapping of column names to discrete mappings.\n            This is useful for mapping the raw value to nominal value in classification tasks.\n            Will be inferred automatically if not specified.\n        nan_process: How to handle NaN and inf values in the dataset.\n            Can be \"ignore\", \"error\", \"drop\", or \"fill\". Defaults to \"ignore\".\n        fill_value: The value to fill NaN and inf values with.\n            Defaults to 0.\n        info: The dataset info.\n        indices_table: The indices table.\n        fingerprint: The fingerprint of the dataset.\n    \"\"\"\n\n    tokenizer: PreTrainedTokenizerBase\n    truncation: bool = False\n    max_seq_length: int\n    seq_length_offset: int = 0\n\n    _id_cols: List\n    _feature_cols: List\n    _label_cols: List\n\n    _sequence_cols: List\n    _secondary_structure_cols: List\n\n    _tasks: NestedDict[str, Task]\n    _discrete_map: Mapping\n\n    preprocess: bool = True\n    auto_rename_sequence_col: bool = True\n    auto_rename_label_col: bool = False\n    column_names_map: Mapping[str, str] | None = None\n    ignored_cols: List[str] = []\n\n    def __init__(\n        self,\n        data: Table | DataFrame | dict | list | str,\n        split: datasets.NamedSplit,\n        tokenizer: PreTrainedTokenizerBase | None = None,\n        pretrained: str | None = None,\n        feature_cols: List | None = None,\n        label_cols: List | None = None,\n        id_cols: List | None = None,\n        preprocess: bool | None = None,\n        auto_rename_sequence_col: bool | None = None,\n        auto_rename_label_col: bool | None = None,\n        column_names_map: Mapping[str, str] | None = None,\n        truncation: bool | None = None,\n  
      max_seq_length: int | None = None,\n        tasks: Mapping[str, Task] | None = None,\n        discrete_map: Mapping[str, int] | None = None,\n        nan_process: str = \"ignore\",\n        fill_value: str | int | float = 0,\n        info: datasets.DatasetInfo | None = None,\n        indices_table: Table | None = None,\n        fingerprint: str | None = None,\n        ignored_cols: List[str] | None = None,\n    ):\n        self._tasks = NestedDict()\n        if tasks is not None:\n            self.tasks = tasks\n        if discrete_map is not None:\n            self._discrete_map = discrete_map\n        arrow_table = self.build_table(\n            data, split, feature_cols, label_cols, nan_process=nan_process, fill_value=fill_value\n        )\n        super().__init__(\n            arrow_table=arrow_table, split=split, info=info, indices_table=indices_table, fingerprint=fingerprint\n        )\n        self.identify_special_cols(feature_cols=feature_cols, label_cols=label_cols, id_cols=id_cols)\n        self.post(\n            tokenizer=tokenizer,\n            pretrained=pretrained,\n            preprocess=preprocess,\n            truncation=truncation,\n            max_seq_length=max_seq_length,\n            auto_rename_sequence_col=auto_rename_sequence_col,\n            auto_rename_label_col=auto_rename_label_col,\n            column_names_map=column_names_map,\n        )\n        self.ignored_cols = ignored_cols or self.id_cols\n        self.train = split == datasets.Split.TRAIN\n\n    def build_table(\n        self,\n        data: Table | DataFrame | dict | str,\n        split: datasets.NamedSplit,\n        feature_cols: List | None = None,\n        label_cols: List | None = None,\n        nan_process: str | None = \"ignore\",\n        fill_value: str | int | float = 0,\n    ) -> datasets.table.Table:\n        if isinstance(data, str):\n            try:\n                data = datasets.load_dataset(data, split=split).data\n            except 
FileNotFoundError:\n                data = dl.load_pandas(data)\n                if isinstance(data, DataFrame):\n                    data = data.loc[:, ~data.columns.str.contains(\"^Unnamed\")]\n                    data = pa.Table.from_pandas(data, preserve_index=False)\n        elif isinstance(data, dict):\n            data = pa.Table.from_pydict(data)\n        elif isinstance(data, list):\n            data = pa.Table.from_pylist(data)\n        elif isinstance(data, DataFrame):\n            data = pa.Table.from_pandas(data, preserve_index=False)\n        if feature_cols is not None and label_cols is not None:\n            data = data.select(feature_cols + label_cols)\n        data = self.process_nan(data, nan_process=nan_process, fill_value=fill_value)\n        return data\n\n    def post(\n        self,\n        tokenizer: PreTrainedTokenizerBase | None = None,\n        pretrained: str | None = None,\n        max_seq_length: int | None = None,\n        truncation: bool | None = None,\n        preprocess: bool | None = None,\n        auto_rename_sequence_col: bool | None = None,\n        auto_rename_label_col: bool | None = None,\n        column_names_map: Mapping[str, str] | None = None,\n    ) -> None:\n        r\"\"\"\n        Perform pre-processing steps after initialization.\n\n        It first identifies the special columns (sequence and structure columns) in the dataset.\n        Then it sets the feature and label columns based on the input arguments.\n        If `auto_rename_sequence_col` is `True`, it will automatically rename the sequence column.\n        If `auto_rename_label_col` is `True`, it will automatically rename the label column.\n        Finally, it sets the [`transform`][datasets.Dataset.set_transform] function based on the `preprocess` flag.\n        \"\"\"\n        if tokenizer is None:\n            if pretrained is None:\n                raise ValueError(\"tokenizer and pretrained can not be both None.\")\n            tokenizer = 
AutoTokenizer.from_pretrained(pretrained)\n        if max_seq_length is None:\n            max_seq_length = tokenizer.model_max_length\n        else:\n            tokenizer.model_max_length = max_seq_length\n        self.tokenizer = tokenizer\n        self.max_seq_length = max_seq_length\n        if truncation is not None:\n            self.truncation = truncation\n        if self.tokenizer.cls_token is not None:\n            self.seq_length_offset += 1\n        if self.tokenizer.sep_token is not None and self.tokenizer.sep_token != self.tokenizer.eos_token:\n            self.seq_length_offset += 1\n        if self.tokenizer.eos_token is not None:\n            self.seq_length_offset += 1\n        if preprocess is not None:\n            self.preprocess = preprocess\n        if auto_rename_sequence_col is not None:\n            self.auto_rename_sequence_col = auto_rename_sequence_col\n        if auto_rename_label_col is not None:\n            self.auto_rename_label_col = auto_rename_label_col\n        if column_names_map is None:\n            column_names_map = {}\n        if self.auto_rename_sequence_col:\n            if len(self.sequence_cols) != 1:\n                raise ValueError(\"auto_rename_sequence_col can only be used when there is exactly one sequence column.\")\n            column_names_map[self.sequence_cols[0]] = defaults.SEQUENCE_COL_NAME  # type: ignore[index]\n        if self.auto_rename_label_col:\n            if len(self.label_cols) != 1:\n                raise ValueError(\"auto_rename_label_col can only be used when there is exactly one label column.\")\n            column_names_map[self.label_cols[0]] = defaults.LABEL_COL_NAME  # type: ignore[index]\n        self.column_names_map = column_names_map\n        if self.column_names_map:\n            self.rename_columns(self.column_names_map)\n        self.infer_tasks()\n\n        if self.preprocess:\n            self.update(self.map(self.tokenization))\n            if self.secondary_structure_cols:\n 
               self.update(self.map(self.convert_secondary_structure))\n            if self.discrete_map:\n                self.update(self.map(self.map_discrete))\n            fn_kwargs = {\n                \"columns\": [name for name, task in self.tasks.items() if task.level in [\"token\", \"contact\"]],\n                \"max_seq_length\": self.max_seq_length - self.seq_length_offset,\n            }\n            if self.truncation and 0 < self.max_seq_length < 2**32:\n                self.update(self.map(self.truncate, fn_kwargs=fn_kwargs))\n        self.set_transform(self.transform)\n\n    def transform(self, batch: Mapping) -> Mapping:\n        r\"\"\"\n        Default [`transform`][datasets.Dataset.set_transform].\n\n        See Also:\n            [`collate`][multimolecule.Dataset.collate]\n        \"\"\"\n        return {k: self.collate(k, v) for k, v in batch.items() if k not in self.ignored_cols}\n\n    def collate(self, col: str, data: Any) -> Tensor | NestedTensor | None:\n        r\"\"\"\n        Collate the data for a column.\n\n        If the column is a sequence column, it will tokenize the data if `tokenize` is `True`.\n        Otherwise, it will return a tensor or nested tensor.\n        \"\"\"\n        if col in self.sequence_cols:\n            if isinstance(data[0], str):\n                data = self.tokenize(data)\n            return NestedTensor(data)\n        if not self.preprocess:\n            if col in self.discrete_map:\n                data = map_value(data, self.discrete_map[col])\n            if col in self.tasks:\n                data = truncate_value(data, self.max_seq_length - self.seq_length_offset, self.tasks[col].level)\n        if isinstance(data[0], str):\n            return data\n        try:\n            return torch.tensor(data)\n        except ValueError:\n            return NestedTensor(data)\n\n    def infer_tasks(self, sequence_col: str | None = None) -> NestedDict:\n        for col in self.label_cols:\n            if col 
in self.tasks:\n                continue\n            if col in self.secondary_structure_cols:\n                task = Task(TaskType.Binary, level=TaskLevel.Contact, num_labels=1)\n                self.tasks[col] = task  # type: ignore[index]\n                warn(\n                    f\"Secondary structure columns are assumed to be {task}. \"\n                    \"Please explicitly specify the task if this is not the case.\"\n                )\n            else:\n                try:\n                    self.tasks[col] = self.infer_task(col, sequence_col)  # type: ignore[index]\n                except ValueError:\n                    raise ValueError(f\"Unable to infer task for column {col}.\")\n        return self.tasks\n\n    def infer_task(self, label_col: str, sequence_col: str | None = None) -> Task:\n        if sequence_col is None:\n            if len(self.sequence_cols) != 1:\n                raise ValueError(\"sequence_col must be specified if there are multiple sequence columns.\")\n            sequence_col = self.sequence_cols[0]\n        sequence = self._data.column(sequence_col)\n        column = self._data.column(label_col)\n        return infer_task(\n            sequence,\n            column,\n            truncation=self.truncation,\n            max_seq_length=self.max_seq_length,\n            seq_length_offset=self.seq_length_offset,\n        )\n\n    def infer_discrete_map(self, discrete_map: Mapping | None = None):\n        self._discrete_map = discrete_map or NestedDict()\n        ignored_cols = set(self.discrete_map.keys()) | set(self.sequence_cols) | set(self.secondary_structure_cols)\n        data_cols = [i for i in self.data_cols if i not in ignored_cols]\n        for col in data_cols:\n            discrete_map = infer_discrete_map(self._data.column(col))\n            if discrete_map:\n                self._discrete_map[col] = discrete_map  # type: ignore[index]\n        return self._discrete_map\n\n    def __getitems__(self, keys: int | 
slice | Iterable[int]) -> Any:\n        return self.__getitem__(keys)\n\n    def identify_special_cols(\n        self, feature_cols: List | None = None, label_cols: List | None = None, id_cols: List | None = None\n    ) -> Sequence:\n        all_cols = self.data.column_names\n        self._id_cols = id_cols or [i for i in all_cols if i in defaults.ID_COL_NAMES]\n\n        string_cols: list[str] = [k for k, v in self.features.items() if k not in self.id_cols and v.dtype == \"string\"]\n        self._sequence_cols = [i for i in string_cols if i.lower() in defaults.SEQUENCE_COL_NAMES]\n        self._secondary_structure_cols = [i for i in string_cols if i in defaults.SECONDARY_STRUCTURE_COL_NAMES]\n\n        data_cols = [i for i in all_cols if i not in self.id_cols]\n        if label_cols is None:\n            if feature_cols is None:\n                feature_cols = [i for i in data_cols if i in defaults.SEQUENCE_COL_NAMES]\n            label_cols = [i for i in data_cols if i not in feature_cols]\n        self._label_cols = label_cols\n        if feature_cols is None:\n            feature_cols = [i for i in data_cols if i not in self.label_cols]\n        self._feature_cols = feature_cols\n        missing_feature_cols = set(self.feature_cols).difference(data_cols)\n        if missing_feature_cols:\n            raise ValueError(f\"{missing_feature_cols} are specified in feature_cols, but not found in dataset.\")\n        missing_label_cols = set(self.label_cols).difference(data_cols)\n        if missing_label_cols:\n            raise ValueError(f\"{missing_label_cols} are specified in label_cols, but not found in dataset.\")\n        return string_cols\n\n    def tokenize(self, string: str) -> Tensor:\n        return self.tokenizer(string, return_attention_mask=False, truncation=self.truncation)[\"input_ids\"]\n\n    def tokenization(self, data: Mapping[str, str]) -> Mapping[str, Tensor]:\n        return {col: self.tokenize(data[col]) for col in self.sequence_cols}\n\n   
 def convert_secondary_structure(self, data: Mapping) -> Mapping:\n        return {col: dot_bracket_to_contact_map(data[col]) for col in self.secondary_structure_cols}\n\n    def map_discrete(self, data: Mapping) -> Mapping:\n        return {name: map_value(data[name], mapping) for name, mapping in self.discrete_map.items()}\n\n    def truncate(self, data: Mapping, columns: List[str], max_seq_length: int) -> Mapping:\n        return {name: truncate_value(data[name], max_seq_length, self.tasks[name].level) for name in columns}\n\n    def update(self, dataset: datasets.Dataset):\n        r\"\"\"\n        Perform an in-place update of the dataset.\n\n        This method is used to update the dataset after changes have been made to the underlying data.\n        It updates the format columns, data, info, and fingerprint of the dataset.\n        \"\"\"\n        # pylint: disable=W0212\n        # Why datasets won't support in-place changes?\n        # It's just impossible to extend.\n        self._format_columns = dataset._format_columns\n        self._data = dataset._data\n        self._info = dataset._info\n        self._fingerprint = dataset._fingerprint\n\n    def rename_columns(self, column_mapping: Mapping[str, str], new_fingerprint: str | None = None) -> datasets.Dataset:\n        self.update(super().rename_columns(column_mapping, new_fingerprint=new_fingerprint))\n        self._id_cols = [column_mapping.get(i, i) for i in self.id_cols]\n        self._feature_cols = [column_mapping.get(i, i) for i in self.feature_cols]\n        self._label_cols = [column_mapping.get(i, i) for i in self.label_cols]\n        self._sequence_cols = [column_mapping.get(i, i) for i in self.sequence_cols]\n        self._secondary_structure_cols = [column_mapping.get(i, i) for i in self.secondary_structure_cols]\n        self.tasks = {column_mapping.get(k, k): v for k, v in self.tasks.items()}\n        return self\n\n    def rename_column(\n        self, original_column_name: str, 
new_column_name: str, new_fingerprint: str | None = None\n    ) -> datasets.Dataset:\n        self.update(super().rename_column(original_column_name, new_column_name, new_fingerprint))\n        self._id_cols = [new_column_name if i == original_column_name else i for i in self.id_cols]\n        self._feature_cols = [new_column_name if i == original_column_name else i for i in self.feature_cols]\n        self._label_cols = [new_column_name if i == original_column_name else i for i in self.label_cols]\n        self._sequence_cols = [new_column_name if i == original_column_name else i for i in self.sequence_cols]\n        self._secondary_structure_cols = [\n            new_column_name if i == original_column_name else i for i in self.secondary_structure_cols\n        ]\n        self.tasks = {new_column_name if k == original_column_name else k: v for k, v in self.tasks.items()}\n        return self\n\n    def process_nan(self, data: Table, nan_process: str | None, fill_value: str | int | float = 0) -> Table:\n        if nan_process == \"ignore\":\n            return data\n        data = data.to_pandas()\n        data = data.replace([float(\"inf\"), -float(\"inf\")], float(\"nan\"))\n        if data.isnull().values.any():\n            if nan_process is None or nan_process == \"error\":\n                raise ValueError(\"NaN / inf values have been found in the dataset.\")\n            warn(\n                \"NaN / inf values have been found in the dataset.\\n\"\n                \"While we can handle them, the data type of the corresponding column may be set to float, \"\n                \"which can and very likely will disrupt the auto task recognition.\\n\"\n                \"It is recommended to address these values before loading the dataset.\"\n            )\n            if nan_process == \"drop\":\n                data = data.dropna()\n            elif nan_process == \"fill\":\n                data = data.fillna(fill_value)\n            else:\n                raise 
ValueError(f\"Invalid nan_process: {nan_process}\")\n        return pa.Table.from_pandas(data, preserve_index=False)\n\n    @property\n    def id_cols(self) -> List:\n        return self._id_cols\n\n    @property\n    def data_cols(self) -> List:\n        return self.feature_cols + self.label_cols\n\n    @property\n    def feature_cols(self) -> List:\n        return self._feature_cols\n\n    @property\n    def label_cols(self) -> List:\n        return self._label_cols\n\n    @property\n    def sequence_cols(self) -> List:\n        return self._sequence_cols\n\n    @property\n    def secondary_structure_cols(self) -> List:\n        return self._secondary_structure_cols\n\n    @property\n    def tasks(self) -> NestedDict:\n        if not hasattr(self, \"_tasks\"):\n            self._tasks = NestedDict()\n            return self.infer_tasks()\n        return self._tasks\n\n    @tasks.setter\n    def tasks(self, tasks: Mapping):\n        self._tasks = NestedDict()\n        for name, task in tasks.items():\n            if not isinstance(task, Task):\n                task = Task(**task)\n            self._tasks[name] = task\n\n    @property\n    def discrete_map(self) -> Mapping:\n        if not hasattr(self, \"_discrete_map\"):\n            return self.infer_discrete_map()\n        return self._discrete_map\n
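
The nan_process branches in process_nan above can be illustrated without pandas or pyarrow. The following is a minimal sketch, assuming rows are plain dictionaries. process_rows and is_bad are hypothetical helpers, not part of multimolecule; they only mirror the branching of the library method and check float values only.

```python
import math


def is_bad(value):
    # Treat NaN, +inf and -inf as invalid; non-float values are left alone.
    return isinstance(value, float) and (math.isnan(value) or math.isinf(value))


def process_rows(rows, nan_process='ignore', fill_value=0):
    # Hypothetical helper mirroring the branches of process_nan on a list of dicts.
    if nan_process == 'ignore':
        return rows
    if not any(is_bad(v) for row in rows for v in row.values()):
        return rows
    if nan_process is None or nan_process == 'error':
        raise ValueError('NaN / inf values have been found in the dataset.')
    if nan_process == 'drop':
        # Drop every row that contains at least one invalid value.
        return [row for row in rows if not any(is_bad(v) for v in row.values())]
    if nan_process == 'fill':
        # Replace each invalid value with fill_value and keep all rows.
        return [{k: fill_value if is_bad(v) else v for k, v in row.items()} for row in rows]
    raise ValueError(f'Invalid nan_process: {nan_process}')


rows = [
    {'sequence': 'ACGU', 'label': 0.5},
    {'sequence': 'GGCC', 'label': float('nan')},
    {'sequence': 'UUAA', 'label': float('inf')},
]
dropped = process_rows(rows, nan_process='drop')
filled = process_rows(rows, nan_process='fill', fill_value=0)
print(len(dropped))
print([row['label'] for row in filled])
```

As in the library method, inf values are handled the same way as NaN, and the default option returns the data untouched.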
"},{"location":"data/dataset/#multimolecule.data.Dataset(data)","title":"data","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(split)","title":"split","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(tokenizer)","title":"tokenizer","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(pretrained)","title":"pretrained","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(feature_cols)","title":"feature_cols","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(label_cols)","title":"label_cols","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(id_cols)","title":"id_cols","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(preprocess)","title":"preprocess","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(auto_rename_sequence_col)","title":"auto_rename_sequence_col","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(auto_rename_label_cols)","title":"auto_rename_label_cols","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(column_names_map)","title":"column_names_map","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(truncation)","title":"truncation","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(max_seq_length)","title":"max_seq_length","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(tasks)","title":"tasks","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(discrete_map)","title":"discrete_map","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(nan_process)","title":"nan_process","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(fill_value)","title":"fill_value","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(info)","title":"info","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(indices_table)","title":"indices_table","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(fingerprint)","title":"fin
gerprint","text":""},{"location":"data/dataset/#multimolecule.data.Dataset.post","title":"post","text":"Python
post(tokenizer: PreTrainedTokenizerBase | None = None, pretrained: str | None = None, max_seq_length: int | None = None, truncation: bool | None = None, preprocess: bool | None = None, auto_rename_sequence_col: bool | None = None, auto_rename_label_col: bool | None = None, column_names_map: Mapping[str, str] | None = None) -> None\n

Perform pre-processing steps after initialization.

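It first resolves the tokenizer, loading it from pretrained if no tokenizer is given, and the maximum sequence length, then counts the tokenizer's special tokens (cls, sep, and eos) into seq_length_offset. If auto_rename_sequence_col is True, it renames the sequence column to the standard name; if auto_rename_label_col is True, it does the same for the label column. Each option requires exactly one such column. It then infers the tasks and, if preprocess is True, tokenizes the sequences, converts secondary structures, maps discrete values, and, when truncation is enabled, truncates token-level and contact-level labels. Finally, it sets the transform function.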
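
The auto-rename step can be sketched in isolation. This is a minimal sketch, not the library code: build_column_names_map is a hypothetical helper, and the two constants are assumed values standing in for multimolecule.defaults.SEQUENCE_COL_NAME and multimolecule.defaults.LABEL_COL_NAME, whose actual values are not shown in this page.

```python
SEQUENCE_COL_NAME = 'sequence'  # assumed value, stands in for defaults.SEQUENCE_COL_NAME
LABEL_COL_NAME = 'labels'  # assumed value, stands in for defaults.LABEL_COL_NAME


def build_column_names_map(
    sequence_cols,
    label_cols,
    auto_rename_sequence_col=True,
    auto_rename_label_col=False,
    column_names_map=None,
):
    # Hypothetical helper: start from the user-supplied mapping, if any.
    column_names_map = dict(column_names_map or {})
    if auto_rename_sequence_col:
        if len(sequence_cols) != 1:
            raise ValueError('auto_rename_sequence_col needs exactly one sequence column.')
        column_names_map[sequence_cols[0]] = SEQUENCE_COL_NAME
    if auto_rename_label_col:
        if len(label_cols) != 1:
            raise ValueError('auto_rename_label_col needs exactly one label column.')
        column_names_map[label_cols[0]] = LABEL_COL_NAME
    return column_names_map


mapping = build_column_names_map(['seq'], ['mfe'], auto_rename_label_col=True)
print(mapping)
```

Because the automatic entries are written last, they take precedence over a user-supplied entry for the same column, which matches the order of assignments in post.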

Source code in multimolecule/data/dataset.py Python
def post(\n    self,\n    tokenizer: PreTrainedTokenizerBase | None = None,\n    pretrained: str | None = None,\n    max_seq_length: int | None = None,\n    truncation: bool | None = None,\n    preprocess: bool | None = None,\n    auto_rename_sequence_col: bool | None = None,\n    auto_rename_label_col: bool | None = None,\n    column_names_map: Mapping[str, str] | None = None,\n) -> None:\n    r\"\"\"\n    Perform pre-processing steps after initialization.\n\n    It first identifies the special columns (sequence and structure columns) in the dataset.\n    Then it sets the feature and label columns based on the input arguments.\n    If `auto_rename_sequence_col` is `True`, it will automatically rename the sequence column.\n    If `auto_rename_label_col` is `True`, it will automatically rename the label column.\n    Finally, it sets the [`transform`][datasets.Dataset.set_transform] function based on the `preprocess` flag.\n    \"\"\"\n    if tokenizer is None:\n        if pretrained is None:\n            raise ValueError(\"tokenizer and pretrained can not be both None.\")\n        tokenizer = AutoTokenizer.from_pretrained(pretrained)\n    if max_seq_length is None:\n        max_seq_length = tokenizer.model_max_length\n    else:\n        tokenizer.model_max_length = max_seq_length\n    self.tokenizer = tokenizer\n    self.max_seq_length = max_seq_length\n    if truncation is not None:\n        self.truncation = truncation\n    if self.tokenizer.cls_token is not None:\n        self.seq_length_offset += 1\n    if self.tokenizer.sep_token is not None and self.tokenizer.sep_token != self.tokenizer.eos_token:\n        self.seq_length_offset += 1\n    if self.tokenizer.eos_token is not None:\n        self.seq_length_offset += 1\n    if preprocess is not None:\n        self.preprocess = preprocess\n    if auto_rename_sequence_col is not None:\n        self.auto_rename_sequence_col = auto_rename_sequence_col\n    if auto_rename_label_col is not None:\n        
self.auto_rename_label_col = auto_rename_label_col\n    if column_names_map is None:\n        column_names_map = {}\n    if self.auto_rename_sequence_col:\n        if len(self.sequence_cols) != 1:\n            raise ValueError(\"auto_rename_sequence_col can only be used when there is exactly one sequence column.\")\n        column_names_map[self.sequence_cols[0]] = defaults.SEQUENCE_COL_NAME  # type: ignore[index]\n    if self.auto_rename_label_col:\n        if len(self.label_cols) != 1:\n            raise ValueError(\"auto_rename_label_col can only be used when there is exactly one label column.\")\n        column_names_map[self.label_cols[0]] = defaults.LABEL_COL_NAME  # type: ignore[index]\n    self.column_names_map = column_names_map\n    if self.column_names_map:\n        self.rename_columns(self.column_names_map)\n    self.infer_tasks()\n\n    if self.preprocess:\n        self.update(self.map(self.tokenization))\n        if self.secondary_structure_cols:\n            self.update(self.map(self.convert_secondary_structure))\n        if self.discrete_map:\n            self.update(self.map(self.map_discrete))\n        fn_kwargs = {\n            \"columns\": [name for name, task in self.tasks.items() if task.level in [\"token\", \"contact\"]],\n            \"max_seq_length\": self.max_seq_length - self.seq_length_offset,\n        }\n        if self.truncation and 0 < self.max_seq_length < 2**32:\n            self.update(self.map(self.truncate, fn_kwargs=fn_kwargs))\n    self.set_transform(self.transform)\n
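The special-token bookkeeping in `post` can be looked at in isolation. The snippet below is a minimal sketch, assuming the offset starts at zero and using a hypothetical `FakeTokenizer` stand-in (not a `transformers` class) that exposes only the attributes the checks read. It mirrors the three conditions above to show how many positions are reserved before truncation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTokenizer:
    """Hypothetical stand-in exposing only the attributes the checks read."""

    cls_token: Optional[str] = None
    sep_token: Optional[str] = None
    eos_token: Optional[str] = None
    model_max_length: int = 512


def special_token_offset(tokenizer: FakeTokenizer) -> int:
    """Count positions reserved for special tokens, mirroring the checks in `post`."""
    offset = 0
    if tokenizer.cls_token is not None:
        offset += 1
    # A separator only costs an extra position when it differs from EOS.
    if tokenizer.sep_token is not None and tokenizer.sep_token != tokenizer.eos_token:
        offset += 1
    if tokenizer.eos_token is not None:
        offset += 1
    return offset


tok = FakeTokenizer(cls_token="<cls>", eos_token="<eos>")
# Number of positions left for the sequence itself after reserving special tokens.
budget = tok.model_max_length - special_token_offset(tok)
```

With a CLS and an EOS token, two positions are reserved, which is why the truncation step uses `max_seq_length - seq_length_offset` rather than `max_seq_length` directly.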
"},{"location":"data/dataset/#multimolecule.data.Dataset.transform","title":"transform","text":"Python
transform(batch: Mapping) -> Mapping\n

Default transform.

See Also

collate

Source code in multimolecule/data/dataset.py Python
def transform(self, batch: Mapping) -> Mapping:\n    r\"\"\"\n    Default [`transform`][datasets.Dataset.set_transform].\n\n    See Also:\n        [`collate`][multimolecule.Dataset.collate]\n    \"\"\"\n    return {k: self.collate(k, v) for k, v in batch.items() if k not in self.ignored_cols}\n
"},{"location":"data/dataset/#multimolecule.data.Dataset.collate","title":"collate","text":"Python
collate(col: str, data: Any) -> Tensor | NestedTensor | None\n

Collate the data for a column.

If the column is a sequence column, it will tokenize the data if tokenize is True. Otherwise, it will return a tensor or nested tensor.

Source code in multimolecule/data/dataset.py Python
def collate(self, col: str, data: Any) -> Tensor | NestedTensor | None:\n    r\"\"\"\n    Collate the data for a column.\n\n    If the column is a sequence column, it will tokenize the data if `tokenize` is `True`.\n    Otherwise, it will return a tensor or nested tensor.\n    \"\"\"\n    if col in self.sequence_cols:\n        if isinstance(data[0], str):\n            data = self.tokenize(data)\n        return NestedTensor(data)\n    if not self.preprocess:\n        if col in self.discrete_map:\n            data = map_value(data, self.discrete_map[col])\n        if col in self.tasks:\n            data = truncate_value(data, self.max_seq_length - self.seq_length_offset, self.tasks[col].level)\n    if isinstance(data[0], str):\n        return data\n    try:\n        return torch.tensor(data)\n    except ValueError:\n        return NestedTensor(data)\n
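The last step of `collate` tries a dense tensor first and falls back to a nested tensor when the rows cannot be stacked. A dependency-free sketch of that decision is shown below. It assumes a hypothetical `pack_column` helper and uses plain lists in place of `torch.tensor` and `NestedTensor`, so it illustrates the idea rather than the library's behavior.

```python
from typing import Any, List, Sequence, Tuple


def pack_column(values: Sequence[Any]) -> Tuple[str, List[Any]]:
    """Label a column "dense" when every row has the same shape, else "ragged".

    Only the first nesting level is inspected; scalars are treated as
    rows without a length.
    """
    lengths = {len(v) if isinstance(v, (list, tuple)) else None for v in values}
    kind = "dense" if len(lengths) == 1 else "ragged"
    return kind, list(values)


# Uniform rows could be stacked into a single rectangular tensor.
dense_kind, _ = pack_column([[1, 2], [3, 4]])
# Rows of different lengths need a ragged (nested) container instead.
ragged_kind, _ = pack_column([[1, 2], [3]])
```

In the real method the same split is reached by attempting the dense conversion and catching the `ValueError`, which avoids inspecting the data twice.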
"},{"location":"data/dataset/#multimolecule.data.Dataset.update","title":"update","text":"Python
update(dataset: Dataset)\n

Perform an in-place update of the dataset.

This method is used to update the dataset after changes have been made to the underlying data. It updates the format columns, data, info, and fingerprint of the dataset.

Source code in multimolecule/data/dataset.py Python
def update(self, dataset: datasets.Dataset):\n    r\"\"\"\n    Perform an in-place update of the dataset.\n\n    This method is used to update the dataset after changes have been made to the underlying data.\n    It updates the format columns, data, info, and fingerprint of the dataset.\n    \"\"\"\n    # pylint: disable=W0212\n    # Why datasets won't support in-place changes?\n    # It's just impossible to extend.\n    self._format_columns = dataset._format_columns\n    self._data = dataset._data\n    self._info = dataset._info\n    self._fingerprint = dataset._fingerprint\n
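The pattern behind `update` is to take the object returned by a non-mutating operation and copy its internal state onto the existing instance. The sketch below illustrates this with a hypothetical `Table` class. The class and its attributes are assumptions made for the example and are not part of `datasets`.

```python
class Table:
    """Hypothetical container whose `map` returns a new object."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._fingerprint = hash(tuple(self._rows))

    def map(self, fn):
        # Non-mutating: builds and returns a fresh Table.
        return Table(fn(row) for row in self._rows)

    def update(self, other):
        # Adopt the other object's state so existing references see the change.
        self._rows = other._rows
        self._fingerprint = other._fingerprint


table = Table([1, 2, 3])
alias = table  # a second reference to the same object
table.update(table.map(lambda row: row * 2))
```

Because the state is copied onto the same instance, `alias` observes the doubled rows without being reassigned.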
"},{"location":"datasets/","title":"datasets","text":"

datasets provides a collection of widely used datasets.

"},{"location":"datasets/#available-datasets","title":"Available Datasets","text":""},{"location":"datasets/#deoxyribonucleic-acid-dna","title":"DeoxyriboNucleic Acid (DNA)","text":""},{"location":"datasets/#ribonucleic-acid-rna","title":"RiboNucleic Acid (RNA)","text":""},{"location":"datasets/#usage","title":"Usage","text":""},{"location":"datasets/#load-with-multimolecule","title":"Load with MultiMolecule","text":"Python
from multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"datasets/archiveii/","title":"ArchiveII","text":"

ArchiveII is a dataset of RNA sequences and their secondary structures, widely used in RNA secondary structure prediction benchmarks.

ArchiveII contains 2975 RNA samples across 10 RNA families, with sequence lengths ranging from 28 to 2968 nucleotides. This dataset is frequently used to evaluate RNA secondary structure prediction methods, including those that handle both pseudoknotted and non-pseudoknotted structures.

It is considered complementary to the RNAStrAlign dataset.

"},{"location":"datasets/archiveii/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the ArchiveII by Mehdi Saman Booy, et al.

The team releasing ArchiveII did not write a dataset card for it, so this dataset card was written by the MultiMolecule team.

"},{"location":"datasets/archiveii/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/archiveii/#example-entry","title":"Example Entry","text":"id sequence secondary_structure family 16S_rRNA-A.fulgidus AUUCUGGUUGAUCCUGCCAGAGGCCGCUGCUA\u2026 \u2026(((((\u2026(((.))))).((((((((((.... 16S_rRNA"},{"location":"datasets/archiveii/#column-description","title":"Column Description","text":""},{"location":"datasets/archiveii/#variations","title":"Variations","text":"

This dataset is available in two additional variants:

"},{"location":"datasets/archiveii/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/archiveii/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/archiveii/#citation","title":"Citation","text":"BibTeX
@article{samanbooy2022rna,\n  author    = {Saman Booy, Mehdi and Ilin, Alexander and Orponen, Pekka},\n  journal   = {BMC Bioinformatics},\n  keywords  = {Deep learning; Pseudoknotted structures; RNA structure prediction},\n  month     = feb,\n  number    = 1,\n  pages     = {58},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {{RNA} secondary structure prediction with convolutional neural networks},\n  volume    = 23,\n  year      = 2022\n}\n
"},{"location":"datasets/bprna-new/","title":"bpRNA-1m","text":"

bpRNA-new is a database of single molecule secondary structures annotated using bpRNA.

bpRNA-new is a dataset of RNA families from Rfam 14.2, designed for cross-family validation to assess generalization capability. It focuses on families distinct from those in bpRNA-1m, providing a robust benchmark for evaluating model performance on unseen RNA families.

"},{"location":"datasets/bprna-new/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the bpRNA-new by Kengo Sato, et al.

The team releasing bpRNA-new did not write a dataset card for it, so this dataset card was written by the MultiMolecule team.

"},{"location":"datasets/bprna-new/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/bprna-new/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/bprna-new/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/bprna-new/#citation","title":"Citation","text":"BibTeX
@article{sato2021rna,\n  author    = {Sato, Kengo and Akiyama, Manato and Sakakibara, Yasubumi},\n  journal   = {Nature Communications},\n  month     = feb,\n  number    = 1,\n  pages     = {941},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {{RNA} secondary structure prediction using deep learning with thermodynamic integration},\n  volume    = 12,\n  year      = 2021\n}\n
"},{"location":"datasets/bprna-spot/","title":"bpRNA-1m","text":"

bpRNA-spot is a database of single molecule secondary structures annotated using bpRNA.

bpRNA-spot is a subset of bpRNA-1m. It applies CD-HIT (CD-HIT-EST) to remove sequences with more than 80% sequence similarity from bpRNA-1m. The remaining sequences are then randomly split into training, validation, and test sets with a ratio of approximately 8:1:1.
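A random split with an approximate 8:1:1 ratio can be sketched as follows. This is only an illustration: the `split_8_1_1` helper and the seed are assumptions, the redundancy-removal step with CD-HIT is not reproduced, and the original authors' exact procedure may differ.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_8_1_1(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the items and split them into train/validation/test at roughly 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_8_1_1(range(100))
```

Any remainder from the integer division ends up in the test set, which is one reason such splits are only approximately 8:1:1.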

"},{"location":"datasets/bprna-spot/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the bpRNA-spot by Jaswinder Singh, et al.

The team releasing bpRNA-spot did not write a dataset card for it, so this dataset card was written by the MultiMolecule team.

"},{"location":"datasets/bprna-spot/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/bprna-spot/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/bprna-spot/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/bprna-spot/#citation","title":"Citation","text":"BibTeX
@article{singh2019rna,\n  author    = {Singh, Jaswinder and Hanson, Jack and Paliwal, Kuldip and Zhou, Yaoqi},\n  journal   = {Nature Communications},\n  month     = nov,\n  number    = 1,\n  pages     = {5407},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {{RNA} secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning},\n  volume    = 10,\n  year      = 2019\n}\n\n@article{darty2009varna,\n  author    = {Darty, K{\\'e}vin and Denise, Alain and Ponty, Yann},\n  journal   = {Bioinformatics},\n  month     = aug,\n  number    = 15,\n  pages     = {1974--1975},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{VARNA}: Interactive drawing and editing of the {RNA} secondary structure},\n  volume    = 25,\n  year      = 2009\n}\n\n@article{berman2000protein,\n  author    = {Berman, H M and Westbrook, J and Feng, Z and Gilliland, G and Bhat, T N and Weissig, H and Shindyalov, I N and Bourne, P E},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = 1,\n  pages     = {235--242},\n  publisher = {Oxford University Press (OUP)},\n  title     = {The Protein Data Bank},\n  volume    = 28,\n  year      = 2000\n}\n
"},{"location":"datasets/bprna/","title":"bpRNA-1m","text":"

bpRNA-1m is a database of single molecule secondary structures annotated using bpRNA.

"},{"location":"datasets/bprna/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the bpRNA-1m by the Center for Quantitative Life Sciences at Oregon State University.

The team releasing bpRNA-1m did not write a dataset card for it, so this dataset card was written by the MultiMolecule team.

"},{"location":"datasets/bprna/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/bprna/#example-entry","title":"Example Entry","text":"id sequence secondary_structure structural_annotation functional_annotation bpRNA_RFAM_1016 AUUGCUUCUCGGCCUUUUGGCUAACAUCAAGU\u2026 ......(((.((((....)))).)))...... EEEEEESSSISSSSHHHHSSSSISSSXXXXXX\u2026 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\u2026"},{"location":"datasets/bprna/#column-description","title":"Column Description","text":"

The converted dataset consists of the following columns, each providing specific information about the RNA secondary structures, consistent with the bpRNA standard:

"},{"location":"datasets/bprna/#variations","title":"Variations","text":"

This dataset is available in two variants:

"},{"location":"datasets/bprna/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/bprna/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/bprna/#citation","title":"Citation","text":"BibTeX
@article{danaee2018bprna,\n  author  = {Danaee, Padideh and Rouches, Mason and Wiley, Michelle and Deng, Dezhong and Huang, Liang and Hendrix, David},\n  journal = {Nucleic Acids Research},\n  month   = jun,\n  number  = 11,\n  pages   = {5381--5394},\n  title   = {{bpRNA}: large-scale automated annotation and analysis of {RNA} secondary structure},\n  volume  = 46,\n  year    = 2018\n}\n\n@article{cannone2002comparative,\n  author    = {Cannone, Jamie J and Subramanian, Sankar and Schnare, Murray N and Collett, James R and D'Souza, Lisa M and Du, Yushi and Feng, Brian and Lin, Nan and Madabusi, Lakshmi V and M{\\\"u}ller, Kirsten M and Pande, Nupur and Shang, Zhidi and Yu, Nan and Gutell, Robin R},\n  copyright = {https://www.springernature.com/gp/researchers/text-and-data-mining},\n  journal   = {BMC Bioinformatics},\n  month     = jan,\n  number    = 1,\n  pages     = {2},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {The comparative {RNA} web ({CRW}) site: an online database of comparative sequence and structure information for ribosomal, intron, and other {RNAs}},\n  volume    = 3,\n  year      = 2002\n}\n\n@article{zwieb2003tmrdb,\n  author    = {Zwieb, Christian and Gorodkin, Jan and Knudsen, Bjarne and Burks, Jody and Wower, Jacek},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = 1,\n  pages     = {446--447},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{tmRDB} ({tmRNA} database)},\n  volume    = 31,\n  year      = 2003\n}\n\n@article{rosenblad2003srpdb,\n  author    = {Rosenblad, Magnus Alm and Gorodkin, Jan and Knudsen, Bjarne and Zwieb, Christian and Samuelsson, Tore},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = 1,\n  pages     = {363--364},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{SRPDB}: Signal Recognition Particle Database},\n  volume    = 31,\n  year      = 2003\n}\n\n@article{sprinzl2005compilation,\n  author    = 
{Sprinzl, Mathias and Vassilenko, Konstantin S},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {Database issue},\n  pages     = {D139--40},\n  publisher = {Oxford University Press (OUP)},\n  title     = {Compilation of {tRNA} sequences and sequences of {tRNA} genes},\n  volume    = 33,\n  year      = 2005\n}\n\n@article{brown1994ribonuclease,\n  author    = {Brown, J W and Haas, E S and Gilbert, D G and Pace, N R},\n  journal   = {Nucleic Acids Research},\n  month     = sep,\n  number    = 17,\n  pages     = {3660--3662},\n  publisher = {Oxford University Press (OUP)},\n  title     = {The Ribonuclease {P} database},\n  volume    = 22,\n  year      = 1994\n}\n\n@article{griffiths2003rfam,\n  author    = {Griffiths-Jones, Sam and Bateman, Alex and Marshall, Mhairi and Khanna, Ajay and Eddy, Sean R},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = 1,\n  pages     = {439--441},\n  publisher = {Oxford University Press (OUP)},\n  title     = {Rfam: an {RNA} family database},\n  volume    = 31,\n  year      = 2003\n}\n\n@article{berman2000protein,\n  author    = {Berman, H M and Westbrook, J and Feng, Z and Gilliland, G and Bhat, T N and Weissig, H and Shindyalov, I N and Bourne, P E},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = 1,\n  pages     = {235--242},\n  publisher = {Oxford University Press (OUP)},\n  title     = {The Protein Data Bank},\n  volume    = 28,\n  year      = 2000\n}\n
"},{"location":"datasets/eternabench-cm/","title":"EternaBench-CM","text":"

EternaBench-CM is a synthetic RNA dataset comprising 12,711 RNA constructs that have been chemically mapped using SHAPE and MAP-seq methods. These RNA sequences are probed to obtain experimental data on their nucleotide reactivity, which indicates whether specific regions of the RNA are flexible or structured. The dataset provides high-resolution, large-scale data that can be used for studying RNA folding and stability.

"},{"location":"datasets/eternabench-cm/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the EternaBench-CM by Hannah K. Wayment-Steele, et al.

The team releasing EternaBench-CM did not write a dataset card for it, so this dataset card was written by the MultiMolecule team.

"},{"location":"datasets/eternabench-cm/#dataset-description","title":"Dataset Description","text":"

The dataset includes a large set of synthetic RNA sequences with experimental chemical mapping data, which provides a quantitative readout of RNA nucleotide reactivity. These data are ensemble-averaged and serve as a critical benchmark for evaluating secondary structure prediction algorithms in their ability to model RNA folding dynamics.

"},{"location":"datasets/eternabench-cm/#example-entry","title":"Example Entry","text":"index design sequence secondary_structure reactivity errors signal_to_noise 769337-1 d+m plots weaker again GGAAAAAAAAAAA\u2026 ................ [0.642,1.4853,0.1629, \u2026] [0.3181,0.4221,0.1823, \u2026] 3.227"},{"location":"datasets/eternabench-cm/#column-description","title":"Column Description","text":""},{"location":"datasets/eternabench-cm/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/eternabench-cm/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/eternabench-cm/#citation","title":"Citation","text":"BibTeX
@article{waymentsteele2022rna,\n  author    = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},\n  journal   = {Nature Methods},\n  month     = oct,\n  number    = 10,\n  pages     = {1234--1242},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},\n  volume    = 19,\n  year      = 2022\n}\n
"},{"location":"datasets/eternabench-external/","title":"EternaBench-External","text":"

EternaBench-External consists of 31 independent RNA datasets from various biological sources, including viral genomes, mRNAs, and synthetic RNAs. These sequences were probed using techniques such as SHAPE-CE, SHAPE-MaP, and DMS-MaP-seq to understand RNA secondary structures under different experimental and biological conditions. This dataset serves as a benchmark for evaluating RNA structure prediction models, with a particular focus on generalization to natural RNA molecules.

"},{"location":"datasets/eternabench-external/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the EternaBench-External by Hannah K. Wayment-Steele, et al.

The team releasing EternaBench-External did not write a dataset card for it, so this dataset card was written by the MultiMolecule team.

"},{"location":"datasets/eternabench-external/#dataset-description","title":"Dataset Description","text":"

This dataset includes RNA sequences from various biological origins, including viral genomes and mRNAs, and covers a wide range of probing methods like SHAPE-CE and icSHAPE. Each dataset entry provides sequence information, reactivity profiles, and RNA secondary structure data. This dataset can be used to examine how RNA structures vary under different conditions and to validate structural predictions for diverse RNA types.

"},{"location":"datasets/eternabench-external/#example-entry","title":"Example Entry","text":"name sequence reactivity seqpos class dataset Dadonaite,2019 Influenza genome SHAPE(1M7) SSII-Mn(2+) Mut. TTTACCCACAGCTGTGAATT\u2026 [0.639309,0.813297,0.622869,\u2026] [7425,7426,7427,\u2026] viral_gRNA Dadonaite,2019"},{"location":"datasets/eternabench-external/#column-description","title":"Column Description","text":""},{"location":"datasets/eternabench-external/#variations","title":"Variations","text":"

This dataset is available in four variants:

"},{"location":"datasets/eternabench-external/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/eternabench-external/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/eternabench-external/#citation","title":"Citation","text":"BibTeX
@article{waymentsteele2022rna,\n  author    = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},\n  journal   = {Nature Methods},\n  month     = oct,\n  number    = 10,\n  pages     = {1234--1242},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},\n  volume    = 19,\n  year      = 2022\n}\n
"},{"location":"datasets/eternabench-switch/","title":"EternaBench-Switch","text":"

EternaBench-Switch is a synthetic RNA dataset consisting of 7,228 riboswitch constructs, designed to explore the structural behavior of RNA molecules that change conformation upon binding to ligands such as FMN, theophylline, or tryptophan. These riboswitches exhibit different structural states in the presence or absence of their ligands, and the dataset includes detailed measurements of binding affinities (dissociation constants), activation ratios, and RNA folding properties.

"},{"location":"datasets/eternabench-switch/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the EternaBench-Switch by Hannah K. Wayment-Steele, et al.

The team releasing EternaBench-Switch did not write a dataset card for it, so this dataset card was written by the MultiMolecule team.

"},{"location":"datasets/eternabench-switch/#dataset-description","title":"Dataset Description","text":"

The dataset includes synthetic RNA sequences designed to act as riboswitches. These molecules can adopt different structural states in response to ligand binding, and the dataset provides detailed information on the binding affinities for various ligands, along with metrics on the RNA\u2019s ability to switch between conformations. With over 7,000 entries, this dataset is highly useful for studying RNA folding, ligand interaction, and RNA structural dynamics.

"},{"location":"datasets/eternabench-switch/#example-entry","title":"Example Entry","text":"id design sequence activation_ratio ligand switch kd_off kd_on kd_fmn kd_no_fmn min_kd_val ms2_aptamer lig_aptamer ms2_lig_aptamer log_kd_nolig log_kd_lig log_kd_nolig_scaled log_kd_lig_scaled log_AR folding_subscore num_clusters 286 null AGGAAACAUGAGGAU\u2026 0.8824621522 FMN OFF 13.3115 15.084 null null 3.0082 .....(((((x((xxxx)))))))..... .................. .....(((((x((xx\u2026 2.7137 2.5886 1.6123 1.4873 -0.125 null null"},{"location":"datasets/eternabench-switch/#column-description","title":"Column Description","text":""},{"location":"datasets/eternabench-switch/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/eternabench-switch/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/eternabench-switch/#citation","title":"Citation","text":"BibTeX
@article{waymentsteele2022rna,\n  author    = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},\n  journal   = {Nature Methods},\n  month     = oct,\n  number    = 10,\n  pages     = {1234--1242},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},\n  volume    = 19,\n  year      = 2022\n}\n
"},{"location":"datasets/gencode/","title":"GENCODE","text":"

GENCODE is a comprehensive annotation project that aims to provide high-quality annotations of the human and mouse genomes. The project is part of the ENCODE (ENCyclopedia Of DNA Elements) scale-up project, which seeks to identify all functional elements in the human genome.

"},{"location":"datasets/gencode/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the GENCODE by Paul Flicek, Roderic Guigo, Manolis Kellis, Mark Gerstein, Benedict Paten, Michael Tress, Jyoti Choudhary, et al.

The team releasing GENCODE did not write a dataset card for it, so this dataset card was written by the MultiMolecule team.

"},{"location":"datasets/gencode/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/gencode/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/gencode/#datasets","title":"Datasets","text":"

The GENCODE dataset is available in Human and Mouse:

"},{"location":"datasets/gencode/#citation","title":"Citation","text":"BibTeX
@article{frankish2023gencode,\n  author    = {Frankish, Adam and Carbonell-Sala, S{\\'\\i}lvia and Diekhans, Mark and Jungreis, Irwin and Loveland, Jane E and Mudge, Jonathan M and Sisu, Cristina and Wright, James C and Arnan, Carme and Barnes, If and Banerjee, Abhimanyu and Bennett, Ruth and Berry, Andrew and Bignell, Alexandra and Boix, Carles and Calvet, Ferriol and Cerd{\\'a}n-V{\\'e}lez, Daniel and Cunningham, Fiona and Davidson, Claire and Donaldson, Sarah and Dursun, Cagatay and Fatima, Reham and Giorgetti, Stefano and Giron, Carlos Garc{\\i}a and Gonzalez, Jose Manuel and Hardy, Matthew and Harrison, Peter W and Hourlier, Thibaut and Hollis, Zoe and Hunt, Toby and James, Benjamin and Jiang, Yunzhe and Johnson, Rory and Kay, Mike and Lagarde, Julien and Martin, Fergal J and G{\\'o}mez, Laura Mart{\\'\\i}nez and Nair, Surag and Ni, Pengyu and Pozo, Fernando and Ramalingam, Vivek and Ruffier, Magali and Schmitt, Bianca M and Schreiber, Jacob M and Steed, Emily and Suner, Marie-Marthe and Sumathipala, Dulika and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and Wass, Elizabeth and Yang, Yucheng T and Yates, Andrew and Zafrulla, Zahoor and Choudhary, Jyoti S and Gerstein, Mark and Guigo, Roderic and Hubbard, Tim J P and Kellis, Manolis and Kundaje, Anshul and Paten, Benedict and Tress, Michael L and Flicek, Paul},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {D1},\n  pages     = {D942--D949},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{GENCODE}: reference annotation for the human and mouse genomes in 2023},\n  volume    = 51,\n  year      = 2023\n}\n\n@article{frankish2021gencode,\n  author    = {Frankish, Adam and Diekhans, Mark and Jungreis, Irwin and Lagarde, Julien and Loveland, Jane E and Mudge, Jonathan M and Sisu, Cristina and Wright, James C and Armstrong, Joel and Barnes, If and Berry, Andrew and Bignell, Alexandra and Boix, Carles and Carbonell Sala, Silvia and Cunningham, Fiona and Di Domenico, 
Tom{\\'a}s and Donaldson, Sarah and Fiddes, Ian T and Garc{\\'\\i}a Gir{\\'o}n, Carlos and Gonzalez, Jose Manuel and Grego, Tiago and Hardy, Matthew and Hourlier, Thibaut and Howe, Kevin L and Hunt, Toby and Izuogu, Osagie G and Johnson, Rory and Martin, Fergal J and Mart{\\'\\i}nez, Laura and Mohanan, Shamika and Muir, Paul and Navarro, Fabio C P and Parker, Anne and Pei, Baikang and Pozo, Fernando and Riera, Ferriol Calvet and Ruffier, Magali and Schmitt, Bianca M and Stapleton, Eloise and Suner, Marie-Marthe and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and Wolf, Maxim Y and Xu, Jinuri and Yang, Yucheng T and Yates, Andrew and Zerbino, Daniel and Zhang, Yan and Choudhary, Jyoti S and Gerstein, Mark and Guig{\\'o}, Roderic and Hubbard, Tim J P and Kellis, Manolis and Paten, Benedict and Tress, Michael L and Flicek, Paul},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {D1},\n  pages     = {D916--D923},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{GENCODE} 2021},\n  volume    = 49,\n  year      = 2021\n}\n\n@article{frankish2019gencode,\n  author    = {Frankish, Adam and Diekhans, Mark and Ferreira, Anne-Maud and Johnson, Rory and Jungreis, Irwin and Loveland, Jane and Mudge, Jonathan M and Sisu, Cristina and Wright, James and Armstrong, Joel and Barnes, If and Berry, Andrew and Bignell, Alexandra and Carbonell Sala, Silvia and Chrast, Jacqueline and Cunningham, Fiona and Di Domenico, Tom{\\'a}s and Donaldson, Sarah and Fiddes, Ian T and Garc{\\'\\i}a Gir{\\'o}n, Carlos and Gonzalez, Jose Manuel and Grego, Tiago and Hardy, Matthew and Hourlier, Thibaut and Hunt, Toby and Izuogu, Osagie G and Lagarde, Julien and Martin, Fergal J and Mart{\\'\\i}nez, Laura and Mohanan, Shamika and Muir, Paul and Navarro, Fabio C P and Parker, Anne and Pei, Baikang and Pozo, Fernando and Ruffier, Magali and Schmitt, Bianca M and Stapleton, Eloise and Suner, Marie-Marthe and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and 
Xu, Jinuri and Yates, Andrew and Zerbino, Daniel and Zhang, Yan and Aken, Bronwen and Choudhary, Jyoti S and Gerstein, Mark and Guig{\\'o}, Roderic and Hubbard, Tim J P and Kellis, Manolis and Paten, Benedict and Reymond, Alexandre and Tress, Michael L and Flicek, Paul},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {D1},\n  pages     = {D766--D773},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{GENCODE} reference annotation for the human and mouse genomes},\n  volume    = 47,\n  year      = 2019\n}\n\n@article{mudge2015creating,\n  author    = {Mudge, Jonathan M and Harrow, Jennifer},\n  copyright = {https://creativecommons.org/licenses/by/4.0},\n  journal   = {Mamm. Genome},\n  language  = {en},\n  month     = oct,\n  number    = {9-10},\n  pages     = {366--378},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {Creating reference gene annotation for the mouse {C57BL6/J} genome assembly},\n  volume    = 26,\n  year      = 2015\n}\n\n@article{harrow2012gencode,\n  author   = {Harrow, Jennifer and Frankish, Adam and Gonzalez, Jose M and Tapanari, Electra and Diekhans, Mark and Kokocinski, Felix and Aken, Bronwen L and Barrell, Daniel and Zadissa, Amonida and Searle, Stephen and Barnes, If and Bignell, Alexandra and Boychenko, Veronika and Hunt, Toby and Kay, Mike and Mukherjee, Gaurab and Rajan, Jeena and Despacio-Reyes, Gloria and Saunders, Gary and Steward, Charles and Harte, Rachel and Lin, Michael and Howald, C{\\'e}dric and Tanzer, Andrea and Derrien, Thomas and Chrast, Jacqueline and Walters, Nathalie and Balasubramanian, Suganthi and Pei, Baikang and Tress, Michael and Rodriguez, Jose Manuel and Ezkurdia, Iakes and van Baren, Jeltje and Brent, Michael and Haussler, David and Kellis, Manolis and Valencia, Alfonso and Reymond, Alexandre and Gerstein, Mark and Guig{\\'o}, Roderic and Hubbard, Tim J},\n  journal  = {Genome Research},\n  month    = sep,\n  number   = 9,\n  pages    = 
{1760--1774},\n  title    = {{GENCODE}: the reference human genome annotation for The {ENCODE} Project},\n  volume   = 22,\n  year     = 2012\n}\n\n@article{harrow2006gencode,\n  author    = {Harrow, Jennifer and Denoeud, France and Frankish, Adam and  Reymond, Alexandre and Chen, Chao-Kung and Chrast, Jacqueline  and Lagarde, Julien and Gilbert, James G R and Storey, Roy and  Swarbreck, David and Rossier, Colette and Ucla, Catherine and  Hubbard, Tim and Antonarakis, Stylianos E and Guigo, Roderic},\n  journal   = {Genome Biology},\n  month     = aug,\n  number    = {Suppl 1},\n  pages     = {S4.1--9},\n  publisher = {Springer Nature},\n  title     = {{GENCODE}: producing a reference annotation for {ENCODE}},\n  volume    = {7 Suppl 1},\n  year      = 2006\n}\n
"},{"location":"datasets/rfam/","title":"Rfam","text":"

Rfam is a database of structure-annotated multiple sequence alignments, covariance models and family annotation for a number of non-coding RNA, cis-regulatory and self-splicing intron families.

The seed alignments are hand curated and aligned using available sequence and structure data, and covariance models are built from these alignments using the INFERNAL v1.1.4 software suite.

The full regions list is created by searching the RFAMSEQ database using the covariance model, and then listing all hits above a family specific threshold to the model.

Rfam is maintained by a consortium of researchers at the European Bioinformatics Institute, Sean Eddy\u2019s laboratory and Eric Nawrocki.

"},{"location":"datasets/rfam/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the Rfam by Ioanna Kalvari, Eric P. Nawrocki, Sarah W. Burge, Paul P Gardner, Sam Griffiths-Jones, et al.

The team releasing Rfam did not write a dataset card for this dataset, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/rfam/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/rfam/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n

Tip

The original Rfam dataset is licensed under the CC0 1.0 Universal license and is available at Rfam.

"},{"location":"datasets/rfam/#citation","title":"Citation","text":"BibTeX
@article{kalvari2021rfam,\n  author    = {Kalvari, Ioanna and Nawrocki, Eric P and Ontiveros-Palacios, Nancy and Argasinska, Joanna and Lamkiewicz, Kevin and Marz, Manja and Griffiths-Jones, Sam and Toffano-Nioche, Claire and Gautheret, Daniel and Weinberg, Zasha and Rivas, Elena and Eddy, Sean R and Finn, Robert D and Bateman, Alex and Petrov, Anton I},\n  copyright = {http://creativecommons.org/licenses/by/4.0/},\n  journal   = {Nucleic Acids Research},\n  language  = {en},\n  month     = jan,\n  number    = {D1},\n  pages     = {D192--D200},\n  publisher = {Oxford University Press (OUP)},\n  title     = {Rfam 14: expanded coverage of metagenomic, viral and {microRNA} families},\n  volume    = 49,\n  year      = 2021\n}\n\n@article{hufsky2021computational,\n  author    = {Hufsky, Franziska and Lamkiewicz, Kevin and Almeida, Alexandre and Aouacheria, Abdel and Arighi, Cecilia and Bateman, Alex and Baumbach, Jan and Beerenwinkel, Niko and Brandt, Christian and Cacciabue, Marco and Chuguransky, Sara and Drechsel, Oliver and Finn, Robert D and Fritz, Adrian and Fuchs, Stephan and Hattab, Georges and Hauschild, Anne-Christin and Heider, Dominik and Hoffmann, Marie and H{\\\"o}lzer, Martin and Hoops, Stefan and Kaderali, Lars and Kalvari, Ioanna and von Kleist, Max and Kmiecinski, Ren{\\'o} and K{\\\"u}hnert, Denise and Lasso, Gorka and Libin, Pieter and List, Markus and L{\\\"o}chel, Hannah F and Martin, Maria J and Martin, Roman and Matschinske, Julian and McHardy, Alice C and Mendes, Pedro and Mistry, Jaina and Navratil, Vincent and Nawrocki, Eric P and O'Toole, {\\'A}ine Niamh and Ontiveros-Palacios, Nancy and Petrov, Anton I and Rangel-Pineros, Guillermo and Redaschi, Nicole and Reimering, Susanne and Reinert, Knut and Reyes, Alejandro and Richardson, Lorna and Robertson, David L and Sadegh, Sepideh and Singer, Joshua B and Theys, Kristof and Upton, Chris and Welzel, Marius and Williams, Lowri and Marz, Manja},\n  copyright = 
{http://creativecommons.org/licenses/by/4.0/},\n  journal   = {Briefings in Bioinformatics},\n  month     = mar,\n  number    = 2,\n  pages     = {642--663},\n  publisher = {Oxford University Press (OUP)},\n  title     = {Computational strategies to combat {COVID-19}: useful tools to accelerate {SARS-CoV-2} and coronavirus research},\n  volume    = 22,\n  year      = 2021\n}\n\n@article{kalvari2018noncoding,\n  author  = {Kalvari, Ioanna and Nawrocki, Eric P and Argasinska, Joanna and Quinones-Olvera, Natalia and Finn, Robert D and Bateman, Alex and Petrov, Anton I},\n  journal = {Current Protocols in Bioinformatics},\n  month   = jun,\n  number  = 1,\n  pages   = {e51},\n  title   = {Non-coding {RNA} analysis using the rfam database},\n  volume  = 62,\n  year    = 2018\n}\n\n@article{kalvari2018rfam,\n  author  = {Kalvari, Ioanna and Argasinska, Joanna and Quinones-Olvera,\n             Natalia and Nawrocki, Eric P and Rivas, Elena and Eddy, Sean R\n             and Bateman, Alex and Finn, Robert D and Petrov, Anton I},\n  journal = {Nucleic Acids Research},\n  month   = jan,\n  number  = {D1},\n  pages   = {D335--D342},\n  title   = {Rfam 13.0: shifting to a genome-centric resource for non-coding {RNA} families},\n  volume  = 46,\n  year    = 2018\n}\n\n@article{nawrocki2015rfam,\n  author    = {Nawrocki, Eric P and Burge, Sarah W and Bateman, Alex and Daub, Jennifer and Eberhardt, Ruth Y and Eddy, Sean R and Floden, Evan W and Gardner, Paul P and Jones, Thomas A and Tate, John and Finn, Robert D},\n  copyright = {http://creativecommons.org/licenses/by/4.0/},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {Database issue},\n  pages     = {D130--7},\n  publisher = {Oxford University Press (OUP)},\n  title     = {Rfam 12.0: updates to the {RNA} families database},\n  volume    = 43,\n  year      = 2015\n}\n\n@article{burge2013rfam,\n  author    = {Burge, Sarah W and Daub, Jennifer and Eberhardt, Ruth and Tate, John and Barquist, Lars 
and Nawrocki, Eric P and Eddy, Sean R and Gardner, Paul P and Bateman, Alex},\n  copyright = {http://creativecommons.org/licenses/by-nc/3.0/},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {Database issue},\n  pages     = {D226--32},\n  publisher = {Oxford University Press (OUP)},\n  title     = {Rfam 11.0: 10 years of {RNA} families},\n  volume    = 41,\n  year      = 2013\n}\n\n@article{gardner2011rfam,\n  author  = {Gardner, Paul P and Daub, Jennifer and Tate, John and Moore, Benjamin L and Osuch, Isabelle H and Griffiths-Jones, Sam and Finn, Robert D and Nawrocki, Eric P and Kolbe, Diana L and Eddy, Sean R and Bateman, Alex},\n  journal = {Nucleic Acids Research},\n  month   = jan,\n  number  = {Database issue},\n  pages   = {D141--5},\n  title   = {Rfam: Wikipedia, clans and the ``decimal'' release},\n  volume  = 39,\n  year    = 2011\n}\n\n@article{gardner2009rfam,\n  author   = {Gardner, Paul P and Daub, Jennifer and Tate, John G and Nawrocki, Eric P and Kolbe, Diana L and Lindgreen, Stinus and Wilkinson, Adam C and Finn, Robert D and Griffiths-Jones, Sam and Eddy, Sean R and Bateman, Alex},\n  journal  = {Nucleic Acids Research},\n  month    = jan,\n  number   = {Database issue},\n  pages    = {D136--40},\n  title    = {Rfam: updates to the {RNA} families database},\n  volume   = 37,\n  year     = 2009\n}\n\n@article{daub2008rna,\n  author   = {Daub, Jennifer and Gardner, Paul P and Tate, John and Ramsk{\\\"o}ld, Daniel and Manske, Magnus and Scott, William G and Weinberg, Zasha and Griffiths-Jones, Sam and Bateman, Alex},\n  journal  = {RNA},\n  month    = dec,\n  number   = 12,\n  pages    = {2462--2464},\n  title    = {The {RNA} {WikiProject}: community annotation of {RNA} families},\n  volume   = 14,\n  year     = 2008\n}\n\n@article{griffiths2005rfam,\n  author   = {Griffiths-Jones, Sam and Moxon, Simon and Marshall, Mhairi and Khanna, Ajay and Eddy, Sean R. 
and Bateman, Alex},\n  doi      = {10.1093/nar/gki081},\n  eprint   = {https://academic.oup.com/nar/article-pdf/33/suppl\\_1/D121/7622063/gki081.pdf},\n  issn     = {0305-1048},\n  journal  = {Nucleic Acids Research},\n  month    = jan,\n  number   = {suppl_1},\n  pages    = {D121-D124},\n  title    = {{Rfam: annotating non-coding RNAs in complete genomes}},\n  url      = {https://doi.org/10.1093/nar/gki081},\n  volume   = {33},\n  year     = {2005}\n}\n\n@article{griffiths2003rfam,\n  author   = {Griffiths-Jones, Sam and Bateman, Alex and Marshall, Mhairi and Khanna, Ajay and Eddy, Sean R.},\n  doi      = {10.1093/nar/gkg006},\n  eprint   = {https://academic.oup.com/nar/article-pdf/31/1/439/7125749/gkg006.pdf},\n  issn     = {0305-1048},\n  journal  = {Nucleic Acids Research},\n  month    = jan,\n  number   = {1},\n  pages    = {439-441},\n  title    = {{Rfam: an RNA family database}},\n  url      = {https://doi.org/10.1093/nar/gkg006},\n  volume   = {31},\n  year     = {2003}\n}\n
"},{"location":"datasets/rivas/","title":"RIVAS","text":"

The RIVAS dataset is a curated collection of RNA sequences and their secondary structures, designed for training and evaluating RNA secondary structure prediction methods. The dataset combines sequences from published studies and databases like Rfam, covering diverse RNA families such as tRNA, SRP RNA, and ribozymes. The secondary structure data is obtained from experimentally verified structures and consensus structures from Rfam alignments, ensuring high-quality annotations for model training and evaluation.
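The secondary structures in this dataset appear to be written in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired ones. A minimal sketch of turning such a string into base-pair indices follows; the function name dot_bracket_to_pairs is hypothetical, it is not part of the dataset tooling, and it ignores pseudoknot brackets:

```python
def dot_bracket_to_pairs(structure):
    # Return sorted (i, j) index pairs for every matched '(' and ')'.
    stack = []
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == '(':
            stack.append(index)
        elif symbol == ')':
            if not stack:
                raise ValueError('unbalanced closing bracket')
            pairs.append((stack.pop(), index))
    if stack:
        raise ValueError('unbalanced opening bracket')
    return sorted(pairs)


# A short balanced hairpin: four paired bases around an eight-base loop.
pairs = dot_bracket_to_pairs('((((........))))')
```

The stack keeps the position of each unmatched opening bracket, so every closing bracket pairs with the most recent open one.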

"},{"location":"datasets/rivas/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the RIVAS dataset by Elena Rivas, et al.

The team releasing RIVAS did not write a dataset card for this dataset, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/rivas/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/rivas/#example-entry","title":"Example Entry","text":"id sequence secondary_structure AACY020454584.1_604-676 ACUGGUUGCGGCCAGUAUAAAUAGUCUUUAAG\u2026 ((((........)))).........((........"},{"location":"datasets/rivas/#column-description","title":"Column Description","text":"

The converted dataset consists of the following columns, each providing specific information about the RNA secondary structures, consistent with the bpRNA standard:

"},{"location":"datasets/rivas/#variations","title":"Variations","text":"

This dataset is available in three variants:

"},{"location":"datasets/rivas/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/rivas/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/rivas/#citation","title":"Citation","text":"BibTeX
@article{rivas2012a,\n  author    = {Rivas, Elena and Lang, Raymond and Eddy, Sean R},\n  journal   = {RNA},\n  month     = feb,\n  number    = 2,\n  pages     = {193--212},\n  publisher = {Cold Spring Harbor Laboratory},\n  title     = {A range of complex probabilistic models for {RNA} secondary structure prediction that includes the nearest-neighbor model and more},\n  volume    = 18,\n  year      = 2012\n}\n
"},{"location":"datasets/rnacentral/","title":"RNAcentral","text":"

RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.

The development of RNAcentral is coordinated by the European Bioinformatics Institute and is supported by Wellcome. Initial funding was provided by BBSRC.

"},{"location":"datasets/rnacentral/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the RNAcentral by the RNAcentral Consortium.

The team releasing RNAcentral did not write a dataset card for this dataset, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/rnacentral/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/rnacentral/#variations","title":"Variations","text":"

This dataset is available in five additional variants:

"},{"location":"datasets/rnacentral/#derived-datasets","title":"Derived Datasets","text":"

In addition to the main RNAcentral dataset, we also provide the following derived datasets:

"},{"location":"datasets/rnacentral/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n

Tip

The original RNAcentral dataset is licensed under the CC0 1.0 Universal license and is available at RNAcentral.

"},{"location":"datasets/rnacentral/#citation","title":"Citation","text":"BibTeX
@article{rnacentral2021,\n  author    = {{RNAcentral Consortium}},\n  doi       = {https://doi.org/10.1093/nar/gkaa921},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {D1},\n  pages     = {D212--D220},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{RNAcentral} 2021: secondary structure integration, improved sequence search and new member databases},\n  url       = {https://academic.oup.com/nar/article/49/D1/D212/5940500},\n  volume    = 49,\n  year      = 2021\n}\n\n@article{sweeney2020exploring,\n  author   = {Sweeney, Blake A. and Tagmazian, Arina A. and Ribas, Carlos E. and Finn, Robert D. and Bateman, Alex and Petrov, Anton I.},\n  doi      = {https://doi.org/10.1002/cpbi.104},\n  eprint   = {https://currentprotocols.onlinelibrary.wiley.com/doi/pdf/10.1002/cpbi.104},\n  journal  = {Current Protocols in Bioinformatics},\n  keywords = {Galaxy, ncRNA, non-coding RNA, RNAcentral, RNA-seq},\n  number   = {1},\n  pages    = {e104},\n  title    = {Exploring Non-Coding RNAs in RNAcentral},\n  url      = {https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.104},\n  volume   = 71,\n  year     = 2020\n}\n\n@article{rnacentral2019,\n  author    = {{The RNAcentral Consortium}},\n  doi       = {https://doi.org/10.1093/nar/gky1034},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {D1},\n  pages     = {D221--D229},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{RNAcentral}: a hub of information for non-coding {RNA} sequences},\n  url       = {https://academic.oup.com/nar/article/47/D1/D221/5160993},\n  volume    = 47,\n  year      = 2019\n}\n\n@article{rnacentral2017,\n  author    = {{The RNAcentral Consortium} and Petrov, Anton I and Kay, Simon J E and Kalvari, Ioanna and Howe, Kevin L and Gray, Kristian A and Bruford, Elspeth A and Kersey, Paul J and Cochrane, Guy and Finn, Robert D and Bateman, Alex and Kozomara, Ana and Griffiths-Jones, Sam and Frankish, Adam and 
Zwieb, Christian W and Lau, Britney Y and Williams, Kelly P and Chan, Patricia P and Lowe, Todd M and Cannone, Jamie J and Gutell, Robin and Machnicka, Magdalena A and Bujnicki, Janusz M and Yoshihama, Maki and Kenmochi, Naoya and Chai, Benli and Cole, James R and Szymanski, Maciej and Karlowski, Wojciech M and Wood, Valerie and Huala, Eva and Berardini, Tanya Z and Zhao, Yi and Chen, Runsheng and Zhu, Weimin and Paraskevopoulou, Maria D and Vlachos, Ioannis S and Hatzigeorgiou, Artemis G and Ma, Lina and Zhang, Zhang and Puetz, Joern and Stadler, Peter F and McDonald, Daniel and Basu, Siddhartha and Fey, Petra and Engel, Stacia R and Cherry, J Michael and Volders, Pieter-Jan and Mestdagh, Pieter and Wower, Jacek and Clark, Michael B and Quek, Xiu Cheng and Dinger, Marcel E},\n  doi       = {https://doi.org/10.1093/nar/gkw1008},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {D1},\n  pages     = {D128--D134},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{RNAcentral}: a comprehensive database of non-coding {RNA} sequences},\n  url       = {https://academic.oup.com/nar/article/45/D1/D128/2333921},\n  volume    = 45,\n  year      = 2017\n}\n\n@article{rnacentral2015,\n  author  = {{RNAcentral Consortium} and Petrov, Anton I and Kay, Simon J E and Gibson, Richard and Kulesha, Eugene and Staines, Dan and Bruford, Elspeth A and Wright, Mathew W and Burge, Sarah and Finn, Robert D and Kersey, Paul J and Cochrane, Guy and Bateman, Alex and Griffiths-Jones, Sam and Harrow, Jennifer and Chan, Patricia P and Lowe, Todd M and Zwieb, Christian W and Wower, Jacek and Williams, Kelly P and Hudson, Corey M and Gutell, Robin and Clark, Michael B and Dinger, Marcel and Quek, Xiu Cheng and Bujnicki, Janusz M and Chua, Nam-Hai and Liu, Jun and Wang, Huan and Skogerb{\\o}, Geir and Zhao, Yi and Chen, Runsheng and Zhu, Weimin and Cole, James R and Chai, Benli and Huang, Hsien-Da and Huang, His-Yuan and Cherry, J Michael and Hatzigeorgiou, 
Artemis and Pruitt, Kim D},\n  doi     = {https://doi.org/10.1093/nar/gku991},\n  journal = {Nucleic Acids Research},\n  month   = jan,\n  number  = {Database issue},\n  pages   = {D123--D129},\n  title   = {{RNAcentral}: an international database of {ncRNA} sequences},\n  url     = {https://academic.oup.com/nar/article/43/D1/D123/2439941},\n  volume  = 43,\n  year    = 2015\n}\n\n@article{bateman2011rnacentral,\n  author    = {Bateman, Alex and Agrawal, Shipra and Birney, Ewan and Bruford, Elspeth A and Bujnicki, Janusz M and Cochrane, Guy and Cole, James R and Dinger, Marcel E and Enright, Anton J and Gardner, Paul P and Gautheret, Daniel and Griffiths-Jones, Sam and Harrow, Jen and Herrero, Javier and Holmes, Ian H and Huang, Hsien-Da and Kelly, Krystyna A and Kersey, Paul and Kozomara, Ana and Lowe, Todd M and Marz, Manja and Moxon, Simon and Pruitt, Kim D and Samuelsson, Tore and Stadler, Peter F and Vilella, Albert J and Vogel, Jan-Hinnerk and Williams, Kelly P and Wright, Mathew W and Zwieb, Christian},\n  doi       = {https://doi.org/10.1261/rna.2750811},\n  journal   = {RNA},\n  month     = nov,\n  number    = 11,\n  pages     = {1941--1946},\n  publisher = {Cold Spring Harbor Laboratory},\n  title     = {{RNAcentral}: A vision for an international database of {RNA} sequences},\n  url       = {https://rnajournal.cshlp.org/content/17/11/1941.long},\n  volume    = 17,\n  year      = 2011\n}\n
"},{"location":"datasets/rnastralign/","title":"RNAStrAlign","text":"

RNAStrAlign is a comprehensive dataset of RNA sequences and their secondary structures.

RNAStrAlign aggregates data from multiple established RNA structure repositories, covering diverse RNA families such as 5S ribosomal RNA, tRNA, and group I introns.

It is considered complementary to the ArchiveII dataset.

"},{"location":"datasets/rnastralign/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the RNAStrAlign by Zhen Tan, et al.

The team releasing RNAStrAlign did not write a dataset card for this dataset, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/rnastralign/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/rnastralign/#example-entry","title":"Example Entry","text":"id sequence secondary_structure family subfamily 16S_rRNA-Actinobacteria-AB002635 ACACAUGCAAGCGAACGUGAUCUCCAGCUUGC\u2026 .(((.(((..((..((((.(((((.((....)\u2026 16S_rRNA Actinobacteria"},{"location":"datasets/rnastralign/#column-description","title":"Column Description","text":""},{"location":"datasets/rnastralign/#variations","title":"Variations","text":"

This dataset is available in two additional variants:

"},{"location":"datasets/rnastralign/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/rnastralign/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/rnastralign/#citation","title":"Citation","text":"BibTeX
@article{ran2017turbofold,\n  author   = {Tan, Zhen and Fu, Yinghan and Sharma, Gaurav and Mathews, David H},\n  journal  = {Nucleic Acids Research},\n  month    = nov,\n  number   = 20,\n  pages    = {11570--11581},\n  title    = {{TurboFold} {II}: {RNA} structural alignment and secondary structure prediction informed by multiple homologs},\n  volume   = 45,\n  year     = 2017\n}\n
"},{"location":"datasets/ryos/","title":"RYOS","text":"

RYOS is a database of RNA backbone stability in aqueous solution.

RYOS focuses on exploring the stability of mRNA molecules for vaccine applications. This dataset is part of a broader effort to address one of the key challenges of mRNA vaccines: degradation during shipping and storage.

"},{"location":"datasets/ryos/#statement","title":"Statement","text":"

Deep learning models for predicting RNA degradation via dual crowdsourcing is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.

Machine learning has been at the forefront of the movement for free and open access to research.

We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.

The MultiMolecule team is committed to the principles of open access and open science.

We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.

Please consider signing the Statement on Nature Machine Intelligence.

"},{"location":"datasets/ryos/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the RYOS by Hannah K. Wayment-Steele, et al.

The team releasing RYOS did not write a dataset card for this dataset, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/ryos/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/ryos/#example-entry","title":"Example Entry","text":"id design sequence secondary_structure reactivity errors_reactivity signal_to_noise_reactivity deg_pH10 errors_deg_pH10 signal_to_noise_deg_pH10 deg_50C errors_deg_50C signal_to_noise_deg_50C deg_Mg_pH10 errors_deg_Mg_pH10 signal_to_noise_deg_Mg_pH10 deg_Mg_50C errors_deg_Mg_50C signal_to_noise_deg_Mg_50C SN_filter 9830366 testing GGAAAUUUGC\u2026 .......(((\u2026 [0.4167, 1.5941, 1.2359, \u2026] [0.1689, 0.2323, 0.193, \u2026] 5.326 [1.5966, 2.6482, 1.3761, \u2026] [0.3058, 0.3294, 0.233, \u2026] 4.198 [0.7885, 1.93, 2.0423, \u2026] 3.746 [0.2773, 0.328, 0.3048, \u2026] [1.5966, 2.6482, 1.3761, \u2026] [0.3058, 0.3294, 0.233, \u2026] 4.198 [0.7885, 1.93, 2.0423, \u2026] [0.2773, 0.328, 0.3048, \u2026] 3.746 True"},{"location":"datasets/ryos/#column-description","title":"Column Description","text":"

Note that due to technical limitations, the ground truth measurements are not available for the final bases of each RNA sequence, resulting in a shorter length for the provided labels compared to the full sequence.
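Because the labels are shorter than the sequence, a training loop has to avoid scoring the unlabeled final bases. One common way to handle this, sketched here with toy numbers rather than real RYOS values, is to pad the labels with NaN and skip those positions in the loss. The helpers pad_labels and masked_mse are hypothetical and are not part of the dataset or of MultiMolecule:

```python
import math


def pad_labels(labels, sequence_length, fill=math.nan):
    # Extend per-base labels to the full sequence length with a sentinel.
    return list(labels) + [fill] * (sequence_length - len(labels))


def masked_mse(predictions, labels):
    # Mean squared error over positions that carry a real measurement.
    pairs = [(p, y) for p, y in zip(predictions, labels) if not math.isnan(y)]
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)


sequence = 'GGAAAUUUGC'            # toy 10-base sequence
measured = [0.5, 1.5, 1.0]         # toy labels covering only the first bases
padded = pad_labels(measured, len(sequence))
predictions = [1.0] * len(sequence)
loss = masked_mse(predictions, padded)
```

Only the three measured positions contribute to the loss; the seven padded positions are ignored.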

"},{"location":"datasets/ryos/#variations","title":"Variations","text":"

This dataset is available in two subsets:

"},{"location":"datasets/ryos/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/ryos/#citation","title":"Citation","text":"BibTeX
@article{waymentsteele2021deep,\n  author  = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Watkins, Andrew M and Kim, Do Soon and Tunguz, Bojan and Reade, Walter and Demkin, Maggie and Romano, Jonathan and Wellington-Oguri, Roger and Nicol, John J and Gao, Jiayang and Onodera, Kazuki and Fujikawa, Kazuki and Mao, Hanfei and Vandewiele, Gilles and Tinti, Michele and Steenwinckel, Bram and Ito, Takuya and Noumi, Taiga and He, Shujun and Ishi, Keiichiro and Lee, Youhan and {\\\"O}zt{\\\"u}rk, Fatih and Chiu, Anthony and {\\\"O}zt{\\\"u}rk, Emin and Amer, Karim and Fares, Mohamed and Participants, Eterna and Das, Rhiju},\n  journal = {ArXiv},\n  month   = oct,\n  title   = {Deep learning models for predicting {RNA} degradation via dual crowdsourcing},\n  year    = 2021\n}\n
"},{"location":"models/","title":"models","text":"

models provides a collection of pre-trained models.

"},{"location":"models/#model-class","title":"Model Class","text":"

In the transformers library, the names of model classes can sometimes be misleading. While these classes support both regression and classification tasks, their names often include xxxForSequenceClassification, which may imply they are only for classification.

To avoid this ambiguity, MultiMolecule provides a set of model classes with clear, intuitive names that reflect their intended use:

Each of these models supports both regression and classification tasks, offering flexibility and precision for a wide range of applications.

"},{"location":"models/#contact-prediction","title":"Contact Prediction","text":"

Contact prediction assigns a label to each pair of tokens in a sentence. One of the most common contact prediction tasks is protein distance map prediction, which attempts to find the distance between all possible amino acid residue pairs of a three-dimensional protein structure.
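To make the pairwise target concrete, here is a minimal sketch that builds a distance map from 3D coordinates and thresholds it into a binary contact map. The coordinates are toy values, the function names are hypothetical, and the 8.0 cutoff is an assumed, commonly used convention rather than anything MultiMolecule prescribes:

```python
import math


def distance_map(coords):
    # Euclidean distance between every pair of positions (an n x n matrix).
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def contact_map(coords, threshold=8.0):
    # A pair counts as a contact when its distance is below the threshold.
    return [[d < threshold for d in row] for row in distance_map(coords)]


coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]  # toy coordinates
dmap = distance_map(coords)
cmap = contact_map(coords)
```

Regressing on dmap is the regression form of the task, while predicting cmap is the classification form; both produce one label per token pair.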

"},{"location":"models/#nucleotide-prediction","title":"Nucleotide Prediction","text":"

Similar to Token Classification, but removes the <bos> token and the <eos> token if they are defined in the model config.

<bos> and <eos> tokens

In the tokenizers provided by MultiMolecule, the <bos> token points to the <cls> token, and the <sep> token points to the <eos> token.
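The effect of dropping the special tokens can be sketched on plain lists. This is only an illustration of the idea, not the MultiMolecule implementation; strip_special_tokens and its flags are hypothetical names:

```python
def strip_special_tokens(token_outputs, has_bos=True, has_eos=True):
    # Drop the first and last positions when they hold <bos> and <eos>.
    start = 1 if has_bos else 0
    end = len(token_outputs) - 1 if has_eos else len(token_outputs)
    return token_outputs[start:end]


tokens = ['<cls>', 'U', 'A', 'G', '<eos>']
scores = [0.0, 0.1, 0.2, 0.3, 0.0]      # one toy score per token
nucleotide_scores = strip_special_tokens(scores)
```

After stripping, the outputs line up one-to-one with the nucleotides, so per-nucleotide labels need no offset.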

"},{"location":"models/#usage","title":"Usage","text":""},{"location":"models/#build-with-multimoleculeautomodels","title":"Build with multimolecule.AutoModels","text":"Python
from transformers import AutoTokenizer\n\nfrom multimolecule import AutoModelForSequencePrediction\n\nmodel = AutoModelForSequencePrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"models/#direct-access","title":"Direct Access","text":"

All models can be directly loaded with the from_pretrained method.

Python
from multimolecule.models import RnaFmForTokenPrediction, RnaTokenizer\n\nmodel = RnaFmForTokenPrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"models/#build-with-transformersautomodels","title":"Build with transformers.AutoModels","text":"

While we use a different naming convention for model classes, the models are still registered to corresponding transformers.AutoModels.

Python
from transformers import AutoModelForSequenceClassification, AutoTokenizer\n\nimport multimolecule  # noqa: F401\n\nmodel = AutoModelForSequenceClassification.from_pretrained(\"multimolecule/mrnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/mrnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n

import multimolecule before use

Note that you must import multimolecule before building the model using transformers.AutoModel. The registration of models is done in the multimolecule package, and the models are not available in the transformers package.

The following error will be raised if you do not import multimolecule before using transformers.AutoModel:

Python
ValueError: The checkpoint you are trying to load has model type `rnafm` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.\n
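The import requirement follows from how registries behave in general: a lookup can only succeed after the registering code has run. The toy sketch below shows the pattern only; it is not how transformers is implemented, and REGISTRY, register, build and ToyRnaFm are hypothetical names:

```python
REGISTRY = {}


def register(model_type, model_class):
    # Stands in for the registration that importing a package performs.
    REGISTRY[model_type] = model_class


def build(model_type):
    # Fail for unknown types, succeed once the type has been registered.
    if model_type not in REGISTRY:
        raise ValueError('model type ' + model_type + ' is not recognized')
    return REGISTRY[model_type]()


class ToyRnaFm:
    pass


try:
    build('rnafm')          # nothing registered yet, so this raises
    failed = False
except ValueError:
    failed = True

register('rnafm', ToyRnaFm)  # the step that the import takes care of
model = build('rnafm')
```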
"},{"location":"models/#initialize-a-vanilla-model","title":"Initialize a vanilla model","text":"

You can also initialize a vanilla model using the model class.

Python
from multimolecule.models import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n\nconfig = RnaFmConfig()\nmodel = RnaFmForTokenPrediction(config)\ntokenizer = RnaTokenizer()\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"models/#available-models","title":"Available Models","text":""},{"location":"models/#deoxyribonucleic-acid-dna","title":"DeoxyriboNucleic Acid (DNA)","text":""},{"location":"models/#ribonucleic-acid-rna","title":"RiboNucleic Acid (RNA)","text":""},{"location":"models/calm/","title":"CaLM","text":"

Pre-trained model on protein-coding DNA (cDNA) using a masked language modeling (MLM) objective.

"},{"location":"models/calm/#statement","title":"Statement","text":"

Codon language embeddings provide strong signals for use in protein engineering is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.

Machine learning has been at the forefront of the movement for free and open access to research.

We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.

The MultiMolecule team is committed to the principles of open access and open science.

We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.

Please consider signing the Statement on Nature Machine Intelligence.

"},{"location":"models/calm/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of the Codon language embeddings provide strong signals for use in protein engineering by Carlos Outeiral and Charlotte M. Deane.

The OFFICIAL repository of CaLM is at oxpig/CaLM.

Warning

The MultiMolecule team is unable to confirm that the provided model and checkpoints are producing the same intermediate representations as the original implementation. This is because

The proposed method is published in a Closed Access / Author-Fee journal.

The team releasing CaLM did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/calm/#model-details","title":"Model Details","text":"

CaLM is a BERT-style model pre-trained on a large corpus of protein-coding DNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of DNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/calm/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 12 768 12 3072 85.75 22.36 11.17 1024"},{"location":"models/calm/#links","title":"Links","text":""},{"location":"models/calm/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/calm/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/calm\")\n>>> unmasker(\"agc<mask>cattatggcgaaccttggctgctg\")\n\n[{'score': 0.011160684749484062,\n  'token': 100,\n  'token_str': 'UUN',\n  'sequence': 'AGC UUN CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.01067513320595026,\n  'token': 117,\n  'token_str': 'NGC',\n  'sequence': 'AGC NGC CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.010549729689955711,\n  'token': 127,\n  'token_str': 'NNC',\n  'sequence': 'AGC NNC CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.0103579331189394,\n  'token': 51,\n  'token_str': 'CNA',\n  'sequence': 'AGC CNA CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.010322545655071735,\n  'token': 77,\n  'token_str': 'GNC',\n  'sequence': 'AGC GNC CAU UAU GGC GAA CCU UGG CUG CUG'}]\n
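Judging from the sequence field in the output above, the input seems to be upper-cased, converted from T to U, and split into three-nucleotide codons before prediction. A rough sketch of that preprocessing, assuming exactly this behavior; to_codons is a hypothetical helper rather than the actual tokenizer, and it ignores the mask token and lengths that are not a multiple of three:

```python
def to_codons(sequence):
    # Upper-case, map DNA T to RNA U, then cut into non-overlapping triplets.
    rna = sequence.upper().replace('T', 'U')
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]


# The example input from above with the mask token left out.
codons = to_codons('agccattatggcgaaccttggctgctg')
```

This explains why the predicted token_str values are codons such as UUN rather than single nucleotides.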
"},{"location":"models/calm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/calm/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, CaLmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmModel.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/calm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, CaLmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmForSequencePrediction.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/calm/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, CaLmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmForTokenPrediction.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/calm/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, CaLmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmForContactPrediction.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/calm/#training-details","title":"Training Details","text":"

CaLM used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 25% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
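
The masking step described above can be sketched in plain Python. This is a simplified illustration, not the CaLM training code: the function name mask_codons, the '<mask>' placeholder string, and the fixed seed are assumptions made for the example, and every selected position is simply replaced by the mask token.

```python
import random

MASK_TOKEN = '<mask>'  # placeholder string, assumed for this sketch


def mask_codons(codons, mask_ratio=0.25, seed=0):
    """Mask a fraction of codon tokens and remember the originals as labels."""
    rng = random.Random(seed)
    # Always mask at least one token so that there is something to predict.
    n_masked = max(1, round(len(codons) * mask_ratio))
    positions = sorted(rng.sample(range(len(codons)), n_masked))
    masked = list(codons)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]  # the token the model has to recover
        masked[pos] = MASK_TOKEN
    return masked, labels


codons = ['AUG', 'GCC', 'AGU', 'CGC', 'UGA', 'CAG', 'CCG', 'CGG']
masked, labels = mask_codons(codons)
print(masked)
print(labels)
```

With eight codons and a 25% ratio, two positions are masked; the model sees the masked list as input and is scored only on the positions stored in labels.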

"},{"location":"models/calm/#training-data","title":"Training Data","text":"

The CaLM model was pre-trained on coding sequences of all organisms available on the European Nucleotide Archive (ENA). The European Nucleotide Archive provides a comprehensive record of the world\u2019s nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.

CaLM collected coding sequences of all organisms from ENA in April 2022, including 114,214,475 sequences. Only high-level assembly information (dataclass CON) was used. Sequences matching the following criteria were filtered out:

To reduce redundancy, CaLM grouped the entries by organism and applied CD-HIT (CD-HIT-EST) with a cut-off at 40% sequence identity to the translated protein sequences.

The final dataset contains 9,858,385 cDNA sequences.

Note that the alphabet in the original implementation is RNA instead of DNA; therefore, we use RnaTokenizer to tokenize the sequences. RnaTokenizer of multimolecule will convert \u201cT\u201ds to \u201cU\u201ds for you. You may disable this behaviour by passing replace_T_with_U=False.

"},{"location":"models/calm/#training-procedure","title":"Training Procedure","text":""},{"location":"models/calm/#preprocessing","title":"Preprocessing","text":"

CaLM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:

"},{"location":"models/calm/#pretraining","title":"PreTraining","text":"

The model was trained on 4 NVIDIA Quadro RTX4000 GPUs, each with 8 GiB of memory.

"},{"location":"models/calm/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article {outeiral2022coodn,\n    author = {Outeiral, Carlos and Deane, Charlotte M.},\n    title = {Codon language embeddings provide strong signals for protein engineering},\n    elocation-id = {2022.12.15.519894},\n    year = {2022},\n    doi = {10.1101/2022.12.15.519894},\n    publisher = {Cold Spring Harbor Laboratory},\n    abstract = {Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with recent models{\\textquoteright} capacities surpassing the size of the very datasets they were trained on. Here, we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks. In some tasks, like species recognition, prediction of protein and transcript abundance, or melting point estimation, we show that a language model trained on codons outperforms every other published protein language model, including some that contain over 50 times more parameters. These results suggest that, in addition to commonly studied scale and model complexity, the information content of biological data provides an orthogonal direction to improve the power of machine learning in biology.},\n    URL = {https://www.biorxiv.org/content/early/2022/12/19/2022.12.15.519894},\n    eprint = {https://www.biorxiv.org/content/early/2022/12/19/2022.12.15.519894.full.pdf},\n    journal = {bioRxiv}\n}\n
"},{"location":"models/calm/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the CaLM paper for questions or comments on the paper/model.

"},{"location":"models/calm/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/calm/#multimolecule.models.calm","title":"multimolecule.models.calm","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default: None): alphabet to use for tokenization.

nmers (int, default: 1): Size of kmer to tokenize.

codon (bool, default: False): Whether to tokenize into codons.

replace_T_with_U (bool, default: True): Whether to replace T with U.

do_upper_case (bool, default: True): Whether to convert input to uppercase.

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
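
The difference between nmers=3 and codon=True in the examples above comes down to overlapping versus non-overlapping windows. The following standalone sketch mirrors that splitting logic without depending on multimolecule; the helper names normalize, kmer_tokens and codon_tokens are made up for illustration and are not part of the library API.

```python
def normalize(seq, replace_t_with_u=True):
    """Uppercase the sequence and optionally rewrite T as U."""
    seq = seq.upper()
    return seq.replace('T', 'U') if replace_t_with_u else seq


def kmer_tokens(seq, k):
    """Overlapping windows of size k with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def codon_tokens(seq):
    """Non-overlapping windows of size 3; the length must be a multiple of 3."""
    if len(seq) % 3 != 0:
        raise ValueError(f'length must be a multiple of 3, but got {len(seq)}')
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


seq = normalize('uagcttatc')  # DNA-style input, rewritten to 'UAGCUUAUC'
print(kmer_tokens(seq, 3))    # 7 overlapping 3-mers
print(codon_tokens(seq))      # 3 codons
```

For a nine-nucleotide input the stride-1 split yields seven tokens and the codon split yields three, which matches the lengths of the id lists shown in the examples (excluding the two special tokens at the ends).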
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig","title":"CaLmConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a CaLmModel. It is used to instantiate a CaLM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the CaLM oxpig/CaLM architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int, default: 131): Vocabulary size of the CaLM model. Defines the number of different tokens that can be represented by the input_ids passed when calling CaLmModel.

hidden_size (int, default: 768): Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, default: 12): Number of hidden layers in the Transformer encoder.

num_attention_heads (int, default: 12): Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, default: 3072): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.

hidden_dropout (float, default: 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, default: 0.1): The dropout ratio for the attention probabilities.

max_position_embeddings (int, default: 1026): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, default: 1e-12): The epsilon used by the layer normalization layers.

position_embedding_type (str, default: 'rotary'): Type of position embedding. Choose one of \"absolute\", \"relative_key\", \"relative_key_query\", \"rotary\". For positional embeddings use \"absolute\". For more information on \"relative_key\", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, default: False): Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, default: True): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

emb_layer_norm_before (bool, default: False): Whether to apply layer normalization after embeddings but before the main stem of the network.

token_dropout (bool, default: False): When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.

Examples:

Python Console Session
>>> from multimolecule import CaLmModel, CaLmConfig\n>>> # Initializing a CaLM multimolecule/calm style configuration\n>>> configuration = CaLmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/calm style configuration\n>>> model = CaLmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
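
As a rough cross-check of the default configuration against the parameter count of about 85.75M given in the model specification, the encoder size can be estimated by hand. The sketch below is a back-of-the-envelope calculation under stated assumptions, not a readout of the actual model: it assumes bias terms on every linear layer, two layer normalizations per block, and no learned position parameters (the default position_embedding_type is 'rotary'), and it ignores task heads and any final normalization. The function name is made up for this example.

```python
def estimate_encoder_parameters(
    vocab_size=131,
    hidden_size=768,
    num_hidden_layers=12,
    intermediate_size=3072,
):
    """Estimate embedding plus encoder parameters of a BERT-style model."""
    # Token embedding table.
    embeddings = vocab_size * hidden_size
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Two feed-forward projections, each with a bias.
    feed_forward = (hidden_size * intermediate_size + intermediate_size) + (
        intermediate_size * hidden_size + hidden_size
    )
    # Two layer norms per block, each with a weight and a bias vector.
    layer_norms = 2 * 2 * hidden_size
    per_layer = attention + feed_forward + layer_norms
    return embeddings + num_hidden_layers * per_layer


total = estimate_encoder_parameters()
print(f'{total / 1e6:.2f}M')  # prints 85.16M
```

The estimate lands in the same range as the documented figure; the remaining gap is plausibly accounted for by components left out of this sketch, such as the language modeling head.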
Source code in multimolecule/models/calm/configuration_calm.py Python
class CaLmConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`CaLmModel`][multimolecule.models.CaLmModel]. It\n    is used to instantiate a CaLM model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the CaLM\n    [oxpig/CaLM](https://github.com/oxpig/CaLM) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the CaLM model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`CaLmModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        position_embedding_type:\n            Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n            For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n            For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n        is_decoder:\n            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n        use_cache:\n            Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n            relevant if `config.is_decoder=True`.\n        emb_layer_norm_before:\n            Whether to apply layer normalization after embeddings but before the main stem of the network.\n        token_dropout:\n            When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n    Examples:\n        >>> from multimolecule import CaLmModel, CaLmConfig\n        >>> # Initializing a CaLM multimolecule/calm style configuration\n        >>> configuration = CaLmConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/calm style configuration\n        >>> model = CaLmModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"calm\"\n\n    def __init__(\n        self,\n        vocab_size: int = 131,\n        codon: bool = True,\n        hidden_size: int = 768,\n        num_hidden_layers: int = 12,\n        num_attention_heads: int = 12,\n        intermediate_size: int = 3072,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1026,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"rotary\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        emb_layer_norm_before: bool = False,\n        token_dropout: bool = False,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.codon = codon\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = 
hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.emb_layer_norm_before = emb_layer_norm_before\n        self.token_dropout = token_dropout\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(position_embedding_type)","title":"position_embedding_type","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(is_decoder)","title":"is_decoder","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(use_cache)","title":"use_cache","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(emb_layer_norm_before)","title":"emb_layer_norm_before","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(token_dropout)","title":"token_dropout","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmForContactPrediction","title":"CaLmForContactPrediction","text":"

Bases: CaLmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import CaLmConfig, CaLmForContactPrediction, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
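
The grad_fn in the example output indicates that, with the default head configuration, the loss is a binary cross-entropy computed on raw logits. The following is a minimal pure-Python sketch of that loss for a single logit and its mean over several entries, written in the numerically stable form; the function names are assumptions for illustration, and this is not the head implementation used by multimolecule.

```python
import math


def bce_with_logits(logit, target):
    """Stable binary cross-entropy for one raw logit and a 0/1 target."""
    # Equivalent to -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits, targets):
    """Average the element-wise loss over paired logits and targets."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)


# A zero logit means probability 0.5, so the loss is log(2) for either label.
print(bce_with_logits(0.0, 1))
print(mean_bce_with_logits([2.0, -1.0, 0.5], [1, 0, 1]))
```

For contact prediction, each entry of the pairwise logit map would contribute one such term, with the label map supplying the 0/1 targets.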
Source code in multimolecule/models/calm/modeling_calm.py Python
class CaLmForContactPrediction(CaLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import CaLmConfig, CaLmForContactPrediction, RnaTokenizer\n        >>> config = CaLmConfig()\n        >>> model = CaLmForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: CaLmConfig):\n        super().__init__(config)\n        self.calm = CaLmModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.calm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmForMaskedLM","title":"CaLmForMaskedLM","text":"

Bases: CaLmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import CaLmConfig, CaLmForMaskedLM, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 131])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py Python
class CaLmForMaskedLM(CaLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import CaLmConfig, CaLmForMaskedLM, RnaTokenizer\n        >>> config = CaLmConfig()\n        >>> model = CaLmForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 131])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: CaLmConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `CaLmForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.calm = CaLmModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config, self.calm.embeddings.word_embeddings.weight)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.calm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
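
The tail of forward follows a common return convention: a plain tuple when return_dict is false, a structured output object otherwise. The standalone sketch below reproduces only that packaging step with dummy values. SketchOutput and package_outputs are hypothetical names, and the assumed layout of the backbone tuple (last hidden state, pooled output, hidden states, attentions) is an illustration, not a statement about what CaLmModel returns.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class SketchOutput:
    """Stand-in for a structured model output (hypothetical)."""

    loss: Optional[Any]
    logits: Any


def package_outputs(logits, loss, backbone_outputs: Tuple, return_dict: bool):
    """Package head results as a tuple or as a structured output."""
    if not return_dict:
        # Drop the first two backbone entries and keep the optional extras.
        output = (logits,) + tuple(backbone_outputs[2:])
        # Prepend the loss only when labels were given and a loss exists.
        return ((loss,) + output) if loss is not None else output
    return SketchOutput(loss=loss, logits=logits)


backbone = ('last_hidden', 'pooled', 'hidden_states', 'attentions')
print(package_outputs('logits', None, backbone, return_dict=False))
print(package_outputs('logits', 0.5, backbone, return_dict=False))
print(package_outputs('logits', 0.5, backbone, return_dict=True))
```

In the tuple form the position of the logits depends on whether a loss is present, which is one reason the structured form is usually easier to consume.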
"},{"location":"models/calm/#multimolecule.models.calm.CaLmForSequencePrediction","title":"CaLmForSequencePrediction","text":"

Bases: CaLmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import CaLmConfig, CaLmForSequencePrediction, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py Python
class CaLmForSequencePrediction(CaLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import CaLmConfig, CaLmForSequencePrediction, RnaTokenizer\n        >>> config = CaLmConfig()\n        >>> model = CaLmForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: CaLmConfig):\n        super().__init__(config)\n        self.calm = CaLmModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.calm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
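The tuple branch of `forward` above (taken when `return_dict` is false) can be hard to follow in flattened form. The following is a minimal sketch of that assembly in plain Python, with strings standing in for tensors; `assemble_tuple_output` is a hypothetical helper name, not part of multimolecule.

```python
def assemble_tuple_output(logits, loss, base_outputs):
    """Mirror the `if not return_dict` branch using plain tuples."""
    # The base model's tuple starts with (last_hidden_state, pooler_output),
    # so those two entries are skipped and the head's logits take their place.
    output = (logits,) + tuple(base_outputs[2:])
    # The loss is prepended only when labels were given and a loss was computed.
    return ((loss,) + output) if loss is not None else output


# Placeholder strings stand in for tensors (illustrative values only).
base = ("last_hidden_state", "pooler_output", "hidden_states", "attentions")
with_loss = assemble_tuple_output("logits", 0.5, base)
without_loss = assemble_tuple_output("logits", None, base)
```

Under these assumptions, callers that pass `labels` find the loss at index 0 and the logits at index 1; without labels, the logits come first.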
"},{"location":"models/calm/#multimolecule.models.calm.CaLmForTokenPrediction","title":"CaLmForTokenPrediction","text":"

Bases: CaLmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import CaLmConfig, CaLmForTokenPrediction, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py Python
class CaLmForTokenPrediction(CaLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import CaLmConfig, CaLmForTokenPrediction, RnaTokenizer\n        >>> config = CaLmConfig()\n        >>> model = CaLmForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: CaLmConfig):\n        super().__init__(config)\n        self.calm = CaLmModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.calm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmModel","title":"CaLmModel","text":"

Bases: CaLmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import CaLmConfig, CaLmModel, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/calm/modeling_calm.py Python
class CaLmModel(CaLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import CaLmConfig, CaLmModel, RnaTokenizer\n        >>> config = CaLmConfig()\n        >>> model = CaLmModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 768])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 768])\n    \"\"\"\n\n    def __init__(self, config: CaLmConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = CaLmEmbeddings(config)\n        self.encoder = CaLmEncoder(config)\n        self.pooler = CaLmPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        
else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape 
[num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

encoder_hidden_states (Tensor | None, default: None)

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask (Tensor | None, default: None)

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values are selected in [0, 1]:

- 1 for tokens that are not masked,
- 0 for tokens that are masked.

past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default: None)

Tuple of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head).

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

use_cache (bool | None, default: None)

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/calm/modeling_calm.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        attention_mask=attention_mask,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
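When no `attention_mask` is passed, the `forward` source above derives one by comparing `input_ids` against the pad token id, and falls back to an all-ones mask if no pad token is configured. Below is a minimal pure-Python sketch of that rule, using nested lists in place of tensors; `default_attention_mask` is a hypothetical name, and the sketch ignores the `past_key_values_length` term that the original adds to the all-ones shape.

```python
def default_attention_mask(input_ids, pad_token_id):
    """Return 1 for real tokens and 0 for padding, one row per sequence."""
    if pad_token_id is None:
        # No pad token configured: attend to every position.
        return [[1] * len(row) for row in input_ids]
    # List-based equivalent of `input_ids.ne(pad_token_id)`.
    return [[int(token != pad_token_id) for token in row] for row in input_ids]


# Illustrative batch: the second row is padded with id 0.
batch = [[1, 6, 7, 8, 2], [1, 6, 2, 0, 0]]
mask = default_attention_mask(batch, pad_token_id=0)
no_pad_mask = default_attention_mask(batch, pad_token_id=None)
```

The token ids in `batch` are made up for illustration and do not correspond to a real tokenizer vocabulary.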
"},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmPreTrainedModel","title":"CaLmPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/calm/modeling_calm.py Python
class CaLmPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = CaLmConfig\n    base_model_prefix = \"calm\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"CaLmLayer\", \"CaLmEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/configuration_utils/","title":"configuration_utils","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils","title":"multimolecule.models.configuration_utils","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig","title":"HeadConfig","text":"

Bases: BaseHeadConfig

Configuration class for a prediction head.

Parameters:

num_labels (Optional[int], default: None)

Number of labels to use in the last layer added to the model, typically for a classification task.

If None, the head falls back to Config.num_labels.

problem_type (Optional[str], default: None)

Problem type for XxxForYyyPrediction models. Can be one of \"binary\", \"regression\", \"multiclass\" or \"multilabel\".

If None, the head falls back to Config.problem_type.

hidden_size (Optional[int], default: None)

Dimensionality of the encoder layers and the pooler layer.

If None, the head falls back to Config.hidden_size.

dropout (float, default: 0.0)

The dropout ratio for the hidden states.

transform (Optional[str], default: None)

The transform operation applied to hidden states.

transform_act (Optional[str], default: \"gelu\")

The activation function of the transform applied to hidden states.

bias (bool, default: True)

Whether to apply bias to the final prediction layer.

act (Optional[str], default: None)

The activation function of the final prediction output.

layer_norm_eps (float, default: 1e-12)

The epsilon used by the layer normalization layers.

output_name (Optional[str], default: None)

The name of the tensor required in model outputs.

If None, the default output name of the corresponding head is used.

type (Optional[str], default: None)

The type of the head in the model.

This is used by MultiMoleculeModel to construct heads.

Source code in multimolecule/module/heads/config.py Python
class HeadConfig(BaseHeadConfig):\n    r\"\"\"\n    Configuration class for a prediction head.\n\n    Args:\n        num_labels:\n            Number of labels to use in the last layer added to the model, typically for a classification task.\n\n            Head should look for [`Config.num_labels`][multimolecule.PreTrainedConfig] if is `None`.\n        problem_type:\n            Problem type for `XxxForYyyPrediction` models. Can be one of `\"binary\"`, `\"regression\"`,\n            `\"multiclass\"` or `\"multilabel\"`.\n\n            Head should look for [`Config.problem_type`][multimolecule.PreTrainedConfig] if is `None`.\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n\n            Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n        dropout:\n            The dropout ratio for the hidden states.\n        transform:\n            The transform operation applied to hidden states.\n        transform_act:\n            The activation function of transform applied to hidden states.\n        bias:\n            Whether to apply bias to the final prediction layer.\n        act:\n            The activation function of the final prediction output.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        output_name:\n            The name of the tensor required in model outputs.\n\n            If is `None`, will use the default output name of the corresponding head.\n        type:\n            The type of the head in the model.\n\n            This is used by [`MultiMoleculeModel`][multimolecule.MultiMoleculeModel] to construct heads.\n    \"\"\"\n\n    num_labels: Optional[int] = None\n    problem_type: Optional[str] = None\n    hidden_size: Optional[int] = None\n    dropout: float = 0.0\n    transform: Optional[str] = None\n    transform_act: Optional[str] = \"gelu\"\n    bias: bool = True\n    act: Optional[str] = None\n    layer_norm_eps: 
float = 1e-12\n    output_name: Optional[str] = None\n    type: Optional[str] = None\n
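The docstring above says a head should look up `num_labels`, `problem_type`, and `hidden_size` on the model config when they are `None`. The following is a minimal sketch of how such a fallback could work, using hypothetical stand-in dataclasses (`SketchHeadConfig`, `SketchModelConfig`) and a hypothetical helper `resolve_head_config`; it is not multimolecule's actual resolution code.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class SketchHeadConfig:
    # Hypothetical stand-in mirroring three HeadConfig fields.
    num_labels: Optional[int] = None
    problem_type: Optional[str] = None
    hidden_size: Optional[int] = None


@dataclass
class SketchModelConfig:
    # Hypothetical stand-in for a model-level config.
    num_labels: int = 1
    problem_type: Optional[str] = None
    hidden_size: int = 768


def resolve_head_config(head, model):
    """Fill every unset (None) head field from the model-level config."""
    updates = {
        name: getattr(model, name)
        for name in ("num_labels", "problem_type", "hidden_size")
        if getattr(head, name) is None
    }
    # `replace` returns a new dataclass instance; the inputs are not mutated.
    return replace(head, **updates)


resolved = resolve_head_config(SketchHeadConfig(num_labels=3), SketchModelConfig())
```

Here the explicitly set `num_labels=3` is kept, while `hidden_size` is taken from the model-level value.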
"},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(num_labels)","title":"num_labels","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(problem_type)","title":"problem_type","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(dropout)","title":"dropout","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(transform)","title":"transform","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(transform_act)","title":"transform_act","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(bias)","title":"bias","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(act)","title":"act","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(output_name)","title":"output_name","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(type)","title":"type","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig","title":"MaskedLMHeadConfig","text":"

Bases: BaseHeadConfig

Configuration class for a Masked Language Modeling head.

Parameters:

hidden_size (Optional[int], default: None)

Dimensionality of the encoder layers and the pooler layer.

If None, the head falls back to Config.hidden_size.

dropout (float, default: 0.0)

The dropout ratio for the hidden states.

transform (Optional[str], default: \"nonlinear\")

The transform operation applied to hidden states.

transform_act (Optional[str], default: \"gelu\")

The activation function of the transform applied to hidden states.

bias (bool, default: True)

Whether to apply bias to the final prediction layer.

act (Optional[str], default: None)

The activation function of the final prediction output.

layer_norm_eps (float, default: 1e-12)

The epsilon used by the layer normalization layers.

output_name (Optional[str], default: None)

The name of the tensor required in model outputs.

If None, the default output name of the corresponding head is used.

Source code in multimolecule/module/heads/config.py Python
class MaskedLMHeadConfig(BaseHeadConfig):\n    r\"\"\"\n    Configuration class for a Masked Language Modeling head.\n\n    Args:\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n\n            Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n        dropout:\n            The dropout ratio for the hidden states.\n        transform:\n            The transform operation applied to hidden states.\n        transform_act:\n            The activation function of transform applied to hidden states.\n        bias:\n            Whether to apply bias to the final prediction layer.\n        act:\n            The activation function of the final prediction output.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        output_name:\n            The name of the tensor required in model outputs.\n\n            If is `None`, will use the default output name of the corresponding head.\n    \"\"\"\n\n    hidden_size: Optional[int] = None\n    dropout: float = 0.0\n    transform: Optional[str] = \"nonlinear\"\n    transform_act: Optional[str] = \"gelu\"\n    bias: bool = True\n    act: Optional[str] = None\n    layer_norm_eps: float = 1e-12\n    output_name: Optional[str] = None\n
"},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(dropout)","title":"dropout","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(transform)","title":"transform","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(transform_act)","title":"transform_act","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(bias)","title":"bias","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(act)","title":"act","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(output_name)","title":"output_name","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.PreTrainedConfig","title":"PreTrainedConfig","text":"

Bases: PretrainedConfig

Base class for all model configuration classes.

Source code in multimolecule/models/configuration_utils.py Python
class PreTrainedConfig(PretrainedConfig):\n    r\"\"\"\n    Base class for all model configuration classes.\n    \"\"\"\n\n    head: HeadConfig | None\n    num_labels: int = 1\n\n    hidden_size: int\n\n    pad_token_id: int = 0\n    bos_token_id: int = 1\n    eos_token_id: int = 2\n    unk_token_id: int = 3\n    mask_token_id: int = 4\n    null_token_id: int = 5\n\n    def __init__(\n        self,\n        pad_token_id: int = 0,\n        bos_token_id: int = 1,\n        eos_token_id: int = 2,\n        unk_token_id: int = 3,\n        mask_token_id: int = 4,\n        null_token_id: int = 5,\n        num_labels: int = 1,\n        **kwargs,\n    ):\n        super().__init__(\n            pad_token_id=pad_token_id,\n            bos_token_id=bos_token_id,\n            eos_token_id=eos_token_id,\n            unk_token_id=unk_token_id,\n            mask_token_id=mask_token_id,\n            null_token_id=null_token_id,\n            num_labels=num_labels,\n            **kwargs,\n        )\n\n    def to_dict(self):\n        output = super().to_dict()\n        for k, v in output.items():\n            if hasattr(v, \"to_dict\"):\n                output[k] = v.to_dict()\n            if is_dataclass(v):\n                output[k] = asdict(v)\n        return output\n
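
The to_dict override above converts nested configuration objects into plain dictionaries so that the result can be serialized. The following is a minimal standalone sketch of that idea, not the library code: ToyHeadConfig and flatten are hypothetical names introduced for illustration, and the sketch builds a new dict instead of updating one in place.

```python
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class ToyHeadConfig:
    # Hypothetical stand-in for a head configuration dataclass.
    dropout: float = 0.0
    bias: bool = True


def flatten(values: dict) -> dict:
    # Convert nested config objects to plain dicts, using the same two checks as to_dict.
    output = {}
    for key, value in values.items():
        if is_dataclass(value) and not isinstance(value, type):
            output[key] = asdict(value)  # dataclass instance -> dict
        elif hasattr(value, 'to_dict'):
            output[key] = value.to_dict()  # object exposing its own to_dict
        else:
            output[key] = value  # already a plain value
    return output


config = {'hidden_size': 768, 'head': ToyHeadConfig(dropout=0.1)}
flat = flatten(config)
print(flat)
```

After this step every value in flat is a plain Python object, which is what a JSON serializer expects.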
"},{"location":"models/ernierna/","title":"ERNIE-RNA","text":"

Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.

"},{"location":"models/ernierna/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations by Weijie Yin, Zhaoyu Zhang, Liang He, et al.

The OFFICIAL repository of ERNIE-RNA is at Bruce-ywj/ERNIE-RNA.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team that released ERNIE-RNA did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/ernierna/#model-details","title":"Model Details","text":"

ERNIE-RNA is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/ernierna/#variations","title":"Variations","text":""},{"location":"models/ernierna/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 12 768 12 3072 85.67 22.36 11.17 1024"},{"location":"models/ernierna/#links","title":"Links","text":""},{"location":"models/ernierna/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/ernierna/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/ernierna\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.32839149236679077,\n  'token': 6,\n  'token_str': 'A',\n  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.3044775426387787,\n  'token': 9,\n  'token_str': 'U',\n  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.09914574027061462,\n  'token': 7,\n  'token_str': 'C',\n  'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.09502048045396805,\n  'token': 24,\n  'token_str': '-',\n  'sequence': 'G G U C - C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.06993662565946579,\n  'token': 21,\n  'token_str': '.',\n  'sequence': 'G G U C. C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/ernierna/#downstream-use","title":"Downstream Use","text":""},{"location":"models/ernierna/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, ErnieRnaModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaModel.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/ernierna/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, ErnieRnaForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaForSequencePrediction.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/ernierna/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, ErnieRnaForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaForTokenPrediction.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/ernierna/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, ErnieRnaForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaForContactPrediction.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/ernierna/#training-details","title":"Training Details","text":"

ERNIE-RNA used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.

"},{"location":"models/ernierna/#training-data","title":"Training Data","text":"

The ERNIE-RNA model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.

ERNIE-RNA applied CD-HIT (CD-HIT-EST) with a cut-off at 100% sequence identity to remove redundancy from RNAcentral, resulting in 25 million unique sequences. Sequences longer than 1024 nucleotides were subsequently excluded. The final dataset contains 20.4 million non-redundant RNA sequences. ERNIE-RNA preprocessed all tokens by replacing \u201cT\u201ds with \u201cU\u201ds.

Note that RnaTokenizer will convert \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False.
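
The filtering steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the ERNIE-RNA pipeline: exact-match deduplication stands in for CD-HIT-EST at 100% identity, normalization is applied before deduplication for simplicity, and preprocess and MAX_LENGTH are hypothetical names.

```python
MAX_LENGTH = 1024  # length cut-off in nucleotides, taken from the text above


def preprocess(sequences, max_length=MAX_LENGTH):
    # Upper-case, replace T with U, drop exact duplicates, then drop long sequences.
    seen = set()
    kept = []
    for sequence in sequences:
        normalized = sequence.upper().replace('T', 'U')
        if normalized in seen:
            continue  # exact duplicate of an earlier sequence
        seen.add(normalized)
        if len(normalized) <= max_length:
            kept.append(normalized)
    return kept


raw = ['ACGT', 'acgu', 'GGCC', 'A' * 1025]
clean = preprocess(raw)
print(clean)
```

Here 'acgu' collapses onto 'ACGT' after normalization, and the 1025-nucleotide sequence is dropped by the length filter.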

"},{"location":"models/ernierna/#training-procedure","title":"Training Procedure","text":""},{"location":"models/ernierna/#preprocessing","title":"Preprocessing","text":"

ERNIE-RNA used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
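
A minimal sketch of BERT-style masking follows. It assumes the standard BERT recipe (of the selected tokens, 80% become the mask token, 10% become a random token, and 10% are left unchanged); whether ERNIE-RNA uses exactly these ratios is an assumption here. The names mask_tokens, VOCAB, and MASK are hypothetical, and random replacements are drawn from the four canonical nucleotides only for simplicity.

```python
import random

VOCAB = ['A', 'C', 'G', 'U']  # simplification: canonical nucleotides only
MASK = '<mask>'


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    # Returns (corrupted tokens, labels); a label is None where no prediction is required.
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(VOCAB))  # 10%: replace with a random token
        else:
            corrupted.append(token)  # 10%: keep the original token
    return corrupted, labels


sequence = list('UAGCUUAUCAGACUGAUGUUGA')
corrupted, labels = mask_tokens(sequence)
```

Positions whose label is None are left untouched and do not contribute to the loss; only the selected positions are scored.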

"},{"location":"models/ernierna/#pretraining","title":"PreTraining","text":"

The model was trained on 24 NVIDIA V100 GPUs, each with 32 GiB of memory.

"},{"location":"models/ernierna/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article {Yin2024.03.17.585376,\n    author = {Yin, Weijie and Zhang, Zhaoyu and He, Liang and Jiang, Rui and Zhang, Shuo and Liu, Gan and Zhang, Xuegong and Qin, Tao and Xie, Zhen},\n    title = {ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations},\n    elocation-id = {2024.03.17.585376},\n    year = {2024},\n    doi = {10.1101/2024.03.17.585376},\n    publisher = {Cold Spring Harbor Laboratory},\n    abstract = {With large amounts of unlabeled RNA sequences data produced by high-throughput sequencing technologies, pre-trained RNA language models have been developed to estimate semantic space of RNA molecules, which facilities the understanding of grammar of RNA language. However, existing RNA language models overlook the impact of structure when modeling the RNA semantic space, resulting in incomplete feature extraction and suboptimal performance across various downstream tasks. In this study, we developed a RNA pre-trained language model named ERNIE-RNA (Enhanced Representations with base-pairing restriction for RNA modeling) based on a modified BERT (Bidirectional Encoder Representations from Transformers) by incorporating base-pairing restriction with no MSA (Multiple Sequence Alignment) information. We found that the attention maps from ERNIE-RNA with no fine-tuning are able to capture RNA structure in the zero-shot experiment more precisely than conventional methods such as fine-tuned RNAfold and RNAstructure, suggesting that the ERNIE-RNA can provide comprehensive RNA structural representations. Furthermore, ERNIE-RNA achieved SOTA (state-of-the-art) performance after fine-tuning for various downstream tasks, including RNA structural and functional predictions. In summary, our ERNIE-RNA model provides general features which can be widely and effectively applied in various subsequent research tasks. 
Our results indicate that introducing key knowledge-based prior information in the BERT framework may be a useful strategy to enhance the performance of other language models.Competing Interest StatementOne patent based on the study was submitted by Z.X. and W.Y., which is entitled as \"A Pre-training Approach for RNA Sequences and Its Applications\"(application number, no 202410262527.5). The remaining authors declare no competing interests.},\n    URL = {https://www.biorxiv.org/content/early/2024/03/17/2024.03.17.585376},\n    eprint = {https://www.biorxiv.org/content/early/2024/03/17/2024.03.17.585376.full.pdf},\n    journal = {bioRxiv}\n}\n
"},{"location":"models/ernierna/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the ERNIE-RNA paper for questions or comments on the paper/model.

"},{"location":"models/ernierna/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna","title":"multimolecule.models.ernierna","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None): Alphabet to use for tokenization. Default: None.

nmers (int): Size of kmer to tokenize. Default: 1.

codon (bool): Whether to tokenize into codons. Default: False.

replace_T_with_U (bool): Whether to replace T with U. Default: True.

do_upper_case (bool): Whether to convert input to uppercase. Default: True.

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
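
The splitting rules in _tokenize can be restated as a standalone function to show how overlapping k-mers differ from codons. This is a sketch that returns string tokens rather than ids; split_sequence is a hypothetical name, not part of the multimolecule API.

```python
def split_sequence(text, nmers=1, codon=False, replace_T_with_U=True, do_upper_case=True):
    # Standalone restatement of the splitting rules; returns string tokens, not ids.
    if do_upper_case:
        text = text.upper()
    if replace_T_with_U:
        text = text.replace('T', 'U')
    if codon:
        if len(text) % 3 != 0:
            raise ValueError(f'length must be a multiple of 3 for codon tokenization, but got {len(text)}')
        # Codons: non-overlapping windows of 3 (stride 3).
        return [text[i:i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # k-mers: overlapping windows of size nmers (stride 1).
        return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


single = split_sequence('acgt')
kmers = split_sequence('uagcuuauc', nmers=3)
codons = split_sequence('uagcuuauc', codon=True)
```

For the 9-nucleotide input, the stride-1 windows give 7 tokens and the stride-3 windows give 3, matching the 7 and 3 non-special ids in the examples above.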
"},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig","title":"ErnieRnaConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of an ErnieRnaModel. It is used to instantiate an ErnieRna model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ErnieRna Bruce-ywj/ERNIE-RNA architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int): Vocabulary size of the ErnieRna model. Defines the number of different tokens that can be represented by the input_ids passed when calling ErnieRnaModel. Default: 26.

hidden_size (int): Dimensionality of the encoder layers and the pooler layer. Default: 768.

num_hidden_layers (int): Number of hidden layers in the Transformer encoder. Default: 12.

num_attention_heads (int): Number of attention heads for each attention layer in the Transformer encoder. Default: 12.

intermediate_size (int): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder. Default: 3072.

hidden_dropout (float): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. Default: 0.1.

attention_dropout (float): The dropout ratio for the attention probabilities. Default: 0.1.

max_position_embeddings (int): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Default: 1026.

initializer_range (float): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. Default: 0.02.

layer_norm_eps (float): The epsilon used by the layer normalization layers. Default: 1e-12.

Examples:

Python Console Session
>>> from multimolecule import ErnieRnaModel, ErnieRnaConfig\n>>> # Initializing a ERNIE-RNA multimolecule/ernierna style configuration\n>>> configuration = ErnieRnaConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/ernierna style configuration\n>>> model = ErnieRnaModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/ernierna/configuration_ernierna.py Python
class ErnieRnaConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a\n    [`ErnieRnaModel`][multimolecule.models.ErnieRnaModel]. It is used to instantiate a ErnieRna model according to the\n    specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a\n    similar configuration to that of the ErnieRna [Bruce-ywj/ERNIE-RNA](https://github.com/Bruce-ywj/ERNIE-RNA)\n    architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the ErnieRna model. Defines the number of different tokens that can be represented by\n            the `inputs_ids` passed when calling [`ErnieRnaModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n\n    Examples:\n        >>> from multimolecule import ErnieRnaModel, ErnieRnaConfig\n        >>> # Initializing a ERNIE-RNA multimolecule/ernierna style configuration\n        >>> configuration = ErnieRnaConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/ernierna style configuration\n        >>> model = ErnieRnaModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"ernierna\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        hidden_size: int = 768,\n        num_hidden_layers: int = 12,\n        num_attention_heads: int = 12,\n        intermediate_size: int = 3072,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1026,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"sinusoidal\",\n        pairwise_alpha: float = 0.8,\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.type_vocab_size = 2\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = hidden_dropout\n        
self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.pairwise_alpha = pairwise_alpha\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
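
The constructor above accepts head settings as a plain dict and turns them into a config object. The sketch below illustrates that pattern together with one possible way a head could fall back to the model-level hidden_size when its own value is None. ToyLMHeadConfig, ToyConfig, and resolve_hidden_size are hypothetical names, and the fallback logic is an assumption rather than the library implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyLMHeadConfig:
    # Hypothetical stand-in with a subset of the documented head fields.
    hidden_size: Optional[int] = None
    dropout: float = 0.0
    transform: Optional[str] = 'nonlinear'


class ToyConfig:
    def __init__(self, hidden_size=768, lm_head=None):
        self.hidden_size = hidden_size
        # A dict of overrides becomes a typed config object; None stays None.
        self.lm_head = ToyLMHeadConfig(**lm_head) if lm_head is not None else None


def resolve_hidden_size(config):
    # Prefer the head-level value; otherwise fall back to the model-level one.
    head = config.lm_head
    if head is not None and head.hidden_size is not None:
        return head.hidden_size
    return config.hidden_size


default_size = resolve_hidden_size(ToyConfig(lm_head={'dropout': 0.1}))
custom_size = resolve_hidden_size(ToyConfig(lm_head={'hidden_size': 256}))
```

Passing overrides as a dict keeps the configuration serializable while still giving the head typed defaults.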
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForContactClassification","title":"ErnieRnaForContactClassification","text":"

Bases: ErnieRnaForPreTraining

Examples:

Python Console Session
>>> from multimolecule.models import ErnieRnaConfig, ErnieRnaForContactClassification, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForContactClassification(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaForContactClassification(ErnieRnaForPreTraining):\n    \"\"\"\n    Examples:\n        >>> from multimolecule.models import ErnieRnaConfig, ErnieRnaForContactClassification, RnaTokenizer\n        >>> config = ErnieRnaConfig()\n        >>> model = ErnieRnaForContactClassification(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n    \"\"\"\n\n    def __init__(self, config: ErnieRnaConfig):\n        super().__init__(config)\n        self.ss_head = ErnieRnaContactClassificationHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(  # type: ignore[override]  # pylint: disable=W0221\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels_lm: Tensor | None = None,\n        labels_ss: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_attention_biases: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ErnieRnaForContactClassificationOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        outputs = self.ernierna(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_attention_biases=output_attention_biases,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output_lm = self.lm_head(outputs, labels_lm)\n        output_ss = self.ss_head(outputs[-1][-1], attention_mask, input_ids, labels_ss)\n        logits_lm, loss_lm = output_lm.logits, output_lm.loss\n        logits_ss, loss_ss = output_ss.logits, output_ss.loss\n\n        loss = None\n        if loss_lm is not None and loss_ss is not None:\n            loss = loss_lm + loss_ss\n        elif loss_lm is not None:\n            loss = loss_lm\n        elif loss_ss is not None:\n            loss = loss_ss\n\n        if not return_dict:\n            output = outputs[2:]\n            output = ((logits_ss, loss_ss) + output) if loss_ss is not None else ((logits_ss,) + output)\n            output = ((logits_lm, loss_lm) + output) if loss_lm is not None else ((logits_lm,) + output)\n            return ((loss,) + output) if loss is not None else output\n\n        return ErnieRnaForContactClassificationOutput(\n            loss=loss,\n            logits_lm=logits_lm,\n            loss_lm=loss_lm,\n            logits_ss=logits_ss,\n            loss_ss=loss_ss,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n            attention_biases=outputs.attention_biases,\n        )\n
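
The loss handling in forward above reduces to summing whichever of the two losses is present. A compact sketch of that rule follows, with plain floats standing in for tensors; combine_losses is a hypothetical helper, not a function in multimolecule.

```python
from typing import Optional


def combine_losses(loss_lm: Optional[float], loss_ss: Optional[float]) -> Optional[float]:
    # Sum the losses that are present; return None when neither head received labels.
    present = [loss for loss in (loss_lm, loss_ss) if loss is not None]
    return sum(present) if present else None


both = combine_losses(0.5, 0.25)
only_ss = combine_losses(None, 0.25)
neither = combine_losses(None, None)
```

Returning None rather than zero when no labels are given lets callers distinguish inference from training.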
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForContactPrediction","title":"ErnieRnaForContactPrediction","text":"

Bases: ErnieRnaPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import ErnieRnaConfig, ErnieRnaForContactPrediction, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaForContactPrediction(ErnieRnaPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import ErnieRnaConfig, ErnieRnaForContactPrediction, RnaTokenizer\n        >>> config = ErnieRnaConfig()\n        >>> model = ErnieRnaForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: ErnieRnaConfig):\n        super().__init__(config)\n        self.ernierna = ErnieRnaModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ErnieRnaContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.ernierna(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ErnieRnaContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForMaskedLM","title":"ErnieRnaForMaskedLM","text":"

Bases: ErnieRnaPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import ErnieRnaConfig, ErnieRnaForMaskedLM, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaForMaskedLM(ErnieRnaPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import ErnieRnaConfig, ErnieRnaForMaskedLM, RnaTokenizer\n        >>> config = ErnieRnaConfig()\n        >>> model = ErnieRnaForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: ErnieRnaConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `BertForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.ernierna = ErnieRnaModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.lm_head.decoder = new_embeddings\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_attention_biases: bool | None = None,\n        output_hidden_states: bool | None = None,\n   
     return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | ErnieRnaForMaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.ernierna(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_attention_biases=output_attention_biases,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ErnieRnaForMaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForSequencePrediction","title":"ErnieRnaForSequencePrediction","text":"

Bases: ErnieRnaPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import ErnieRnaConfig, ErnieRnaForSequencePrediction, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaForSequencePrediction(ErnieRnaPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import ErnieRnaConfig, ErnieRnaForSequencePrediction, RnaTokenizer\n        >>> config = ErnieRnaConfig()\n        >>> model = ErnieRnaForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n    \"\"\"\n\n    def __init__(self, config: ErnieRnaConfig):\n        super().__init__(config)\n        self.ernierna = ErnieRnaModel(config)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_attention_biases: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ErnieRnaSequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.ernierna(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_attention_biases=output_attention_biases,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ErnieRnaSequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForTokenPrediction","title":"ErnieRnaForTokenPrediction","text":"

Bases: ErnieRnaPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import ErnieRnaConfig, ErnieRnaForTokenPrediction, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaForTokenPrediction(ErnieRnaPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import ErnieRnaConfig, ErnieRnaForTokenPrediction, RnaTokenizer\n        >>> config = ErnieRnaConfig()\n        >>> model = ErnieRnaForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: ErnieRnaConfig):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.ernierna = ErnieRnaModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_attention_biases: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ErnieRnaTokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.ernierna(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_attention_biases=output_attention_biases,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ErnieRnaTokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel","title":"ErnieRnaModel","text":"

Bases: ErnieRnaPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import ErnieRnaConfig, ErnieRnaModel, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaModel(ErnieRnaPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import ErnieRnaConfig, ErnieRnaModel, RnaTokenizer\n        >>> config = ErnieRnaConfig()\n        >>> model = ErnieRnaModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 768])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 768])\n    \"\"\"\n\n    pairwise_bias_map: Tensor\n\n    def __init__(\n        self, config: ErnieRnaConfig, add_pooling_layer: bool = True, tokenizer: PreTrainedTokenizer | None = None\n    ):\n        super().__init__(config)\n        if tokenizer is None:\n            tokenizer = AutoTokenizer.from_pretrained(\"multimolecule/rna\")\n        self.tokenizer = tokenizer\n        self.pad_token_id = tokenizer.pad_token_id\n        self.vocab_size = len(self.tokenizer)\n        if self.vocab_size != config.vocab_size:\n            raise ValueError(\n                f\"Vocab size in tokenizer ({self.vocab_size}) does not match the one in config ({config.vocab_size})\"\n            )\n        token_to_ids = self.tokenizer._token_to_id\n        tokens = sorted(token_to_ids, key=token_to_ids.get)\n        pairwise_bias_dict = get_pairwise_bias_dict(config.pairwise_alpha)\n        self.register_buffer(\n            \"pairwise_bias_map\",\n            torch.tensor([[pairwise_bias_dict.get(f\"{i}{j}\", 0) for i in tokens] for j in tokens]),\n            persistent=False,\n        )\n        self.pairwise_bias_proj = nn.Sequential(\n            nn.Linear(1, config.num_attention_heads // 2),\n            nn.GELU(),\n            nn.Linear(config.num_attention_heads // 2, config.num_attention_heads),\n        )\n        self.embeddings = ErnieRnaEmbeddings(config)\n        self.encoder = ErnieRnaEncoder(config)\n       
 self.pooler = ErnieRnaPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def get_pairwise_bias(\n        self, input_ids: Tensor | NestedTensor, attention_mask: Tensor | NestedTensor | None = None\n    ) -> Tensor | NestedTensor:\n        batch_size, seq_len = input_ids.shape\n\n        # Broadcasting data indices to compute indices\n        data_index_x = input_ids.unsqueeze(2).expand(batch_size, seq_len, seq_len)\n        data_index_y = input_ids.unsqueeze(1).expand(batch_size, seq_len, seq_len)\n\n        # Get bias from pairwise_bias_map\n        return self.pairwise_bias_map[data_index_x, data_index_y]\n\n        # Zhiyuan: Is it really necessary to mask the bias?\n        # The mask position should have been nan, and the implementation is incorrect anyway\n        # if attention_mask is not None:\n        #     attention_mask = attention_mask.unsqueeze(1).expand(batch_size, seq_len, seq_len)\n        #     bias = bias * attention_mask\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: 
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_attention_biases: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | ErnieRnaModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        pairwise_bias = self.get_pairwise_bias(input_ids, attention_mask)\n        attention_bias = self.pairwise_bias_proj(pairwise_bias.unsqueeze(-1)).transpose(1, 3)\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            
self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we 
keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            attention_bias=attention_bias,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_attention_biases=output_attention_biases,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return ErnieRnaModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attention_biases=encoder_outputs.attention_biases,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
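The pairwise bias lookup in get_pairwise_bias and its projection to per-head attention biases can be sketched in isolation as follows. This is a minimal illustration, not the library code: the 4-token vocabulary, the pair scores, and num_heads are hypothetical placeholders, whereas the model builds its table from get_pairwise_bias_dict(config.pairwise_alpha) over the full tokenizer vocabulary.

```python
import torch
from torch import nn

# Hypothetical vocabulary and pair scores, for illustration only.
tokens = ['A', 'C', 'G', 'U']
pair_scores = {'AU': 2.0, 'UA': 2.0, 'CG': 3.0, 'GC': 3.0, 'GU': 0.8, 'UG': 0.8}

# Square lookup table indexed by token id, mirroring the nested comprehension
# used to register pairwise_bias_map in the source above.
bias_map = torch.tensor([[pair_scores.get(i + j, 0.0) for i in tokens] for j in tokens])

input_ids = torch.tensor([[0, 1, 2, 3]])  # the sequence A C G U
batch_size, seq_len = input_ids.shape

# Broadcast token ids along rows and columns, then gather one bias per position pair.
index_x = input_ids.unsqueeze(2).expand(batch_size, seq_len, seq_len)
index_y = input_ids.unsqueeze(1).expand(batch_size, seq_len, seq_len)
pairwise_bias = bias_map[index_x, index_y]  # (batch, seq, seq)

# Project the scalar bias to one value per attention head; num_heads is an assumed value.
num_heads = 12
proj = nn.Sequential(
    nn.Linear(1, num_heads // 2),
    nn.GELU(),
    nn.Linear(num_heads // 2, num_heads),
)
attention_bias = proj(pairwise_bias.unsqueeze(-1)).transpose(1, 3)  # (batch, heads, seq, seq)
```

The lookup depends only on token identities, so it needs no learned parameters; only the small projection is trained.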
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_attention_biases: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | ErnieRnaModelOutputWithPoolingAndCrossAttentions\n

Parameters:

encoder_hidden_states (Tensor | None, default: None)

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask (Tensor | None, default: None)

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.

past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default: None)

Tuple of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head).

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

use_cache (bool | None, default: None)

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_attention_biases: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | ErnieRnaModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    pairwise_bias = self.get_pairwise_bias(input_ids, attention_mask)\n    attention_bias = self.pairwise_bias_proj(pairwise_bias.unsqueeze(-1)).transpose(1, 3)\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape 
= inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = 
self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        position_ids=position_ids,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        attention_bias=attention_bias,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_attention_biases=output_attention_biases,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return ErnieRnaModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attention_biases=encoder_outputs.attention_biases,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
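When no attention_mask is passed, the forward method above derives one by comparing input_ids against the padding token id. A minimal sketch of that fallback, assuming a hypothetical pad_token_id of 0 and made-up token ids (the model reads the real value from its tokenizer):

```python
import torch

pad_token_id = 0  # hypothetical value, for illustration only

# Two sequences; the second is one token shorter and padded at the end.
input_ids = torch.tensor([[1, 6, 7, 8, 2], [1, 6, 7, 2, 0]])

# True marks a real token, False marks padding.
attention_mask = input_ids.ne(pad_token_id)
```

Passing an explicit attention_mask is still the safer choice when inputs are padded, since the fallback assumes the padding id never appears as a real token.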
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaPreTrainedModel","title":"ErnieRnaPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = ErnieRnaConfig\n    base_model_prefix = \"ernierna\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"ErnieRnaLayer\", \"ErnieRnaEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/modeling_outputs/","title":"modeling_outputs","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs","title":"multimolecule.models.modeling_outputs","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput","title":"SequencePredictorOutput dataclass","text":"

Bases: ModelOutput

Base class for outputs of sentence classification & regression models.

Parameters:

loss (FloatTensor | None, default None): torch.FloatTensor of shape (1,). Optional, returned when labels is provided.

logits (FloatTensor, default None): torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels). Prediction outputs.

hidden_states (Tuple[FloatTensor, ...] | None, default None): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

attentions (Tuple[FloatTensor, ...] | None, default None): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Optional, returned when output_attentions=True is passed or when config.output_attentions=True. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Source code in multimolecule/models/modeling_outputs.py Python
@dataclass\nclass SequencePredictorOutput(ModelOutput):\n    \"\"\"\n    Base class for outputs of sentence classification & regression models.\n\n    Args:\n        loss:\n            `torch.FloatTensor` of shape `(1,)`.\n\n            Optional, returned when `labels` is provided\n        logits:\n            `torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`\n\n            Prediction outputs.\n        hidden_states:\n            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +\n            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.\n\n            Optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`\n\n            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.\n        attentions:\n            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,\n            sequence_length)`.\n\n            Optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`\n\n            Attention weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n    \"\"\"\n\n    loss: torch.FloatTensor | None = None\n    logits: torch.FloatTensor = None\n    hidden_states: Tuple[torch.FloatTensor, ...] | None = None\n    attentions: Tuple[torch.FloatTensor, ...] | None = None\n
"},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(loss)","title":"loss","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(logits)","title":"logits","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(hidden_states)","title":"hidden_states","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(attentions)","title":"attentions","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput","title":"TokenPredictorOutput dataclass","text":"

Bases: ModelOutput

Base class for outputs of token classification & regression models.

Parameters:

loss (FloatTensor | None, default None): torch.FloatTensor of shape (1,). Optional, returned when labels is provided.

logits (FloatTensor, default None): torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels). Prediction outputs.

hidden_states (Tuple[FloatTensor, ...] | None, default None): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

attentions (Tuple[FloatTensor, ...] | None, default None): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Optional, returned when output_attentions=True is passed or when config.output_attentions=True. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Source code in multimolecule/models/modeling_outputs.py Python
@dataclass\nclass TokenPredictorOutput(ModelOutput):\n    \"\"\"\n    Base class for outputs of token classification & regression models.\n\n    Args:\n        loss:\n            `torch.FloatTensor` of shape `(1,)`.\n\n            Optional, returned when `labels` is provided\n        logits:\n            `torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`\n\n            Prediction outputs.\n        hidden_states:\n            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +\n            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.\n\n            Optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`\n\n            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.\n        attentions:\n            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,\n            sequence_length)`.\n\n            Optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`\n\n            Attention weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n    \"\"\"\n\n    loss: torch.FloatTensor | None = None\n    logits: torch.FloatTensor = None\n    hidden_states: Tuple[torch.FloatTensor, ...] | None = None\n    attentions: Tuple[torch.FloatTensor, ...] | None = None\n
"},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(loss)","title":"loss","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(logits)","title":"logits","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(hidden_states)","title":"hidden_states","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(attentions)","title":"attentions","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput","title":"ContactPredictorOutput dataclass","text":"

Bases: ModelOutput

Base class for outputs of contact classification & regression models.

Parameters:

loss (FloatTensor | None, default None): torch.FloatTensor of shape (1,). Optional, returned when labels is provided.

logits (FloatTensor, default None): torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels). Prediction outputs.

hidden_states (Tuple[FloatTensor, ...] | None, default None): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

attentions (Tuple[FloatTensor, ...] | None, default None): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Optional, returned when output_attentions=True is passed or when config.output_attentions=True. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Source code in multimolecule/models/modeling_outputs.py Python
@dataclass\nclass ContactPredictorOutput(ModelOutput):\n    \"\"\"\n    Base class for outputs of contact classification & regression models.\n\n    Args:\n        loss:\n            `torch.FloatTensor` of shape `(1,)`.\n\n            Optional, returned when `labels` is provided\n        logits:\n            `torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`\n\n            Prediction outputs.\n        hidden_states:\n            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +\n            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.\n\n            Optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`\n\n            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.\n        attentions:\n            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,\n            sequence_length)`.\n\n            Optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`\n\n            Attention weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n    \"\"\"\n\n    loss: torch.FloatTensor | None = None\n    logits: torch.FloatTensor = None\n    hidden_states: Tuple[torch.FloatTensor, ...] | None = None\n    attentions: Tuple[torch.FloatTensor, ...] | None = None\n
"},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(loss)","title":"loss","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(logits)","title":"logits","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(hidden_states)","title":"hidden_states","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(attentions)","title":"attentions","text":""},{"location":"models/rinalmo/","title":"RiNALMo","text":"

Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.

"},{"location":"models/rinalmo/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks by Rafael Josip Peni\u0107 et al.

The OFFICIAL repository of RiNALMo is at lbcb-sci/RiNALMo.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team releasing RiNALMo did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/rinalmo/#model-details","title":"Model Details","text":"

RiNALMo is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/rinalmo/#model-specification","title":"Model Specification","text":"Num Layers: 33; Hidden Size: 1280; Num Heads: 20; Intermediate Size: 5120; Num Parameters (M): 650.88; FLOPs (G): 168.92; MACs (G): 84.43; Max Num Tokens: 1022"},{"location":"models/rinalmo/#links","title":"Links","text":""},{"location":"models/rinalmo/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/rinalmo/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rinalmo\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.3932918310165405,\n  'token': 6,\n  'token_str': 'A',\n  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.2897723913192749,\n  'token': 9,\n  'token_str': 'U',\n  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.15423105657100677,\n  'token': 22,\n  'token_str': 'X',\n  'sequence': 'G G U C X C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.12160095572471619,\n  'token': 7,\n  'token_str': 'C',\n  'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.0408296100795269,\n  'token': 8,\n  'token_str': 'G',\n  'sequence': 'G G U C G C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rinalmo/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rinalmo/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, RiNALMoModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoModel.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rinalmo/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RiNALMoForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoForSequencePrediction.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rinalmo/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RiNALMoForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoForTokenPrediction.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rinalmo/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RiNALMoForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoForContactPrediction.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rinalmo/#training-details","title":"Training Details","text":"

RiNALMo used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.

"},{"location":"models/rinalmo/#training-data","title":"Training Data","text":"

The RiNALMo model was pre-trained on a cocktail of databases including RNAcentral, Rfam, Ensembl Genome Browser, and Nucleotide. The training data contains 36 million unique ncRNA sequences.

To ensure sequence diversity in each training batch, RiNALMo clustered the sequences with MMSeqs2 into 17 million clusters and then sampled each sequence in the batch from a different cluster.
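The cluster-aware sampling described above can be sketched in plain Python. This is only an illustration of the idea, not the RiNALMo data loader: the helper name `sample_batch`, the `clusters` mapping, and the toy sequences are all hypothetical.

```python
import random


def sample_batch(clusters, batch_size, seed=0):
    # Hypothetical helper: `clusters` maps a cluster id to its member sequences.
    # Pick `batch_size` distinct clusters, then one sequence from each,
    # so that no two sequences in the batch share a cluster.
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), batch_size)
    return [rng.choice(clusters[cluster_id]) for cluster_id in chosen]


# Toy clusters standing in for the MMSeqs2 output.
clusters = {
    'c0': ['GGUCACUC', 'GGUCACUU'],
    'c1': ['UAGCUUAUC'],
    'c2': ['ACGUACGU', 'ACGUACGA', 'ACGUACGG'],
}
print(sample_batch(clusters, batch_size=2))
```

Sampling clusters first and sequences second keeps large clusters from dominating a batch, which is the stated motivation for the clustering step.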

RiNALMo preprocessed all tokens by replacing \u201cU\u201ds with \u201cT\u201ds.

Note that during model conversion, \u201cT\u201d is replaced with \u201cU\u201d. RnaTokenizer will convert \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False.

"},{"location":"models/rinalmo/#training-procedure","title":"Training Procedure","text":""},{"location":"models/rinalmo/#preprocessing","title":"Preprocessing","text":"

RiNALMo used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT.
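A minimal sketch of BERT-style masking is given here for orientation. It assumes the standard BERT recipe (15% of positions selected; of those, 80% replaced by a mask token, 10% replaced by a random token, 10% left unchanged); the exact ratios used for RiNALMo are not stated in this section, and the helper name `mask_tokens` and the `NUCLEOTIDES` list are hypothetical.

```python
import random

NUCLEOTIDES = ['A', 'C', 'G', 'U']


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    # Hypothetical helper illustrating BERT-style masking.
    # The 80/10/10 split below is the standard BERT recipe and is an
    # assumption here, not a confirmed detail of RiNALMo pre-training.
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None marks positions that are not predicted
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = '<mask>'
        elif roll < 0.9:
            masked[i] = rng.choice(NUCLEOTIDES)
        # otherwise the token is left unchanged
    return masked, labels


masked, labels = mask_tokens(list('UAGCUUAUCAGACUGAUGUUGA'))
print(masked)
print(labels)
```

Only positions with a non-empty label contribute to the loss; every other position passes through unchanged.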

"},{"location":"models/rinalmo/#pretraining","title":"PreTraining","text":"

The model was trained on seven NVIDIA A100 GPUs, each with 80 GiB of memory.

"},{"location":"models/rinalmo/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article{penic2024rinalmo,\n  title={RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks},\n  author={Peni\u0107, Rafael Josip and Vla\u0161i\u0107, Tin and Huber, Roland G. and Wan, Yue and \u0160iki\u0107, Mile},\n  journal={arXiv preprint arXiv:2403.00043},\n  year={2024}\n}\n
"},{"location":"models/rinalmo/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the RiNALMo paper for questions or comments on the paper/model.

"},{"location":"models/rinalmo/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo","title":"multimolecule.models.rinalmo","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default None): Alphabet to use for tokenization.

nmers (int, default 1): Size of kmer to tokenize.

codon (bool, default False): Whether to tokenize into codons.

replace_T_with_U (bool, default True): Whether to replace T with U.

do_upper_case (bool, default True): Whether to convert input to uppercase.
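To show how these options interact, the sketch below mirrors only the string-splitting step of the `_tokenize` method in the source listing: `nmers` yields overlapping windows with stride 1, while `codon` yields non-overlapping triplets. The function name `split_rna` is hypothetical, and the sketch leaves out vocabulary lookup and the special tokens that the real tokenizer adds.

```python
def split_rna(text, nmers=1, codon=False, replace_T_with_U=True, do_upper_case=True):
    # Hypothetical helper: reproduces only the splitting logic,
    # not the id lookup or the special tokens of the real tokenizer.
    if do_upper_case:
        text = text.upper()
    if replace_T_with_U:
        text = text.replace('T', 'U')
    if codon:
        if len(text) % 3 != 0:
            raise ValueError('sequence length must be a multiple of 3 for codon tokenization')
        # Non-overlapping triplets (stride 3).
        return [text[i:i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # Overlapping windows (stride 1).
        return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


print(split_rna('acgt'))
print(split_rna('uagcuuauc', nmers=3))
print(split_rna('uagcuuauc', codon=True))
```

A nine-nucleotide input therefore gives seven 3-mers but only three codons, which matches the number of non-special ids in the console examples for this class.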

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig","title":"RiNALMoConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a RiNALMoModel. It is used to instantiate a RiNALMo model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RiNALMo lbcb-sci/RiNALMo architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int, default 26): Vocabulary size of the RiNALMo model. Defines the number of different tokens that can be represented by the input_ids passed when calling RiNALMoModel.

hidden_size (int, default 1280): Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, default 33): Number of hidden layers in the Transformer encoder.

num_attention_heads (int, default 20): Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, default 5120): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.

hidden_dropout (float, default 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, default 0.1): The dropout ratio for the attention probabilities.

max_position_embeddings (int, default 1024): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, default 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, default 1e-12): The epsilon used by the layer normalization layers.

position_embedding_type (str, default 'rotary'): Type of position embedding. Choose one of \"absolute\", \"relative_key\", \"relative_key_query\", \"rotary\". For positional embeddings use \"absolute\". For more information on \"relative_key\", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, default False): Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, default True): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

emb_layer_norm_before (bool, default True): Whether to apply layer normalization after embeddings but before the main stem of the network.

token_dropout (bool, default True): When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
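As a rough sanity check on how the defaults fit together, the arithmetic below derives the per-head dimension and a back-of-the-envelope weight count from them. The per-layer layout (four hidden-by-hidden attention projections and two hidden-by-intermediate feed-forward projections, with biases, normalization layers and prediction heads ignored) is an assumption made for this estimate and may not match the actual RiNALMo layers.

```python
hidden_size = 1280
num_hidden_layers = 33
num_attention_heads = 20
intermediate_size = 5120
vocab_size = 26

# Each attention head works on an equal slice of the hidden dimension.
head_dim = hidden_size // num_attention_heads

# Assumption: per layer, four hidden x hidden attention projections plus two
# hidden x intermediate feed-forward projections; biases and norms are ignored.
per_layer = 4 * hidden_size * hidden_size + 2 * hidden_size * intermediate_size
approx_params = num_hidden_layers * per_layer + vocab_size * hidden_size

print(head_dim)
print(round(approx_params / 1e6, 2))
```

Under these assumptions the estimate lands close to the 650.88 M parameters listed in the Model Specification, and the remaining gap would be accounted for by the terms the sketch ignores.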

Examples:

Python Console Session
>>> from multimolecule import RiNALMoModel, RiNALMoConfig\n>>> # Initializing a RiNALMo multimolecule/rinalmo style configuration\n>>> configuration = RiNALMoConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rinalmo style configuration\n>>> model = RiNALMoModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rinalmo/configuration_rinalmo.py Python
class RiNALMoConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`RiNALMoModel`][multimolecule.models.RiNALMoModel].\n    It is used to instantiate a RiNALMo model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the RiNALMo\n    [lbcb-sci/RiNALMo](https://github.com/lbcb-sci/RiNALMo) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the RiNALMo model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`RiNALMoModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        position_embedding_type:\n            Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n            For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n            For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n        is_decoder:\n            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n        use_cache:\n            Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n            relevant if `config.is_decoder=True`.\n        emb_layer_norm_before:\n            Whether to apply layer normalization after embeddings but before the main stem of the network.\n        token_dropout:\n            When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n    Examples:\n        >>> from multimolecule import RiNALMoModel, RiNALMoConfig\n        >>> # Initializing a RiNALMo multimolecule/rinalmo style configuration\n        >>> configuration = RiNALMoConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/rinalmo style configuration\n        >>> model = RiNALMoModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"rinalmo\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        hidden_size: int = 1280,\n        num_hidden_layers: int = 33,\n        num_attention_heads: int = 20,\n        intermediate_size: int = 5120,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1024,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"rotary\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        emb_layer_norm_before: bool = True,\n        learnable_beta: bool = True,\n        token_dropout: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = 
hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.learnable_beta = learnable_beta\n        self.token_dropout = token_dropout\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n        self.emb_layer_norm_before = emb_layer_norm_before\n
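The defaults above imply a per-head width of 1280 / 20 = 64 and a feed-forward expansion of 5120 / 1280 = 4. The sketch below only illustrates that arithmetic. It assumes the usual multi-head convention that hidden_size is split evenly across heads, and both `ToyConfig` and `head_dim` are hypothetical stand-ins, not part of multimolecule.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    # Hypothetical stand-in mirroring a few RiNALMoConfig defaults.
    hidden_size: int = 1280
    num_attention_heads: int = 20
    intermediate_size: int = 5120


def head_dim(config: ToyConfig) -> int:
    # Assumption: hidden_size is divided evenly among the attention heads.
    if config.hidden_size % config.num_attention_heads != 0:
        raise ValueError('hidden_size must be divisible by num_attention_heads')
    return config.hidden_size // config.num_attention_heads


config = ToyConfig()
print(head_dim(config))  # 64
print(config.intermediate_size // config.hidden_size)  # 4
```

A check like this is a cheap way to catch an incompatible hidden_size / num_attention_heads pair before building a model.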
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(position_embedding_type)","title":"position_embedding_type","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(is_decoder)","title":"is_decoder","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(use_cache)","title":"use_cache","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(emb_layer_norm_before)","title":"emb_layer_norm_before","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(token_dropout)","title":"token_dropout","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForCont
actPrediction","title":"RiNALMoForContactPrediction","text":"

Bases: RiNALMoPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RiNALMoConfig, RiNALMoForContactPrediction, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
class RiNALMoForContactPrediction(RiNALMoPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RiNALMoConfig, RiNALMoForContactPrediction, RnaTokenizer\n        >>> config = RiNALMoConfig()\n        >>> model = RiNALMoForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RiNALMoConfig):\n        super().__init__(config)\n        self.rinalmo = RiNALMoModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rinalmo(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
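With return_dict=False, the prediction heads return a plain tuple: the loss first (only when labels were given), then the logits, then whatever the backbone returned from index 2 onward. The sketch below replays that packing with placeholder strings instead of tensors. `pack_outputs` is a hypothetical helper, and the backbone tuple contents are illustrative, since the real entries after index 1 depend on the output flags.

```python
from typing import Optional, Tuple


def pack_outputs(logits: str, loss: Optional[str], backbone_outputs: Tuple[str, ...]) -> Tuple[str, ...]:
    # Mirrors: output = (logits,) + outputs[2:], with loss prepended only if present.
    output = (logits,) + backbone_outputs[2:]
    return ((loss,) + output) if loss is not None else output


# Illustrative backbone tuple: indices 0 and 1 are dropped by the head.
backbone = ('last_hidden_state', 'pooler_output', 'hidden_states', 'attentions')
print(pack_outputs('logits', 'loss', backbone))
# ('loss', 'logits', 'hidden_states', 'attentions')
print(pack_outputs('logits', None, backbone))
# ('logits', 'hidden_states', 'attentions')
```

Because the position of the logits shifts depending on whether a loss is present, keyed access on the returned output object is usually less error-prone than tuple indexing.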
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForMaskedLM","title":"RiNALMoForMaskedLM","text":"

Bases: RiNALMoPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RiNALMoConfig, RiNALMoForMaskedLM, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
class RiNALMoForMaskedLM(RiNALMoPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RiNALMoConfig, RiNALMoForMaskedLM, RnaTokenizer\n        >>> config = RiNALMoConfig()\n        >>> model = RiNALMoForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: RiNALMoConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `RiNALMoForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.rinalmo = RiNALMoModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rinalmo(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForSequencePrediction","title":"RiNALMoForSequencePrediction","text":"

Bases: RiNALMoPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RiNALMoConfig, RiNALMoForSequencePrediction, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
class RiNALMoForSequencePrediction(RiNALMoPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RiNALMoConfig, RiNALMoForSequencePrediction, RnaTokenizer\n        >>> config = RiNALMoConfig()\n        >>> model = RiNALMoForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RiNALMoConfig):\n        super().__init__(config)\n        self.rinalmo = RiNALMoModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rinalmo(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForTokenPrediction","title":"RiNALMoForTokenPrediction","text":"

Bases: RiNALMoPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RiNALMoConfig, RiNALMoForTokenPrediction, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
class RiNALMoForTokenPrediction(RiNALMoPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RiNALMoConfig, RiNALMoForTokenPrediction, RnaTokenizer\n        >>> config = RiNALMoConfig()\n        >>> model = RiNALMoForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RiNALMoConfig):\n        super().__init__(config)\n        self.rinalmo = RiNALMoModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rinalmo(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel","title":"RiNALMoModel","text":"

Bases: RiNALMoPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RiNALMoConfig, RiNALMoModel, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 1280])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 1280])\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
class RiNALMoModel(RiNALMoPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RiNALMoConfig, RiNALMoModel, RnaTokenizer\n        >>> config = RiNALMoConfig()\n        >>> model = RiNALMoModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 1280])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 1280])\n    \"\"\"\n\n    def __init__(self, config: RiNALMoConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = RiNALMoEmbeddings(config)\n        self.encoder = RiNALMoEncoder(config)\n        self.pooler = RiNALMoPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.num_hidden_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`\n\n                Contains precomputed key and value hidden states of the attention blocks.
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        
else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape 
[num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

encoder_hidden_states (Tensor | None, default: None)

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask (Tensor | None, default: None)

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values are in [0, 1]:

- 1 for tokens that are not masked,
- 0 for tokens that are masked.

past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default: None)

Tuple of length config.num_hidden_layers, with each inner tuple holding 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head).

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values is used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

use_cache (bool | None, default: None)

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.num_hidden_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`\n\n            Contains precomputed key and value hidden states of the attention blocks.
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        attention_mask=attention_mask,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
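The default mask handling in `forward` can be illustrated without PyTorch. The following is a minimal sketch using plain Python lists, assuming `<pad>` maps to id 0 (as in the RnaTokenizer examples on this site) and assuming the extended mask follows the common additive convention, where masked positions receive a large negative bias. The helper names `build_attention_mask` and `to_additive_mask` and the constant `-1e9` are illustrative only and are not part of multimolecule.

```python
from typing import List

PAD_TOKEN_ID = 0  # assumption: <pad> maps to 0, as in the RnaTokenizer examples


def build_attention_mask(input_ids: List[List[int]], pad_token_id: int = PAD_TOKEN_ID) -> List[List[int]]:
    """Return 1 for real tokens and 0 for padding, mirroring `input_ids.ne(pad_token_id)`."""
    return [[int(token != pad_token_id) for token in row] for row in input_ids]


def to_additive_mask(mask: List[List[int]], large_negative: float = -1e9) -> List[List[float]]:
    """Convert a 0/1 mask into additive biases: 0.0 where attended, large negative where masked."""
    return [[0.0 if keep else large_negative for keep in row] for row in mask]


batch = [
    [1, 6, 7, 8, 9, 2],  # <cls> A C G U <eos>
    [1, 6, 7, 2, 0, 0],  # <cls> A C <eos> <pad> <pad>
]
mask = build_attention_mask(batch)
additive = to_additive_mask(mask)
print(mask)
```

Adding the bias to the attention scores before the softmax drives the weight of padded positions towards zero, which is why the mask only needs to be broadcastable across heads.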
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoPreTrainedModel","title":"RiNALMoPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
class RiNALMoPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = RiNALMoConfig\n    base_model_prefix = \"rinalmo\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"RiNALMoLayer\", \"RiNALMoEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/rnabert/","title":"RNABERT","text":"

Pre-trained model on non-coding RNA (ncRNA) using masked language modeling (MLM) and structural alignment learning (SAL) objectives.

"},{"location":"models/rnabert/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of the Informative RNA-base embedding for functional RNA clustering and structural alignment by Manato Akiyama and Yasubumi Sakakibara.

The OFFICIAL repository of RNABERT is at mana438/RNABERT.

Caution

The MultiMolecule team is aware of a potential risk in reproducing the results of RNABERT.

The original implementation of RNABERT does not prepend <cls> and append <eos> tokens to the input sequence. This should not affect the performance of the model in most cases, but it can lead to unexpected behavior in some cases.

Please set cls_token=None and eos_token=None explicitly in the tokenizer if you want the exact behavior of the original implementation.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team releasing RNABERT did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/rnabert/#model-details","title":"Model Details","text":"

RNABERT is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/rnabert/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 6 120 12 40 0.48 0.15 0.08 440"},{"location":"models/rnabert/#links","title":"Links","text":""},{"location":"models/rnabert/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/rnabert/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnabert\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.03852083534002304,\n  'token': 24,\n  'token_str': '-',\n  'sequence': 'G G U C - C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.03851056098937988,\n  'token': 10,\n  'token_str': 'N',\n  'sequence': 'G G U C N C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.03849703073501587,\n  'token': 25,\n  'token_str': 'I',\n  'sequence': 'G G U C I C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.03848597779870033,\n  'token': 3,\n  'token_str': '<unk>',\n  'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.038484156131744385,\n  'token': 5,\n  'token_str': '<null>',\n  'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnabert/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnabert/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, RnaBertModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertModel.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnabert/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertForSequencePrediction.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnabert/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaBertForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertForTokenPrediction.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnabert/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaBertForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertForContactPrediction.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
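The example above uses random labels only to demonstrate shapes. As a minimal sketch of what a real contact label could look like, the snippet below builds a symmetric `len(text) x len(text)` binary matrix from a list of paired positions. It assumes contacts are given as 0-based index pairs; the `contact_matrix` helper and the listed pairs are hypothetical and chosen purely for illustration, not taken from any dataset.

```python
from typing import Iterable, List, Tuple


def contact_matrix(length: int, pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Build a symmetric 0/1 matrix with a 1 wherever two positions are in contact."""
    matrix = [[0] * length for _ in range(length)]
    for i, j in pairs:
        if not (0 <= i < length and 0 <= j < length):
            raise IndexError(f"pair ({i}, {j}) is out of range for length {length}")
        matrix[i][j] = 1
        matrix[j][i] = 1  # contacts are undirected, so mirror the entry
    return matrix


text = "UAGCUUAUCAGACUGAUGUUGA"
pairs = [(0, 21), (1, 20), (2, 19)]  # hypothetical pairs, for illustration only
labels = contact_matrix(len(text), pairs)
print(len(labels), len(labels[0]))
```

A nested list like this can be converted with `torch.tensor(labels)` before being passed as `labels`.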
"},{"location":"models/rnabert/#training-details","title":"Training Details","text":"

RNABERT has two pre-training objectives: masked language modeling (MLM) and structural alignment learning (SAL).

"},{"location":"models/rnabert/#training-data","title":"Training Data","text":"

The RNABERT model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.

RNABERT used a subset of 76,237 human ncRNA sequences from RNAcentral for pre-training. RNABERT preprocessed all tokens by replacing \u201cU\u201ds with \u201cT\u201ds.

Note that during model conversion, \u201cT\u201d is replaced with \u201cU\u201d. RnaTokenizer converts \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False.

"},{"location":"models/rnabert/#training-procedure","title":"Training Procedure","text":""},{"location":"models/rnabert/#preprocessing","title":"Preprocessing","text":"

RNABERT preprocesses the dataset by applying 10 different mask patterns to the 72,237 human ncRNA sequences. The final dataset contains 722,370 sequences. The masking procedure is similar to the one used in BERT.
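To make the "10 mask patterns per sequence" step concrete, here is a minimal sketch of BERT-style masking in plain Python. The 15% selection rate and the 80/10/10 split between mask token, random token, and unchanged token are BERT's standard defaults and are used here as an assumption; this section does not state RNABERT's exact ratios. The function names and the fixed seed are illustrative only.

```python
import random
from typing import List

ALPHABET = ["A", "C", "G", "U"]
MASK = "<mask>"


def mask_sequence(tokens: List[str], rng: random.Random, mask_prob: float = 0.15) -> List[str]:
    """Apply BERT-style corruption to one tokenized sequence."""
    corrupted = list(tokens)
    for i in range(len(corrupted)):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80% of selected positions: mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(ALPHABET)  # 10%: random base
        # remaining 10%: keep the original token
    return corrupted


def make_mask_patterns(sequence: str, num_patterns: int = 10, seed: int = 0) -> List[List[str]]:
    """Produce several independently masked copies of the same sequence."""
    rng = random.Random(seed)
    tokens = list(sequence)
    return [mask_sequence(tokens, rng) for _ in range(num_patterns)]


patterns = make_mask_patterns("UAGCUUAUCAGACUGAUGUUGA")
dataset_size = 72237 * 10  # ten patterns per sequence
print(len(patterns), dataset_size)
```

Ten patterns for each of the 72,237 sequences gives the 722,370 training sequences quoted above.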

"},{"location":"models/rnabert/#pretraining","title":"PreTraining","text":"

The model was trained on 1 NVIDIA V100 GPU.

"},{"location":"models/rnabert/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article{akiyama2022informative,\n    author = {Akiyama, Manato and Sakakibara, Yasubumi},\n    title = \"{Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning}\",\n    journal = {NAR Genomics and Bioinformatics},\n    volume = {4},\n    number = {1},\n    pages = {lqac012},\n    year = {2022},\n    month = {02},\n    abstract = \"{Effective embedding is actively conducted by applying deep learning to biomolecular information. Obtaining better embeddings enhances the quality of downstream analyses, such as DNA sequence motif detection and protein function prediction. In this study, we adopt a pre-training algorithm for the effective embedding of RNA bases to acquire semantically rich representations and apply this algorithm to two fundamental RNA sequence problems: structural alignment and clustering. By using the pre-training algorithm to embed the four bases of RNA in a position-dependent manner using a large number of RNA sequences from various RNA families, a context-sensitive embedding representation is obtained. As a result, not only base information but also secondary structure and context information of RNA sequences are embedded for each base. We call this \u2018informative base embedding\u2019 and use it to achieve accuracies superior to those of existing state-of-the-art methods on RNA structural alignment and RNA family clustering tasks. 
Furthermore, upon performing RNA sequence alignment by combining this informative base embedding with a simple Needleman\u2013Wunsch alignment algorithm, we succeed in calculating structural alignments with a time complexity of O(n2) instead of the O(n6) time complexity of the naive implementation of Sankoff-style algorithm for input RNA sequence of length n.}\",\n    issn = {2631-9268},\n    doi = {10.1093/nargab/lqac012},\n    url = {https://doi.org/10.1093/nargab/lqac012},\n    eprint = {https://academic.oup.com/nargab/article-pdf/4/1/lqac012/42577168/lqac012.pdf},\n}\n
"},{"location":"models/rnabert/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the RNABERT paper for questions or comments on the paper/model.

"},{"location":"models/rnabert/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert","title":"multimolecule.models.rnabert","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default: None): Alphabet to use for tokenization.

nmers (int, default: 1): Size of kmer to tokenize.

codon (bool, default: False): Whether to tokenize into codons.

replace_T_with_U (bool, default: True): Whether to replace T with U.

do_upper_case (bool, default: True): Whether to convert input to uppercase.

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
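The doctest examples above show token ids only. The following standalone sketch mirrors the logic of `_tokenize` in plain Python so the intermediate token strings are visible. The function `tokenize_rna` is an illustrative re-implementation, not part of the multimolecule API, and it omits the special tokens and vocabulary lookup that the real tokenizer adds.

```python
from typing import List


def tokenize_rna(
    text: str,
    nmers: int = 1,
    codon: bool = False,
    replace_T_with_U: bool = True,
    do_upper_case: bool = True,
) -> List[str]:
    """Split an RNA string into single bases, overlapping k-mers, or codons."""
    if do_upper_case:
        text = text.upper()
    if replace_T_with_U:
        text = text.replace("T", "U")
    if codon:
        if len(text) % 3 != 0:
            raise ValueError(
                f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
            )
        # non-overlapping windows of three bases
        return [text[i : i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # overlapping windows with stride 1
        return [text[i : i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


print(tokenize_rna("acgt"))
print(tokenize_rna("uagcuuauc", nmers=3))
print(tokenize_rna("uagcuuauc", codon=True))
```

For a nine-base input, overlapping 3-mers yield seven tokens while codon mode yields three, which matches the number of ids between `<cls>` and `<eos>` in the examples above. With `replace_T_with_U=False`, a "T" survives tokenization and is not in the RNA alphabet, which is consistent with the `<unk>` id (3) in the corresponding doctest.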
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig","title":"RnaBertConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of an RnaBertModel. It is used to instantiate an RnaBert model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the mana438/RNABERT architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int, default: 26): Vocabulary size of the RnaBert model. Defines the number of different tokens that can be represented by the input_ids passed when calling RnaBertModel.

hidden_size (int | None, default: None): Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, default: 6): Number of hidden layers in the Transformer encoder.

num_attention_heads (int, default: 12): Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, default: 40): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.

hidden_dropout (float, default: 0.0): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, default: 0.0): The dropout ratio for the attention probabilities.

max_position_embeddings (int, default: 440): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, default: 1e-12): The epsilon used by the layer normalization layers.

Examples:

Python Console Session
>>> from multimolecule import RnaBertModel, RnaBertConfig\n>>> # Initializing a RNABERT multimolecule/rnabert style configuration\n>>> configuration = RnaBertConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnabert style configuration\n>>> model = RnaBertModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnabert/configuration_rnabert.py Python
class RnaBertConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`RnaBertModel`][multimolecule.models.RnaBertModel].\n    It is used to instantiate a RnaBert model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaBert\n    [mana438/RNABERT](https://github.com/mana438/RNABERT) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the RnaBert model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`RnaBertModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n\n    Examples:\n        >>> from multimolecule import RnaBertModel, RnaBertConfig\n        >>> # Initializing a RNABERT multimolecule/rnabert style configuration\n        >>> configuration = RnaBertConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/rnabert style configuration\n        >>> model = RnaBertModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"rnabert\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        ss_vocab_size: int = 8,\n        hidden_size: int | None = None,\n        multiple: int | None = None,\n        num_hidden_layers: int = 6,\n        num_attention_heads: int = 12,\n        intermediate_size: int = 40,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.0,\n        attention_dropout: float = 0.0,\n        max_position_embeddings: int = 440,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"absolute\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        if hidden_size is None:\n            hidden_size = num_attention_heads * multiple if multiple is not None else 120\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.ss_vocab_size = ss_vocab_size\n        self.type_vocab_size = 2\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        
self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
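Because `hidden_size` defaults to `None`, its effective value is resolved inside `__init__`. The sketch below isolates that rule in a standalone function so the fallback order is easy to check. `resolve_hidden_size` is an illustrative name, not part of multimolecule, and the per-head dimension computed at the end assumes the usual even split of the hidden size across attention heads.

```python
from typing import Optional


def resolve_hidden_size(
    hidden_size: Optional[int] = None,
    multiple: Optional[int] = None,
    num_attention_heads: int = 12,
) -> int:
    """Mirror the fallback in RnaBertConfig.__init__: explicit value, then heads * multiple, then 120."""
    if hidden_size is None:
        hidden_size = num_attention_heads * multiple if multiple is not None else 120
    return hidden_size


default_size = resolve_hidden_size()
scaled_size = resolve_hidden_size(multiple=20)
explicit_size = resolve_hidden_size(hidden_size=64, multiple=20)
head_dim = default_size // 12  # assumption: hidden size is split evenly across heads
print(default_size, scaled_size, explicit_size, head_dim)
```

With the defaults this gives a hidden size of 120, in line with the Model Specification table, and an explicit `hidden_size` always takes precedence over `multiple`.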
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForContactPrediction","title":"RnaBertForContactPrediction","text":"

Bases: RnaBertPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaBertConfig, RnaBertForContactPrediction, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertForContactPrediction(RnaBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaBertConfig, RnaBertForContactPrediction, RnaTokenizer\n        >>> config = RnaBertConfig()\n        >>> model = RnaBertForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaBertConfig):\n        super().__init__(config)\n        self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnabert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForMaskedLM","title":"RnaBertForMaskedLM","text":"

Bases: RnaBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaBertConfig, RnaBertForMaskedLM, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertForMaskedLM(RnaBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaBertConfig, RnaBertForMaskedLM, RnaTokenizer\n        >>> config = RnaBertConfig()\n        >>> model = RnaBertForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: RnaBertConfig):\n        super().__init__(config)\n        self.rnabert = RnaBertModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool = False,\n        output_hidden_states: bool = False,\n        return_dict: bool = True,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnabert(\n            input_ids,\n            attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
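The forward methods in this module all end the same way: a structured output by default, or a plain tuple when return_dict is false, with the loss prepended only when labels were supplied. A minimal sketch of that convention in plain Python follows. SketchOutput and pack_outputs are hypothetical names used only for illustration, and backbone_extras stands in for the outputs[2:] slice in the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchOutput:
    # Hypothetical stand-in for a structured output such as MaskedLMOutput.
    loss: Optional[float]
    logits: list


def pack_outputs(loss, logits, backbone_extras=(), return_dict=True):
    # Same branching as the tail of the forward methods: a plain tuple when
    # return_dict is false, with the loss placed first only if it exists.
    if not return_dict:
        output = (logits,) + tuple(backbone_extras)
        return ((loss,) + output) if loss is not None else output
    return SketchOutput(loss=loss, logits=logits)


with_loss = pack_outputs(0.5, [0.1, 0.9], return_dict=False)
without_loss = pack_outputs(None, [0.1, 0.9], return_dict=False)
structured = pack_outputs(0.5, [0.1, 0.9])
print(with_loss)
print(without_loss)
print(structured.loss)
```

One consequence of this layout is that the position of the logits in the tuple form depends on whether labels were passed, so callers that index the tuple need to account for the optional leading loss.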
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForPreTraining","title":"RnaBertForPreTraining","text":"

Bases: RnaBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaBertConfig, RnaBertForPreTraining, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits_mlm\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"logits_ss\"].shape\ntorch.Size([1, 7, 8])\n>>> output[\"logits_sal\"].shape\ntorch.Size([1, 2])\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertForPreTraining(RnaBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaBertConfig, RnaBertForPreTraining, RnaTokenizer\n        >>> config = RnaBertConfig()\n        >>> model = RnaBertForPreTraining(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<AddBackward0>)\n        >>> output[\"logits_mlm\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"logits_ss\"].shape\n        torch.Size([1, 7, 8])\n        >>> output[\"logits_sal\"].shape\n        torch.Size([1, 2])\n    \"\"\"\n\n    _tied_weights_keys = [\n        \"lm_head.decoder.weight\",\n        \"lm_head.decoder.bias\",\n        \"pretrain.predictions.decoder.weight\",\n        \"pretrain.predictions.decoder.bias\",\n        \"pretrain.predictions_ss.decoder.weight\",\n        \"pretrain.predictions_ss.decoder.bias\",\n    ]\n\n    def __init__(self, config: RnaBertConfig):\n        super().__init__(config)\n        self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n        self.pretrain = RnaBertPreTrainingHeads(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        labels_mlm: Tensor | None = None,\n        labels_ss: Tensor | None = None,\n        labels_sal: Tensor | None = None,\n        output_attentions: bool = False,\n        output_hidden_states: bool = False,\n        return_dict: bool = True,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaBertForPreTrainingOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnabert(\n            input_ids,\n            attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        total_loss, logits_mlm, logits_ss, logits_sal = self.pretrain(\n            outputs, labels_mlm=labels_mlm, labels_ss=labels_ss, labels_sal=labels_sal\n        )\n\n        if not return_dict:\n            output = (logits_mlm, logits_ss, logits_sal) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return RnaBertForPreTrainingOutput(\n            loss=total_loss,\n            logits_mlm=logits_mlm,\n            logits_ss=logits_ss,\n            logits_sal=logits_sal,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForSequencePrediction","title":"RnaBertForSequencePrediction","text":"

Bases: RnaBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaBertConfig, RnaBertForSequencePrediction, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertForSequencePrediction(RnaBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaBertConfig, RnaBertForSequencePrediction, RnaTokenizer\n        >>> config = RnaBertConfig()\n        >>> model = RnaBertForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaBertConfig):\n        super().__init__(config)\n        self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnabert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForTokenPrediction","title":"RnaBertForTokenPrediction","text":"

Bases: RnaBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaBertConfig, RnaBertForTokenPrediction, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertForTokenPrediction(RnaBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaBertConfig, RnaBertForTokenPrediction, RnaTokenizer\n        >>> config = RnaBertConfig()\n        >>> model = RnaBertForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaBertConfig):\n        super().__init__(config)\n        self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnabert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel","title":"RnaBertModel","text":"

Bases: RnaBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaBertConfig, RnaBertModel, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 120])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 120])\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertModel(RnaBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaBertConfig, RnaBertModel, RnaTokenizer\n        >>> config = RnaBertConfig()\n        >>> model = RnaBertModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 120])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 120])\n    \"\"\"\n\n    def __init__(self, config: RnaBertConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = RnaBertEmbeddings(config)\n        self.encoder = RnaBertEncoder(config)\n        self.pooler = RnaBertPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        
else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape 
[num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

Name: encoder_hidden_states. Type: Tensor | None. Default: None.

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

Name: encoder_attention_mask. Type: Tensor | None. Default: None.

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.

Name: past_key_values. Type: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None. Default: None.

Tuple of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head).

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

Name: use_cache. Type: bool | None. Default: None.

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
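When no attention_mask is passed, the forward source above derives one from the input by comparing each token id against the padding id (input_ids.ne(self.pad_token_id)), and falls back to an all-ones mask when no padding id is configured. The sketch below is a plain-list analogue of that rule, not the library code: default_attention_mask is a hypothetical helper, the token ids are made up, 0 is only assumed to be the padding id, and the past_key_values_length term in the fallback branch is left out.

```python
def default_attention_mask(input_ids, pad_token_id):
    # Plain-list analogue of input_ids.ne(pad_token_id): True marks a real token.
    # Without a configured pad token, every position is attended to.
    if pad_token_id is None:
        return [[True] * len(row) for row in input_ids]
    return [[token != pad_token_id for token in row] for row in input_ids]


# Hypothetical token ids; 0 is assumed to be the padding id in this example.
batch = [[1, 6, 7, 8, 2], [1, 6, 2, 0, 0]]
mask = default_attention_mask(batch, pad_token_id=0)
print(mask[1])  # [True, True, True, False, False]
```

Passing an explicit attention_mask skips this inference entirely, which matters if a real token can share its id with the padding token.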
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertPreTrainedModel","title":"RnaBertPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = RnaBertConfig\n    base_model_prefix = \"rnabert\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"RnaBertLayer\", \"RnaBertEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/rnaernie/","title":"RNAErnie","text":"

Pre-trained model on non-coding RNA (ncRNA) using a multi-stage masked language modeling (MLM) objective.

"},{"location":"models/rnaernie/#statement","title":"Statement","text":"

Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.

Machine learning has been at the forefront of the movement for free and open access to research.

We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.

The MultiMolecule team is committed to the principles of open access and open science.

We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.

Please consider signing the Statement on Nature Machine Intelligence.

"},{"location":"models/rnaernie/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of the RNAErnie: An RNA Language Model with Structure-enhanced Representations by Ning Wang, Jiang Bian, Haoyi Xiong, et al.

The OFFICIAL repository of RNAErnie is at CatIIIIIIII/RNAErnie.

Warning

The MultiMolecule team is unable to confirm that the provided model and checkpoints produce the same intermediate representations as the original implementation. This is because the proposed method is published in a Closed Access / Author-Fee journal.

The team releasing RNAErnie did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/rnaernie/#model-details","title":"Model Details","text":"

RNAErnie is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

Note that during the conversion process, additional tokens such as [IND] and ncRNA class symbols are removed.

"},{"location":"models/rnaernie/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 12 768 12 3072 86.06 22.36 11.17 512"},{"location":"models/rnaernie/#links","title":"Links","text":""},{"location":"models/rnaernie/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/rnaernie/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnaernie\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.09252794831991196,\n  'token': 8,\n  'token_str': 'G',\n  'sequence': 'G G U C G C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.09062391519546509,\n  'token': 11,\n  'token_str': 'R',\n  'sequence': 'G G U C R C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.08875908702611923,\n  'token': 6,\n  'token_str': 'A',\n  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07809742540121078,\n  'token': 20,\n  'token_str': 'V',\n  'sequence': 'G G U C V C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07325706630945206,\n  'token': 13,\n  'token_str': 'S',\n  'sequence': 'G G U C S C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnaernie/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnaernie/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, RnaErnieModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieModel.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnaernie/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaErnieForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieForSequencePrediction.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnaernie/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaErnieForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieForTokenPrediction.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnaernie/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaErnieForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieForContactPrediction.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnaernie/#training-details","title":"Training Details","text":"

RNAErnie used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.

"},{"location":"models/rnaernie/#training-data","title":"Training Data","text":"

The RNAErnie model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.

RNAErnie used a subset of RNAcentral for pre-training. The subset contains 23 million sequences. RNAErnie preprocessed all tokens by replacing \u201cT\u201ds with \u201cU\u201ds.

Note that RnaTokenizer will convert \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False.
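The normalisation performed before tokenization can be pictured with a standalone sketch. It mirrors the upper-casing and T-to-U replacement that the tokenizer applies, without importing multimolecule; the function name `normalize_rna` is hypothetical and not part of the library.

```python
def normalize_rna(sequence: str, replace_T_with_U: bool = True, do_upper_case: bool = True) -> str:
    """Hypothetical helper that mimics the tokenizer's text normalisation."""
    if do_upper_case:
        sequence = sequence.upper()
    if replace_T_with_U:
        # DNA-style thymine is rewritten as uracil so the input matches the RNA alphabet
        sequence = sequence.replace("T", "U")
    return sequence


converted = normalize_rna("acgt")
untouched = normalize_rna("acgt", replace_T_with_U=False)
print(converted, untouched)
```

With replacement disabled, the "T" is kept and, per the tokenizer examples in this documentation, maps to the unknown token.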

"},{"location":"models/rnaernie/#training-procedure","title":"Training Procedure","text":""},{"location":"models/rnaernie/#preprocessing","title":"Preprocessing","text":"

RNAErnie used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
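As a rough illustration, BERT's published procedure selects 15% of the tokens for prediction, then replaces 80% of them with the mask token, swaps 10% for a random token, and leaves 10% unchanged. The sketch below applies that scheme to nucleotides; whether RNAErnie uses exactly these ratios is an assumption, and `bert_style_mask` and `NUCLEOTIDES` are hypothetical names.

```python
import random

NUCLEOTIDES = ["A", "C", "G", "U"]  # assumed replacement vocabulary, for illustration only


def bert_style_mask(tokens, mask_prob=0.15, mask_token="<mask>", seed=0):
    """Return (masked_tokens, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(NUCLEOTIDES)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


tokens = list("UAGCUUAUCAGACUGAUGUUGA")
masked, labels = bert_style_mask(tokens)
print(masked)
```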

"},{"location":"models/rnaernie/#pretraining","title":"PreTraining","text":"

RNAErnie uses a special 3-stage training pipeline to pre-train the model, each with a different masking strategy:

Base-level Masking: The masking applies to each nucleotide in the sequence.

Subsequence-level Masking: The masking applies to subsequences of 4-8bp in the sequence.

Motif-level Masking: The model is trained on motif datasets.
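The difference between the first two stages can be sketched in a few lines. This is a minimal illustration, assuming a 15% per-nucleotide rate for base-level masking and a single contiguous span per sequence for subsequence-level masking; both choices and both function names are hypothetical. Motif-level masking depends on motif datasets and is not sketched.

```python
import random


def base_level_mask(sequence, mask_prob=0.15, seed=0):
    """Pick individual nucleotide positions independently."""
    rng = random.Random(seed)
    return [i for i in range(len(sequence)) if rng.random() < mask_prob]


def subsequence_level_mask(sequence, min_len=4, max_len=8, seed=0):
    """Pick one contiguous span of min_len..max_len nucleotides."""
    rng = random.Random(seed)
    span = min(rng.randint(min_len, max_len), len(sequence))
    start = rng.randint(0, len(sequence) - span)
    return list(range(start, start + span))


sequence = "UAGCUUAUCAGACUGAUGUUGA"
base_positions = base_level_mask(sequence)
span_positions = subsequence_level_mask(sequence)
print(base_positions, span_positions)
```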

The model was trained on 4 NVIDIA V100 GPUs, each with 32 GiB of memory.

"},{"location":"models/rnaernie/#citation","title":"Citation","text":"

Citation information is not available for papers published in Closed Access / Author-Fee journals.

"},{"location":"models/rnaernie/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the RNAErnie paper for questions or comments on the paper/model.

"},{"location":"models/rnaernie/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie","title":"multimolecule.models.rnaernie","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default None): Alphabet to use for tokenization.

nmers (int, default 1): Size of kmer to tokenize.

codon (bool, default False): Whether to tokenize into codons.

replace_T_with_U (bool, default True): Whether to replace T with U.

do_upper_case (bool, default True): Whether to convert input to uppercase.
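To make the difference between `nmers` and `codon` concrete, here is a standalone sketch of the splitting logic, modelled on the `_tokenize` method shown in the source listing for this class. The function name `split_kmers` is hypothetical; it only splits text and does not map tokens to ids.

```python
def split_kmers(text: str, nmers: int = 1, codon: bool = False) -> list:
    """Hypothetical helper: split a normalised sequence into tokens."""
    if codon:
        if len(text) % 3 != 0:
            raise ValueError(
                f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
            )
        # codons are non-overlapping windows of 3 nucleotides
        return [text[i : i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # k-mers are overlapping windows that advance one nucleotide at a time
        return [text[i : i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


overlapping = split_kmers("UAGCUUAUC", nmers=3)
codons = split_kmers("UAGCUUAUC", codon=True)
print(overlapping)
print(codons)
```

A 9-nucleotide input therefore yields 7 overlapping 3-mers but only 3 codons, which matches the token counts in the examples for this class.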

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig","title":"RnaErnieConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a RnaErnieModel. It is used to instantiate a RnaErnie model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaErnie Bruce-ywj/rnaernie architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int, default 26): Vocabulary size of the RnaErnie model. Defines the number of different tokens that can be represented by the input_ids passed when calling RnaErnieModel.

hidden_size (int, default 768): Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, default 12): Number of hidden layers in the Transformer encoder.

num_attention_heads (int, default 12): Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, default 3072): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.

hidden_dropout (float, default 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, default 0.1): The dropout ratio for the attention probabilities.

max_position_embeddings (int, default 513): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, default 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, default 1e-12): The epsilon used by the layer normalization layers.
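As a sanity check on these defaults, the parameter count can be estimated by hand. The sketch below assumes a standard BERT layout (word, position, and token-type embeddings with a LayerNorm; biased query, key, value, and output projections; a two-layer feed-forward block; a pooler). That layout and the function name `bert_param_count` are assumptions, and the actual modules may differ.

```python
def bert_param_count(
    vocab_size=26,
    hidden_size=768,
    num_hidden_layers=12,
    intermediate_size=3072,
    max_position_embeddings=513,
    type_vocab_size=2,
):
    """Estimate parameters of a BERT-style encoder with a pooler (assumed layout)."""
    layer_norm = 2 * hidden_size  # weight + bias
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size + layer_norm
    # query, key, value, and output projections, each with a bias, plus one LayerNorm
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # up-projection and down-projection with biases, plus one LayerNorm
    feed_forward = (
        hidden_size * intermediate_size + intermediate_size
        + intermediate_size * hidden_size + hidden_size
        + layer_norm
    )
    pooler = hidden_size * hidden_size + hidden_size
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


total = bert_param_count()
head_dim = 768 // 12  # hidden_size / num_attention_heads
print(f"{total / 1e6:.2f} M parameters, head dim {head_dim}")
```

Under these assumptions the estimate comes to about 86.06 M parameters, in line with the figure in the Model Specification table.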

Examples:

Python Console Session
>>> from multimolecule import RnaErnieModel, RnaErnieConfig\n>>> # Initializing a rnaernie multimolecule/rnaernie style configuration\n>>> configuration = RnaErnieConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnaernie style configuration\n>>> model = RnaErnieModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnaernie/configuration_rnaernie.py Python
class RnaErnieConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a\n    [`RnaErnieModel`][multimolecule.models.RnaErnieModel]. It is used to instantiate a RnaErnie model according to the\n    specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a\n    similar configuration to that of the RnaErnie [Bruce-ywj/rnaernie](https://github.com/Bruce-ywj/rnaernie)\n    architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the RnaErnie model. Defines the number of different tokens that can be represented by\n            the `inputs_ids` passed when calling [`RnaErnieModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n\n    Examples:\n        >>> from multimolecule import RnaErnieModel, RnaErnieConfig\n        >>> # Initializing a rnaernie multimolecule/rnaernie style configuration\n        >>> configuration = RnaErnieConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/rnaernie style configuration\n        >>> model = RnaErnieModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"rnaernie\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        hidden_size: int = 768,\n        num_hidden_layers: int = 12,\n        num_attention_heads: int = 12,\n        intermediate_size: int = 3072,\n        hidden_act: str = \"relu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 513,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"absolute\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.type_vocab_size = 2\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n        
self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForContactPrediction","title":"RnaErnieForContactPrediction","text":"

Bases: RnaErniePreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaErnieConfig, RnaErnieForContactPrediction, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
class RnaErnieForContactPrediction(RnaErniePreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaErnieConfig, RnaErnieForContactPrediction, RnaTokenizer\n        >>> config = RnaErnieConfig()\n        >>> model = RnaErnieForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaErnieConfig):\n        super().__init__(config)\n        self.rnaernie = RnaErnieModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnaernie(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForMaskedLM","title":"RnaErnieForMaskedLM","text":"

Bases: RnaErniePreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaErnieConfig, RnaErnieForMaskedLM, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
class RnaErnieForMaskedLM(RnaErniePreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaErnieConfig, RnaErnieForMaskedLM, RnaTokenizer\n        >>> config = RnaErnieConfig()\n        >>> model = RnaErnieForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.bias\", \"lm_head.decoder.weight\"]\n\n    def __init__(self, config: RnaErnieConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `RnaErnieForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.rnaernie = RnaErnieModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.lm_head.decoder = new_embeddings\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        
**kwargs,\n    ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnaernie(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForSequencePrediction","title":"RnaErnieForSequencePrediction","text":"

Bases: RnaErniePreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaErnieConfig, RnaErnieForSequencePrediction, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
class RnaErnieForSequencePrediction(RnaErniePreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaErnieConfig, RnaErnieForSequencePrediction, RnaTokenizer\n        >>> config = RnaErnieConfig()\n        >>> model = RnaErnieForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.rnaernie = RnaErnieModel(config)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnaernie(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForTokenPrediction","title":"RnaErnieForTokenPrediction","text":"

Bases: RnaErniePreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaErnieConfig, RnaErnieForTokenPrediction, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
class RnaErnieForTokenPrediction(RnaErniePreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaErnieConfig, RnaErnieForTokenPrediction, RnaTokenizer\n        >>> config = RnaErnieConfig()\n        >>> model = RnaErnieForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaErnieConfig):\n        super().__init__(config)\n        self.rnaernie = RnaErnieModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnaernie(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel","title":"RnaErnieModel","text":"

Bases: RnaErniePreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaErnieConfig, RnaErnieModel, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
class RnaErnieModel(RnaErniePreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaErnieConfig, RnaErnieModel, RnaTokenizer\n        >>> config = RnaErnieConfig()\n        >>> model = RnaErnieModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 768])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 768])\n    \"\"\"\n\n    def __init__(self, config: RnaErnieConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n\n        self.embeddings = RnaErnieEmbeddings(config)\n        self.encoder = RnaErnieEncoder(config)\n\n        self.pooler = RnaErniePooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. 
Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, 
input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            
encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

encoder_hidden_states (Tensor | None, default: None)

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask (Tensor | None, default: None)

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]:

1 for tokens that are not masked,

0 for tokens that are masked.

past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default: None)

Tuple of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

use_cache (bool | None, default: None)

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErniePreTrainedModel","title":"RnaErniePreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
class RnaErniePreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = RnaErnieConfig\n    base_model_prefix = \"rnaernie\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"RnaErnieLayer\", \"RnaErnieEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n    def _set_gradient_checkpointing(self, module, value=False):\n        if isinstance(module, RnaErnieEncoder):\n            module.gradient_checkpointing = value\n
"},{"location":"models/rnafm/","title":"RNA-FM","text":"

Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.

"},{"location":"models/rnafm/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of the Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions by Jiayang Chen, Zhihang Hu, Siqi Sun, et al.

The OFFICIAL repository of RNA-FM is at ml4bio/RNA-FM.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team that released RNA-FM did not write a model card for it, so this model card was written by the MultiMolecule team.

"},{"location":"models/rnafm/#model-details","title":"Model Details","text":"

RNA-FM is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/rnafm/#variations","title":"Variations","text":""},{"location":"models/rnafm/#model-specification","title":"Model Specification","text":"Variants Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens RNA-FM 12 640 20 5120 99.52 25.68 12.83 1024 mRNA-FM 1280 239.25 61.43 30.7"},{"location":"models/rnafm/#links","title":"Links","text":""},{"location":"models/rnafm/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/rnafm/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnafm\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.2752501964569092,\n  'token': 21,\n  'token_str': '.',\n  'sequence': 'G G U C. C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.22108642756938934,\n  'token': 23,\n  'token_str': '*',\n  'sequence': 'G G U C * C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.18201279640197754,\n  'token': 25,\n  'token_str': 'I',\n  'sequence': 'G G U C I C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.10875876247882843,\n  'token': 9,\n  'token_str': 'U',\n  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.08898332715034485,\n  'token': 6,\n  'token_str': 'A',\n  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnafm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnafm/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, RnaFmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmModel.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnafm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaFmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmForSequencePrediction.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnafm/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaFmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmForTokenPrediction.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnafm/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaFmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmForContactPrediction.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
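
The example above uses random labels. In practice, contact labels are often derived from a known secondary structure. The following is a minimal sketch, assuming the structure is available in dot-bracket notation without pseudoknots; the helper name dot_bracket_to_contacts is hypothetical and is not part of MultiMolecule.

```python
# Hypothetical helper (not part of MultiMolecule): build an L x L binary
# contact map from a dot-bracket string, assuming no pseudoknots.
def dot_bracket_to_contacts(structure):
    n = len(structure)
    contacts = [[0] * n for _ in range(n)]
    stack = []  # indices of opening brackets that are not yet paired
    for j, symbol in enumerate(structure):
        if symbol == '(':
            stack.append(j)
        elif symbol == ')':
            if not stack:
                raise ValueError('unbalanced structure: unexpected closing bracket')
            i = stack.pop()
            contacts[i][j] = 1
            contacts[j][i] = 1  # contact maps are symmetric
    if stack:
        raise ValueError('unbalanced structure: unclosed opening bracket')
    return contacts


contact_map = dot_bracket_to_contacts('((..))')
```

Wrapping the result in torch.tensor(contact_map) gives a tensor with the same (length, length) shape as the random label used above.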
"},{"location":"models/rnafm/#training-details","title":"Training Details","text":"

RNA-FM was pre-trained with Masked Language Modeling (MLM) as its objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.

"},{"location":"models/rnafm/#training-data","title":"Training Data","text":"

The RNA-FM model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.

RNA-FM applied CD-HIT (CD-HIT-EST) with a cut-off at 100% sequence identity to remove redundancy from RNAcentral. The final dataset contains 23.7 million non-redundant RNA sequences.

RNA-FM preprocessed all tokens by replacing \u201cU\u201ds with \u201cT\u201ds.

Note that during model conversion, \u201cT\u201d is replaced with \u201cU\u201d. RnaTokenizer converts \u201cT\u201ds to \u201cU\u201ds for you; you can disable this behaviour by passing replace_T_with_U=False.
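
The effect of this option can be illustrated without loading the tokenizer. The sketch below is a simplified stand-in that mirrors the upper-casing and replacement steps shown in the RnaTokenizer source further down this page; the function name normalize_rna is an assumption made for illustration only.

```python
# Simplified stand-in for the tokenizer's text normalization; the name
# normalize_rna is hypothetical and is not a MultiMolecule function.
def normalize_rna(text, replace_T_with_U=True, do_upper_case=True):
    if do_upper_case:
        text = text.upper()
    if replace_T_with_U:
        text = text.replace('T', 'U')
    return text


default_form = normalize_rna('acgt')
dna_form = normalize_rna('acgt', replace_T_with_U=False)
```

With the default settings a DNA-style input is mapped onto the RNA alphabet, while replace_T_with_U=False leaves the T in place, which the tokenizer examples below map to the unknown-token id.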

"},{"location":"models/rnafm/#training-procedure","title":"Training Procedure","text":""},{"location":"models/rnafm/#preprocessing","title":"Preprocessing","text":"

RNA-FM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT.
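
A BERT-style masking step can be sketched in plain Python. This is an illustration under stated assumptions, not the RNA-FM training code: the 15% selection rate comes from the description above, while the 80/10/10 split between the mask token, a random token, and the unchanged token is the standard BERT recipe and is assumed here. The names mask_tokens, MASK, and ALPHABET are hypothetical.

```python
import random

MASK = '<mask>'
ALPHABET = ['A', 'C', 'G', 'U']  # assumed pool for random replacements


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    # Returns (masked_tokens, labels); labels hold the original token at
    # selected positions and None everywhere else (excluded from the loss).
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK  # assumed 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(ALPHABET)  # assumed 10%: random token
        # assumed remaining 10%: keep the original token
    return masked, labels


tokens = list('UAGCUUAUCAGACUGAUGUUGA')
masked, labels = mask_tokens(tokens, seed=0)
```

Only positions with a non-None label would contribute to the MLM loss; every other position passes through unchanged.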

"},{"location":"models/rnafm/#pretraining","title":"PreTraining","text":"

The model was trained on 8 NVIDIA A100 GPUs, each with 80 GiB of memory.

"},{"location":"models/rnafm/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article{chen2022interpretable,\n  title={Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions},\n  author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},\n  journal={arXiv preprint arXiv:2204.00300},\n  year={2022}\n}\n
"},{"location":"models/rnafm/#contact","title":"Contact","text":"

Please use the GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the RNA-FM paper for questions or comments on the paper/model.

"},{"location":"models/rnafm/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm","title":"multimolecule.models.rnafm","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default: None): alphabet to use for tokenization.

nmers (int, default: 1): Size of kmer to tokenize.

codon (bool, default: False): Whether to tokenize into codons.

replace_T_with_U (bool, default: True): Whether to replace T with U.

do_upper_case (bool, default: True): Whether to convert input to uppercase.
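
The difference between nmers and codon is easiest to see on a short sequence. The following sketch reimplements the two splitting rules from the RnaTokenizer source shown further down as standalone functions; the function names kmer_tokens and codon_tokens are illustrative and are not part of the library.

```python
# Standalone illustration of the two splitting rules; the function names
# are hypothetical and the real logic lives inside RnaTokenizer.
def kmer_tokens(text, nmers):
    # Overlapping windows with stride 1.
    return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]


def codon_tokens(text):
    # Non-overlapping windows of length 3.
    if len(text) % 3 != 0:
        raise ValueError(f'length must be a multiple of 3, but got {len(text)}')
    return [text[i:i + 3] for i in range(0, len(text), 3)]


sequence = 'UAGCUUAUC'
kmers = kmer_tokens(sequence, 3)
codons = codon_tokens(sequence)
```

For this nine-nucleotide input, the overlapping rule yields seven tokens and the codon rule yields three, which matches the number of ids between the start and end tokens in the examples below.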

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig","title":"RnaFmConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of an RnaFmModel. It is used to instantiate an RNA-FM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RNA-FM ml4bio/RNA-FM architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int | None, default: None): Vocabulary size of the RNA-FM model. Defines the number of different tokens that can be represented by the input_ids passed when calling RnaFmModel. Defaults to 26 if codon=False else 131.

codon (bool, default: False): Whether to use codon tokenization.

hidden_size (int, default: 640): Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, default: 12): Number of hidden layers in the Transformer encoder.

num_attention_heads (int, default: 20): Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, default: 5120): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.

hidden_dropout (float, default: 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, default: 0.1): The dropout ratio for the attention probabilities.

max_position_embeddings (int, default: 1026): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, default: 1e-12): The epsilon used by the layer normalization layers.

position_embedding_type (str, default: 'absolute'): Type of position embedding. Choose one of \"absolute\", \"relative_key\", \"relative_key_query\", \"rotary\". For positional embeddings use \"absolute\". For more information on \"relative_key\", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, default: False): Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, default: True): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

emb_layer_norm_before (bool, default: True): Whether to apply layer normalization after embeddings but before the main stem of the network.

token_dropout (bool, default: False): When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
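
The default sizes can be related to one another with a little arithmetic. The sketch below is a rough estimate rather than an exact parameter count: it assumes a standard Transformer encoder layer with four attention projection matrices (query, key, value, output) and two feed-forward matrices, and it ignores biases, layer normalization, and embeddings.

```python
# Default sizes taken from the parameter list above.
hidden_size = 640
num_hidden_layers = 12
num_attention_heads = 20
intermediate_size = 5120

# Each attention head works on an equal slice of the hidden dimension.
if hidden_size % num_attention_heads != 0:
    raise ValueError('hidden_size must be divisible by num_attention_heads')
head_dim = hidden_size // num_attention_heads

# Assumed layer layout: 4 attention projections plus 2 feed-forward matrices.
attention_weights = 4 * hidden_size * hidden_size
feed_forward_weights = 2 * hidden_size * intermediate_size
encoder_weights = num_hidden_layers * (attention_weights + feed_forward_weights)
```

Under these assumptions each head has 32 dimensions and the encoder stack holds roughly 98 million weights.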

Examples:

Python Console Session
>>> from multimolecule import RnaFmModel, RnaFmConfig\n>>> # Initializing a RNA-FM multimolecule/rnafm style configuration\n>>> configuration = RnaFmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnafm style configuration\n>>> model = RnaFmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnafm/configuration_rnafm.py Python
class RnaFmConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`RnaFmModel`][multimolecule.models.RnaFmModel].\n    It is used to instantiate a RNA-FM model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the RNA-FM\n    [ml4bio/RNA-FM](https://github.com/ml4bio/RNA-FM) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the RNA-FM model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`RnaFmModel`].\n            Defaults to 25 if `codon=False` else 131.\n        codon:\n            Whether to use codon tokenization.\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        position_embedding_type:\n            Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n            For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n            For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n        is_decoder:\n            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n        use_cache:\n            Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n            relevant if `config.is_decoder=True`.\n        emb_layer_norm_before:\n            Whether to apply layer normalization after embeddings but before the main stem of the network.\n        token_dropout:\n            When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n    Examples:\n        >>> from multimolecule import RnaFmModel, RnaFmConfig\n        >>> # Initializing a RNA-FM multimolecule/rnafm style configuration\n        >>> configuration = RnaFmConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/rnafm style configuration\n        >>> model = RnaFmModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"rnafm\"\n\n    def __init__(\n        self,\n        vocab_size: int | None = None,\n        codon: bool = False,\n        hidden_size: int = 640,\n        num_hidden_layers: int = 12,\n        num_attention_heads: int = 20,\n        intermediate_size: int = 5120,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1026,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"absolute\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        emb_layer_norm_before: bool = True,\n        token_dropout: bool = False,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n        if vocab_size is None:\n            vocab_size = 131 if codon else 26\n        self.vocab_size = vocab_size\n        self.codon = codon\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        
self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.emb_layer_norm_before = emb_layer_norm_before\n        self.token_dropout = token_dropout\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(codon)","title":"codon","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(position_embedding_type)","title":"position_embedding_type","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(is_decoder)","title":"is_decoder","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(use_cache)","title":"use_cache","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(emb_layer_norm_before)","title":"emb_layer_norm_before","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(token_dropout)","title":"token_dropout","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmFo
rContactPrediction","title":"RnaFmForContactPrediction","text":"

Bases: RnaFmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaFmConfig, RnaFmForContactPrediction, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmForContactPrediction(RnaFmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaFmConfig, RnaFmForContactPrediction, RnaTokenizer\n        >>> config = RnaFmConfig()\n        >>> model = RnaFmForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaFmConfig):\n        super().__init__(config)\n        self.rnafm = RnaFmModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnafm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForMaskedLM","title":"RnaFmForMaskedLM","text":"

Bases: RnaFmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaFmConfig, RnaFmForMaskedLM, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmForMaskedLM(RnaFmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaFmConfig, RnaFmForMaskedLM, RnaTokenizer\n        >>> config = RnaFmConfig()\n        >>> model = RnaFmForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: RnaFmConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `RnaFmForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.rnafm = RnaFmModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnafm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForPreTraining","title":"RnaFmForPreTraining","text":"

Bases: RnaFmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaFmConfig, RnaFmForPreTraining, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"contact_map\"].shape\ntorch.Size([1, 5, 5, 1])\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmForPreTraining(RnaFmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaFmConfig, RnaFmForPreTraining, RnaTokenizer\n        >>> config = RnaFmConfig()\n        >>> model = RnaFmForPreTraining(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<AddBackward0>)\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"contact_map\"].shape\n        torch.Size([1, 5, 5, 1])\n    \"\"\"\n\n    _tied_weights_keys = [\n        \"lm_head.decoder.weight\",\n        \"lm_head.decoder.bias\",\n        \"pretrain.predictions.decoder.weight\",\n        \"pretrain.predictions.decoder.bias\",\n        \"pretrain.predictions_ss.decoder.weight\",\n        \"pretrain.predictions_ss.decoder.bias\",\n    ]\n\n    def __init__(self, config: RnaFmConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `RnaFmForPreTraining` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.rnafm = RnaFmModel(config, add_pooling_layer=False)\n        self.pretrain = RnaFmPreTrainingHeads(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_output_embeddings(self):\n        return self.pretrain.predictions.decoder\n\n    def set_output_embeddings(self, embeddings):\n        self.pretrain.predictions.decoder = embeddings\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor 
| None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels_mlm: Tensor | None = None,\n        labels_contact: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | RnaFmForPreTrainingOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnafm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        total_loss, logits, contact_map = self.pretrain(\n            outputs, attention_mask, input_ids, labels_mlm=labels_mlm, labels_contact=labels_contact\n        )\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return RnaFmForPreTrainingOutput(\n            loss=total_loss,\n            logits=logits,\n            contact_map=contact_map,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForSequencePrediction","title":"RnaFmForSequencePrediction","text":"

Bases: RnaFmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaFmConfig, RnaFmForSequencePrediction, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmForSequencePrediction(RnaFmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaFmConfig, RnaFmForSequencePrediction, RnaTokenizer\n        >>> config = RnaFmConfig()\n        >>> model = RnaFmForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaFmConfig):\n        super().__init__(config)\n        self.rnafm = RnaFmModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnafm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForTokenPrediction","title":"RnaFmForTokenPrediction","text":"

Bases: RnaFmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmForTokenPrediction(RnaFmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n        >>> config = RnaFmConfig()\n        >>> model = RnaFmForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaFmConfig):\n        super().__init__(config)\n        self.rnafm = RnaFmModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnafm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel","title":"RnaFmModel","text":"

Bases: RnaFmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaFmConfig, RnaFmModel, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 640])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 640])\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmModel(RnaFmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaFmConfig, RnaFmModel, RnaTokenizer\n        >>> config = RnaFmConfig()\n        >>> model = RnaFmModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 640])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 640])\n    \"\"\"\n\n    def __init__(self, config: RnaFmConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = RnaFmEmbeddings(config)\n        self.encoder = RnaFmEncoder(config)\n        self.pooler = RnaFmPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        
else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape 
[num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

encoder_hidden_states (Tensor | None, default: None)

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask (Tensor | None, default: None)

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.

past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default: None)

Tuple of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head).

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

use_cache (bool | None, default: None)

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        attention_mask=attention_mask,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
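When no attention_mask is passed, the forward method above derives one from the padding token. The following is a minimal pure-Python sketch of that rule using nested lists instead of tensors; the helper name default_attention_mask and the token ids in the example are hypothetical and only serve as illustration.

```python
def default_attention_mask(input_ids, pad_token_id):
    """Build a mask where 1 marks a real token and 0 marks padding.

    Mirrors the fallback in the source above: compare against the pad token
    when one is configured, otherwise attend to every position.
    """
    if pad_token_id is None:
        return [[1 for _ in row] for row in input_ids]
    return [[int(tok != pad_token_id) for tok in row] for row in input_ids]


# Hypothetical token ids; 0 stands in for the padding token here.
batch = [[1, 6, 7, 8, 2], [1, 6, 2, 0, 0]]
print(default_attention_mask(batch, pad_token_id=0))
# prints [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Passing an explicit attention_mask is still the safer choice when a padding id can also occur as a legitimate token.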
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmPreTrainedModel","title":"RnaFmPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = RnaFmConfig\n    base_model_prefix = \"rnafm\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"RnaFmLayer\", \"RnaFmEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/rnamsm/","title":"RNA-MSM","text":"

Pre-trained model on non-coding RNA (ncRNA) with multi (homologous) sequence alignment using a masked language modeling (MLM) objective.

"},{"location":"models/rnamsm/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of the Multiple sequence alignment-based RNA language model and its application to structural inference by Yikun Zhang, Mei Lang, Jiuhong Jiang, Zhiqiang Gao, et al.

The OFFICIAL repository of RNA-MSM is at yikunpku/RNA-MSM.

Caution

The MultiMolecule team is aware of a potential risk in reproducing the results of RNA-MSM.

The original implementation of RNA-MSM used a custom tokenizer that does not append an <eos> token to the end of the input sequence, consistent with MSA Transformer. This should not affect the performance of the model in most cases, but it can lead to unexpected behavior in some.

Please set eos_token=None explicitly in the tokenizer if you want the exact behavior of the original implementation.

See more at issue #10

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team releasing RNA-MSM did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/rnamsm/#model-details","title":"Model Details","text":"

RNA-MSM is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/rnamsm/#model-specification","title":"Model Specification","text":"Num Layers: 10; Hidden Size: 768; Num Heads: 12; Intermediate Size: 3072; Num Parameters (M): 95.92; FLOPs (G): 21.66; MACs (G): 10.57; Max Num Tokens: 1024"},{"location":"models/rnamsm/#links","title":"Links","text":""},{"location":"models/rnamsm/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/rnamsm/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnamsm\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.25111356377601624,\n  'token': 9,\n  'token_str': 'U',\n  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.1200353354215622,\n  'token': 14,\n  'token_str': 'W',\n  'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.10132723301649094,\n  'token': 15,\n  'token_str': 'K',\n  'sequence': 'G G U C K C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.08383019268512726,\n  'token': 18,\n  'token_str': 'D',\n  'sequence': 'G G U C D C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.05737845227122307,\n  'token': 6,\n  'token_str': 'A',\n  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnamsm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnamsm/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, RnaMsmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmModel.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnamsm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaMsmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmForSequencePrediction.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnamsm/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaMsmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmForTokenPrediction.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnamsm/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaMsmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmForContactPrediction.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnamsm/#training-details","title":"Training Details","text":"

RNA-MSM used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
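The masking step can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions, not the RNA-MSM training code: it always substitutes the mask symbol and masks a fixed count of positions (15% of the length, rounded, at least one), and the names mask_tokens and MASK_TOKEN are hypothetical.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder symbol, as in the fill-mask example


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a single sequence.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a loss would only be computed where a token was hidden.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Assumption: mask a fixed number of positions, and at least one,
    # so that short sequences still yield a prediction target.
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


sequence = list("UAGCUUAUCAGACUGAUGUUGA")
masked, labels = mask_tokens(sequence, seed=0)
print("".join(masked))
```

Keeping the labels as None outside masked positions makes it explicit that unmasked tokens do not contribute to the objective.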

"},{"location":"models/rnamsm/#training-data","title":"Training Data","text":"

The RNA-MSM model was pre-trained on Rfam. The Rfam database is a collection of RNA sequence families of structural RNAs including non-coding RNA genes as well as cis-regulatory elements. RNA-MSM used Rfam 14.7 which contains 4,069 RNA families.

To avoid potential overfitting in structural inference, RNA-MSM excluded families with experimentally determined structures, such as ribosomal RNAs, transfer RNAs, and small nuclear RNAs. The final dataset contains 3,932 RNA families. The median number of MSA sequences that RNAcmap3 produced for these families is 2,184.

To increase the number of homologous sequences, RNA-MSM used an automatic pipeline, RNAcmap3, for homolog search and sequence alignment. RNAcmap3 is a pipeline that combines the BLAST-N, INFERNAL, Easel, RNAfold and evolutionary coupling tools to generate homologous sequences.

RNA-MSM preprocessed all tokens by replacing \u201cT\u201ds with \u201cU\u201ds and substituting \u201cR\u201d, \u201cY\u201d, \u201cK\u201d, \u201cM\u201d, \u201cS\u201d, \u201cW\u201d, \u201cB\u201d, \u201cD\u201d, \u201cH\u201d, \u201cV\u201d, \u201cN\u201d with \u201cX\u201d.

Note that RnaTokenizer will convert \u201cT\u201ds to \u201cU\u201ds for you. You can disable this behaviour by passing replace_T_with_U=False. RnaTokenizer does not perform the other substitutions.
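
Because RnaTokenizer only handles the T to U conversion, the remaining substitution described above would have to be applied separately if it is needed. A minimal sketch, assuming upper-case input; the function name preprocess is hypothetical and not part of MultiMolecule:

```python
# Ambiguity codes that the RNA-MSM preprocessing collapses into X.
AMBIGUOUS = 'RYKMSWBDHVN'
# One translation table: T becomes U, every ambiguity code becomes X.
TABLE = str.maketrans('T' + AMBIGUOUS, 'U' + 'X' * len(AMBIGUOUS))


def preprocess(sequence):
    # Assumes the sequence is already upper case.
    return sequence.translate(TABLE)


print(preprocess('ACGTNRYACGU'))  # ACGUXXXACGU
```

A single translation table keeps the two substitutions in one pass over the sequence.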

"},{"location":"models/rnamsm/#training-procedure","title":"Training Procedure","text":""},{"location":"models/rnamsm/#preprocessing","title":"Preprocessing","text":"

RNA-MSM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:

"},{"location":"models/rnamsm/#pretraining","title":"PreTraining","text":"

The model was trained on 8 NVIDIA V100 GPUs, each with 32 GiB of memory.

"},{"location":"models/rnamsm/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article{zhang2023multiple,\n    author = {Zhang, Yikun and Lang, Mei and Jiang, Jiuhong and Gao, Zhiqiang and Xu, Fan and Litfin, Thomas and Chen, Ke and Singh, Jaswinder and Huang, Xiansong and Song, Guoli and Tian, Yonghong and Zhan, Jian and Chen, Jie and Zhou, Yaoqi},\n    title = \"{Multiple sequence alignment-based RNA language model and its application to structural inference}\",\n    journal = {Nucleic Acids Research},\n    volume = {52},\n    number = {1},\n    pages = {e3-e3},\n    year = {2023},\n    month = {11},\n    abstract = \"{Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because\u00a0unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, as it can provide significantly more homologous sequences than manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding with its embedding in base pair and solvent-accessible surface area prediction. 
We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.}\",\n    issn = {0305-1048},\n    doi = {10.1093/nar/gkad1031},\n    url = {https://doi.org/10.1093/nar/gkad1031},\n    eprint = {https://academic.oup.com/nar/article-pdf/52/1/e3/55443207/gkad1031.pdf},\n}\n
"},{"location":"models/rnamsm/#contact","title":"Contact","text":"

Please use the GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the RNA-MSM paper for questions or comments on the paper/model.

"},{"location":"models/rnamsm/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm","title":"multimolecule.models.rnamsm","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None): Alphabet to use for tokenization. Default: None.

nmers (int): Size of kmer to tokenize. Default: 1.

codon (bool): Whether to tokenize into codons. Default: False.

replace_T_with_U (bool): Whether to replace T with U. Default: True.

do_upper_case (bool): Whether to convert input to uppercase. Default: True.

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
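The difference between nmers=3 and codon=True in the session above is the window stride: n-mers overlap, codons do not. The following standalone sketch mirrors the slicing in RnaTokenizer._tokenize without the vocabulary lookup; the function names kmers and codons are illustrative only.

```python
def kmers(sequence, k):
    # Overlapping windows: the start index advances by one each step.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def codons(sequence):
    # Non-overlapping windows of three: the length must divide evenly.
    if len(sequence) % 3 != 0:
        raise ValueError('length of input sequence must be a multiple of 3, got %d' % len(sequence))
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


print(kmers('UAGCUUAUC', 3))  # ['UAG', 'AGC', 'GCU', 'CUU', 'UUA', 'UAU', 'AUC']
print(codons('UAGCUUAUC'))    # ['UAG', 'CUU', 'AUC']
```

This is why the nine-nucleotide input yields seven ids with nmers=3 but only three with codon=True, not counting the special tokens at either end.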
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig","title":"RnaMsmConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a RnaMsmModel. It is used to instantiate a RnaMsm model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaMsm yikunpku/RNA-MSM architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int): Vocabulary size of the RnaMsm model. Defines the number of different tokens that can be represented by the input_ids passed when calling [RnaMsmModel]. Default: 26.

hidden_size (int): Dimensionality of the encoder layers and the pooler layer. Default: 768.

num_hidden_layers (int): Number of hidden layers in the Transformer encoder. Default: 10.

num_attention_heads (int): Number of attention heads for each attention layer in the Transformer encoder. Default: 12.

intermediate_size (int): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder. Default: 3072.

hidden_dropout (float): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. Default: 0.1.

attention_dropout (float): The dropout ratio for the attention probabilities. Default: 0.1.

max_position_embeddings (int): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Default: 1024.

initializer_range (float): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. Default: 0.02.

layer_norm_eps (float): The epsilon used by the layer normalization layers. Default: 1e-12.

Examples:

Python Console Session
>>> from multimolecule import RnaMsmModel, RnaMsmConfig\n>>> # Initializing a RNA-MSM multimolecule/rnamsm style configuration\n>>> configuration = RnaMsmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnamsm style configuration\n>>> model = RnaMsmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnamsm/configuration_rnamsm.py Python
class RnaMsmConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`RnaMsmModel`][multimolecule.models.RnaMsmModel].\n    It is used to instantiate a RnaMsm model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaMsm\n    [yikunpku/RNA-MSM](https://github.com/yikunpku/RNA-MSM) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the RnaMsm model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`RnaMsmModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n\n    Examples:\n        >>> from multimolecule import RnaMsmModel, RnaMsmConfig\n        >>> # Initializing a RNA-MSM multimolecule/rnamsm style configuration\n        >>> configuration = RnaMsmConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/rnamsm style configuration\n        >>> model = RnaMsmModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"rnamsm\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        hidden_size: int = 768,\n        num_hidden_layers: int = 10,\n        num_attention_heads: int = 12,\n        intermediate_size: int = 3072,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1024,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"absolute\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        max_tokens_per_msa: int = 2**14,\n        layer_type: str = \"standard\",\n        attention_type: str = \"standard\",\n        embed_positions_msa: bool = True,\n        attention_bias: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = 
intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.max_tokens_per_msa = max_tokens_per_msa\n        self.layer_type = layer_type\n        self.attention_type = attention_type\n        self.embed_positions_msa = embed_positions_msa\n        self.attention_bias = attention_bias\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForContactPrediction","title":"RnaMsmForContactPrediction","text":"

Bases: RnaMsmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaMsmConfig, RnaMsmForContactPrediction, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmForContactPrediction(RnaMsmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaMsmConfig, RnaMsmForContactPrediction, RnaTokenizer\n        >>> config = RnaMsmConfig()\n        >>> model = RnaMsmForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaMsmConfig):\n        super().__init__(config)\n        self.rnamsm = RnaMsmModel(config, add_pooling_layer=True)\n        head_config = HeadConfig(output_name=\"row_attentions\")\n        self.contact_head = ContactPredictionHead(config, head_config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaMsmContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnamsm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return RnaMsmContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            col_attentions=outputs.col_attentions,\n            row_attentions=outputs.row_attentions,\n        )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForMaskedLM","title":"RnaMsmForMaskedLM","text":"

Bases: RnaMsmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaMsmConfig, RnaMsmForMaskedLM, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmForMaskedLM(RnaMsmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaMsmConfig, RnaMsmForMaskedLM, RnaTokenizer\n        >>> config = RnaMsmConfig()\n        >>> model = RnaMsmForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: RnaMsmConfig):\n        super().__init__(config)\n        self.rnamsm = RnaMsmModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config, weight=self.rnamsm.embeddings.word_embeddings.weight)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaMsmForMaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnamsm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return RnaMsmForMaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            col_attentions=outputs.col_attentions,\n            row_attentions=outputs.row_attentions,\n        )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForPreTraining","title":"RnaMsmForPreTraining","text":"

Bases: RnaMsmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaMsmConfig, RnaMsmForPreTraining, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"contact_map\"].shape\ntorch.Size([1, 5, 5, 1])\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmForPreTraining(RnaMsmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaMsmConfig, RnaMsmForPreTraining, RnaTokenizer\n        >>> config = RnaMsmConfig()\n        >>> model = RnaMsmForPreTraining(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<AddBackward0>)\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"contact_map\"].shape\n        torch.Size([1, 5, 5, 1])\n    \"\"\"\n\n    _tied_weights_keys = [\n        \"lm_head.decoder.weight\",\n        \"lm_head.decoder.bias\",\n        \"pretrain.predictions.decoder.weight\",\n        \"pretrain.predictions.decoder.bias\",\n        \"pretrain.predictions_ss.decoder.weight\",\n        \"pretrain.predictions_ss.decoder.bias\",\n    ]\n\n    def __init__(self, config: RnaMsmConfig):\n        super().__init__(config)\n        self.rnamsm = RnaMsmModel(config, add_pooling_layer=False)\n        self.pretrain = RnaMsmPreTrainingHeads(config, weight=self.rnamsm.embeddings.word_embeddings.weight)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels_mlm: Tensor | None = None,\n        labels_contact: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaMsmForPreTrainingOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnamsm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        total_loss, logits, contact_map = self.pretrain(\n            outputs, attention_mask, input_ids, labels_mlm=labels_mlm, labels_contact=labels_contact\n        )\n\n        if not return_dict:\n            output = (logits, contact_map) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return RnaMsmForPreTrainingOutput(\n            loss=total_loss,\n            logits=logits,\n            contact_map=contact_map,\n            hidden_states=outputs.hidden_states,\n            col_attentions=outputs.col_attentions,\n            row_attentions=outputs.row_attentions,\n        )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForSequencePrediction","title":"RnaMsmForSequencePrediction","text":"

Bases: RnaMsmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaMsmConfig, RnaMsmForSequencePrediction, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmForSequencePrediction(RnaMsmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaMsmConfig, RnaMsmForSequencePrediction, RnaTokenizer\n        >>> config = RnaMsmConfig()\n        >>> model = RnaMsmForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaMsmConfig):\n        super().__init__(config)\n        self.rnamsm = RnaMsmModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool = False,\n        output_hidden_states: bool = False,\n        return_dict: bool = True,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaMsmSequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnamsm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return RnaMsmSequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            col_attentions=outputs.col_attentions,\n            row_attentions=outputs.row_attentions,\n        )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForTokenPrediction","title":"RnaMsmForTokenPrediction","text":"

Bases: RnaMsmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaMsmConfig, RnaMsmForTokenPrediction, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmForTokenPrediction(RnaMsmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaMsmConfig, RnaMsmForTokenPrediction, RnaTokenizer\n        >>> config = RnaMsmConfig()\n        >>> model = RnaMsmForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaMsmConfig):\n        super().__init__(config)\n        self.rnamsm = RnaMsmModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool = False,\n        output_hidden_states: bool = False,\n        return_dict: bool = True,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaMsmTokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnamsm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return RnaMsmTokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            col_attentions=outputs.col_attentions,\n            row_attentions=outputs.row_attentions,\n        )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmModel","title":"RnaMsmModel","text":"

Bases: RnaMsmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaMsmConfig, RnaMsmModel, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmModel(RnaMsmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaMsmConfig, RnaMsmModel, RnaTokenizer\n        >>> config = RnaMsmConfig()\n        >>> model = RnaMsmModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 768])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 768])\n    \"\"\"\n\n    def __init__(self, config: RnaMsmConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = RnaMsmEmbeddings(config)\n        self.encoder = RnaMsmEncoder(config)\n        self.pooler = RnaMsmPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaMsmModelOutputWithPooling:\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        elif inputs_embeds is None:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id) if self.pad_token_id is not None else torch.ones_like(input_ids)\n            )\n\n        unsqueeze_input = input_ids.ndim == 2\n        if unsqueeze_input:\n            input_ids = input_ids.unsqueeze(1)\n        if attention_mask.ndim == 2:\n            attention_mask = attention_mask.unsqueeze(1)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            
attention_mask=attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        if unsqueeze_input:\n            sequence_output = sequence_output.squeeze(1)\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return RnaMsmModelOutputWithPooling(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            hidden_states=encoder_outputs.hidden_states,\n            col_attentions=encoder_outputs.col_attentions,\n            row_attentions=encoder_outputs.row_attentions,\n        )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmPreTrainedModel","title":"RnaMsmPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = RnaMsmConfig\n    base_model_prefix = \"rnamsm\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"RnaMsmLayer\", \"RnaMsmAxialLayer\", \"RnaMsmPkmLayer\", \"RnaMsmEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm) and module.elementwise_affine:\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/splicebert/","title":"SpliceBERT","text":"

Pre-trained model on messenger RNA precursor (pre-mRNA) using a masked language modeling (MLM) objective.

"},{"location":"models/splicebert/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of the paper Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction by Ken Chen et al.

The OFFICIAL repository of SpliceBERT is at chenkenbio/SpliceBERT.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team releasing SpliceBERT did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/splicebert/#model-details","title":"Model Details","text":"

SpliceBERT is a BERT-style model pre-trained on a large corpus of messenger RNA precursor sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/splicebert/#variations","title":"Variations","text":""},{"location":"models/splicebert/#model-specification","title":"Model Specification","text":"Variants Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens splicebert 6 512 16 2048 19.72 5.04 2.52 1024 splicebert.510 19.45 510 splicebert-human.510"},{"location":"models/splicebert/#links","title":"Links","text":""},{"location":"models/splicebert/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/splicebert/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/splicebert\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.340412974357605,\n  'token': 9,\n  'token_str': 'U',\n  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.13882005214691162,\n  'token': 12,\n  'token_str': 'Y',\n  'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.056610625237226486,\n  'token': 7,\n  'token_str': 'C',\n  'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.05455885827541351,\n  'token': 19,\n  'token_str': 'H',\n  'sequence': 'G G U C H C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.05356108024716377,\n  'token': 14,\n  'token_str': 'W',\n  'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
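The candidates returned above include IUPAC ambiguity codes (Y, H, W) next to concrete nucleotides. The following is a minimal sketch of one way to post-process such output, assuming only A/C/G/U candidates are of interest. The `canonical_only` helper is hypothetical and not part of multimolecule or transformers; the scores are copied (rounded) from the output above.

```python
# Predictions copied from the pipeline output above (scores rounded).
predictions = [
    {'token_str': 'U', 'score': 0.3404},
    {'token_str': 'Y', 'score': 0.1388},
    {'token_str': 'C', 'score': 0.0566},
    {'token_str': 'H', 'score': 0.0546},
    {'token_str': 'W', 'score': 0.0536},
]

CANONICAL = {'A', 'C', 'G', 'U'}


def canonical_only(preds):
    """Hypothetical helper: keep A/C/G/U candidates and renormalize their scores."""
    kept = [p for p in preds if p['token_str'] in CANONICAL]
    total = sum(p['score'] for p in kept)
    return [{'token_str': p['token_str'], 'score': p['score'] / total} for p in kept]


filtered = canonical_only(predictions)
best = max(filtered, key=lambda p: p['score'])
print(best['token_str'], round(best['score'], 3))
```

Renormalizing over the canonical nucleotides is a design choice for this sketch; the raw scores remain the model's actual output distribution.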
"},{"location":"models/splicebert/#downstream-use","title":"Downstream Use","text":""},{"location":"models/splicebert/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, SpliceBertModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertModel.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/splicebert/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, SpliceBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForSequencePrediction.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/splicebert/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, SpliceBertForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForTokenPrediction.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/splicebert/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, SpliceBertForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForContactPrediction.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/splicebert/#training-details","title":"Training Details","text":"

SpliceBERT used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.

"},{"location":"models/splicebert/#training-data","title":"Training Data","text":"

The SpliceBERT model was pre-trained on messenger RNA precursor sequences from the UCSC Genome Browser. The UCSC Genome Browser provides visualization, analysis, and download of comprehensive vertebrate genome data with aligned annotation tracks (known genes, predicted genes, ESTs, mRNAs, CpG islands, etc.).

SpliceBERT collected reference genomes and gene annotations from the UCSC Genome Browser for 72 vertebrate species. It applied bedtools getfasta to extract pre-mRNA sequences from the reference genomes based on the gene annotations. The pre-mRNA sequences are then used to pre-train SpliceBERT. The pre-training data contains 2 million pre-mRNA sequences with a total length of 65 billion nucleotides.

Note: RnaTokenizer converts \u201cT\u201ds to \u201cU\u201ds for you. You can disable this behaviour by passing replace_T_with_U=False.
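The conversion described in the note can be sketched in plain Python. This is a stand-alone illustration of the normalization step only, not the library code: the `normalize` function is a hypothetical name, and only the `replace_T_with_U` and `do_upper_case` flags come from the RnaTokenizer documentation.

```python
def normalize(text, replace_T_with_U=True, do_upper_case=True):
    """Sketch of the text normalization applied before tokenization."""
    if do_upper_case:
        text = text.upper()
    if replace_T_with_U:
        # Only uppercase T is replaced, so casing is handled first.
        text = text.replace('T', 'U')
    return text


print(normalize('acgt'))
print(normalize('acgt', replace_T_with_U=False))
```

With the replacement disabled, a T stays in the sequence and the RNA tokenizer has to treat it as whatever its alphabet defines for that character.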

"},{"location":"models/splicebert/#training-procedure","title":"Training Procedure","text":""},{"location":"models/splicebert/#preprocessing","title":"Preprocessing","text":"

SpliceBERT used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
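A BERT-style masking step can be sketched as follows. This is a minimal illustration under stated assumptions, not the SpliceBERT training code: the 15% rate comes from the text above, while the 80/10/10 split between mask token, random token, and unchanged token is BERT's published recipe and is assumed here. The `mask_tokens` helper and the four-letter `ALPHABET` are hypothetical.

```python
import random

MASK = '<mask>'
ALPHABET = ['A', 'C', 'G', 'U']  # assumed toy vocabulary, not the tokenizer's


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)                  # assumed: 80% mask token
            elif roll < 0.9:
                masked.append(rng.choice(ALPHABET))  # assumed: 10% random token
            else:
                masked.append(tok)                   # assumed: 10% unchanged
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels


tokens = list('UAGCUUAUCAGACUGAUGUUGA')
masked, labels = mask_tokens(tokens)
print(masked)
```

Positions with a `None` label are left untouched and would be ignored by the loss.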

"},{"location":"models/splicebert/#pretraining","title":"PreTraining","text":"

The model was trained on 8 NVIDIA V100 GPUs.

SpliceBERT was trained in a two-stage process:

  1. Pre-train with sequences of a fixed length of 510 nucleotides.
  2. Pre-train with sequences of a variable length between 64 and 1024 nucleotides.

The intermediate model after the first stage is available as multimolecule/splicebert.510.

The SpliceBERT authors also pre-trained a model on human data only to validate the contribution of multi-species pre-training. The intermediate model after the first stage is available as multimolecule/splicebert-human.510.

"},{"location":"models/splicebert/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article {chen2023self,\n    author = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},\n    title = {Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction},\n    elocation-id = {2023.01.31.526427},\n    year = {2023},\n    doi = {10.1101/2023.01.31.526427},\n    publisher = {Cold Spring Harbor Laboratory},\n    abstract = {RNA splicing is an important post-transcriptional process of gene expression in eukaryotic cells. Predicting RNA splicing from primary sequences can facilitate the interpretation of genomic variants. In this study, we developed a novel self-supervised pre-trained language model, SpliceBERT, to improve sequence-based RNA splicing prediction. Pre-training on pre-mRNA sequences from vertebrates enables SpliceBERT to capture evolutionary conservation information and characterize the unique property of splice sites. SpliceBERT also improves zero-shot prediction of variant effects on splicing by considering sequence context information, and achieves superior performance for predicting branchpoint in the human genome and splice sites across species. Our study highlighted the importance of pre-training genomic language models on a diverse range of species and suggested that pre-trained language models were promising for deciphering the sequence logic of RNA splicing.},\n    URL = {https://www.biorxiv.org/content/early/2023/05/09/2023.01.31.526427},\n    eprint = {https://www.biorxiv.org/content/early/2023/05/09/2023.01.31.526427.full.pdf},\n    journal = {bioRxiv}\n}\n
"},{"location":"models/splicebert/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the SpliceBERT paper for questions or comments on the paper/model.

"},{"location":"models/splicebert/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert","title":"multimolecule.models.splicebert","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None): Alphabet to use for tokenization. Default: None.

nmers (int): Size of kmer to tokenize. Default: 1.

codon (bool): Whether to tokenize into codons. Default: False.

replace_T_with_U (bool): Whether to replace T with U. Default: True.

do_upper_case (bool): Whether to convert input to uppercase. Default: True.
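The difference between the nmers and codon options can be sketched in plain Python. The two functions below mirror the slicing logic shown in the tokenizer source further down, but they are a stand-alone illustration with hypothetical names (`sliding_kmers`, `codons`), not the library API.

```python
def sliding_kmers(text, k):
    """Overlapping k-mers with stride 1, as used when nmers > 1."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def codons(text):
    """Non-overlapping triplets, as used when codon=True."""
    if len(text) % 3 != 0:
        raise ValueError(
            f'length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}'
        )
    return [text[i:i + 3] for i in range(0, len(text), 3)]


seq = 'UAGCUUAUC'
kmers = sliding_kmers(seq, 3)
triplets = codons(seq)
print(kmers)
print(triplets)
```

For a 9-nucleotide input this gives seven overlapping 3-mers but only three codons, and the codons are every third 3-mer. That is consistent with the token ids in the examples below, where the codon ids 83, 49, 22 appear at positions 0, 3, and 6 of the nmers=3 output.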

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig","title":"SpliceBertConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a SpliceBertModel. It is used to instantiate a SpliceBert model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SpliceBert biomed-AI/SpliceBERT architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int): Vocabulary size of the SpliceBert model. Defines the number of different tokens that can be represented by the input_ids passed when calling SpliceBertModel. Default: 26.

hidden_size (int): Dimensionality of the encoder layers and the pooler layer. Default: 512.

num_hidden_layers (int): Number of hidden layers in the Transformer encoder. Default: 6.

num_attention_heads (int): Number of attention heads for each attention layer in the Transformer encoder. Default: 16.

intermediate_size (int): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder. Default: 2048.

hidden_dropout (float): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. Default: 0.1.

attention_dropout (float): The dropout ratio for the attention probabilities. Default: 0.1.

max_position_embeddings (int): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Default: 1026.

initializer_range (float): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. Default: 0.02.

layer_norm_eps (float): The epsilon used by the layer normalization layers. Default: 1e-12.
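As a sanity check on these defaults, the parameter count can be estimated by hand. This is a rough sketch that assumes a vanilla BERT layout (word, position, and token-type embeddings, a post-attention and post-feed-forward LayerNorm per layer, and a pooler) and excludes any task head. The `bert_param_count` helper is hypothetical, and using 512 positions for the .510 variants (510 tokens plus two special tokens) is an assumption.

```python
def bert_param_count(vocab_size=26, hidden_size=512, num_hidden_layers=6,
                     intermediate_size=2048, max_position_embeddings=1026,
                     type_vocab_size=2):
    """Estimate trainable parameters for a vanilla BERT encoder with pooler."""
    h, f = hidden_size, intermediate_size
    layer_norm = 2 * h  # weight and bias
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm  # query, key, value, output projections
    feed_forward = (h * f + f) + (f * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


full = bert_param_count()
short = bert_param_count(max_position_embeddings=512)  # assumed for the .510 variants
print(round(full / 1e6, 2), round(short / 1e6, 2))
```

Under these assumptions the estimates round to 19.72 M and 19.45 M, which agrees with the figures in the Model Specification table. The number of attention heads does not enter the count, because the heads only partition the hidden dimension.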

Examples:

Python Console Session
>>> from multimolecule import SpliceBertModel, SpliceBertConfig\n>>> # Initializing a SpliceBERT multimolecule/splicebert style configuration\n>>> configuration = SpliceBertConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/splicebert style configuration\n>>> model = SpliceBertModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/splicebert/configuration_splicebert.py Python
class SpliceBertConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a\n    [`SpliceBertModel`][multimolecule.models.SpliceBertModel]. It is used to instantiate a SpliceBert model according\n    to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will\n    yield a similar configuration to that of the SpliceBert\n    [biomed-AI/SpliceBERT](https://github.com/biomed-AI/SpliceBERT) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the SpliceBert model. Defines the number of different tokens that can be represented by\n            the `inputs_ids` passed when calling [`SpliceBertModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n\n    Examples:\n        >>> from multimolecule import SpliceBertModel, SpliceBertConfig\n        >>> # Initializing a SpliceBERT multimolecule/splicebert style configuration\n        >>> configuration = SpliceBertConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/splicebert style configuration\n        >>> model = SpliceBertModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"splicebert\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        hidden_size: int = 512,\n        num_hidden_layers: int = 6,\n        num_attention_heads: int = 16,\n        intermediate_size: int = 2048,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1026,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"absolute\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.type_vocab_size = 2\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n   
     self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForContactPrediction","title":"SpliceBertForContactPrediction","text":"

Bases: SpliceBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import SpliceBertConfig, SpliceBertForContactPrediction, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
class SpliceBertForContactPrediction(SpliceBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import SpliceBertConfig, SpliceBertForContactPrediction, RnaTokenizer\n        >>> config = SpliceBertConfig()\n        >>> model = SpliceBertForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: SpliceBertConfig):\n        super().__init__(config)\n        self.splicebert = SpliceBertModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.splicebert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForMaskedLM","title":"SpliceBertForMaskedLM","text":"

Bases: SpliceBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import SpliceBertConfig, SpliceBertForMaskedLM, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
class SpliceBertForMaskedLM(SpliceBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import SpliceBertConfig, SpliceBertForMaskedLM, RnaTokenizer\n        >>> config = SpliceBertConfig()\n        >>> model = SpliceBertForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.bias\", \"lm_head.decoder.weight\"]\n\n    def __init__(self, config: SpliceBertConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `SpliceBertForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.splicebert = SpliceBertModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.lm_head.decoder = new_embeddings\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None 
= None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.splicebert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
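The `if not return_dict` branch above is shared by the task-specific models in this module. A minimal sketch of the tuple it builds, using plain strings as stand-ins for tensors; `pack_outputs` and the placeholder values are hypothetical and exist only for this illustration.

```python
from typing import Optional, Tuple


def pack_outputs(loss: Optional[float], logits: object, outputs: Tuple) -> Tuple:
    # Skip the first two backbone entries (sequence output and pooled output)
    # and keep whatever follows, e.g. hidden states and attentions.
    output = (logits,) + tuple(outputs[2:])
    # The loss is prepended only when labels were given and a loss was computed.
    return ((loss,) + output) if loss is not None else output


# Stand-ins for the tuple returned by the backbone when return_dict is False.
backbone = ('sequence_output', 'pooled_output', 'hidden_states', 'attentions')

without_loss = pack_outputs(None, 'logits', backbone)
with_loss = pack_outputs(0.5, 'logits', backbone)

print(without_loss)  # ('logits', 'hidden_states', 'attentions')
print(with_loss)     # (0.5, 'logits', 'hidden_states', 'attentions')
```

Because the position of `logits` shifts when a loss is present, callers that index into the tuple form may prefer `return_dict=True` and access fields by name.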
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForSequencePrediction","title":"SpliceBertForSequencePrediction","text":"

Bases: SpliceBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import SpliceBertConfig, SpliceBertForSequencePrediction, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
class SpliceBertForSequencePrediction(SpliceBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import SpliceBertConfig, SpliceBertForSequencePrediction, RnaTokenizer\n        >>> config = SpliceBertConfig()\n        >>> model = SpliceBertForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: SpliceBertConfig):\n        super().__init__(config)\n        self.splicebert = SpliceBertModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.splicebert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForTokenPrediction","title":"SpliceBertForTokenPrediction","text":"

Bases: SpliceBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import SpliceBertConfig, SpliceBertForTokenPrediction, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
class SpliceBertForTokenPrediction(SpliceBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import SpliceBertConfig, SpliceBertForTokenPrediction, RnaTokenizer\n        >>> config = SpliceBertConfig()\n        >>> model = SpliceBertForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: SpliceBertConfig):\n        super().__init__(config)\n        self.splicebert = SpliceBertModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.splicebert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel","title":"SpliceBertModel","text":"

Bases: SpliceBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import SpliceBertConfig, SpliceBertModel, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 512])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 512])\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
class SpliceBertModel(SpliceBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import SpliceBertConfig, SpliceBertModel, RnaTokenizer\n        >>> config = SpliceBertConfig()\n        >>> model = SpliceBertModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 512])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 512])\n    \"\"\"\n\n    def __init__(self, config: SpliceBertConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = SpliceBertEmbeddings(config)\n        self.encoder = SpliceBertEncoder(config)\n        self.pooler = SpliceBertPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. 
Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, 
input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            
encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

Name Type Description Default

encoder_hidden_states Tensor | None

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

None

encoder_attention_mask Tensor | None

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]:

1 for tokens that are not masked,

0 for tokens that are masked.

None

past_key_values Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None

Tuple of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

None

use_cache bool | None

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

None

Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
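When no `attention_mask` is passed, the method above derives one from `input_ids` by comparing each token against `pad_token_id`. A plain-list sketch of that fallback, without tensors; the token ids and the pad id of 0 are made-up values for illustration, and the sketch leaves out the `past_key_values_length` term used in the all-ones branch.

```python
from typing import List, Optional


def default_attention_mask(input_ids: List[List[int]], pad_token_id: Optional[int]) -> List[List[int]]:
    # List analogue of input_ids.ne(pad_token_id): 1 marks a real token, 0 marks padding.
    if pad_token_id is not None:
        return [[int(tok != pad_token_id) for tok in row] for row in input_ids]
    # Without a pad token id, every position is attended to.
    return [[1] * len(row) for row in input_ids]


# Hypothetical token ids; 0 is assumed to be the padding id here.
batch = [[1, 6, 7, 2], [1, 6, 2, 0]]

masked = default_attention_mask(batch, pad_token_id=0)
unmasked = default_attention_mask(batch, pad_token_id=None)

print(masked)    # [[1, 1, 1, 1], [1, 1, 1, 0]]
print(unmasked)  # [[1, 1, 1, 1], [1, 1, 1, 1]]
```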
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertPreTrainedModel","title":"SpliceBertPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
class SpliceBertPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = SpliceBertConfig\n    base_model_prefix = \"splicebert\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"SpliceBertLayer\", \"SpliceBertEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n    def _set_gradient_checkpointing(self, module, value=False):\n        if isinstance(module, SpliceBertEncoder):\n            module.gradient_checkpointing = value\n
"},{"location":"models/utrbert/","title":"3UTRBERT","text":"

Pre-trained model on 3\u2019 untranslated region (3\u2019UTR) using a masked language modeling (MLM) objective.

"},{"location":"models/utrbert/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of Deciphering 3\u2019 UTR mediated gene regulation using interpretable deep representation learning by Yuning Yang, Gen Li, et al.

The OFFICIAL repository of 3UTRBERT is at yangyn533/3UTRBERT.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team releasing 3UTRBERT did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/utrbert/#model-details","title":"Model Details","text":"

3UTRBERT is a BERT-style model pre-trained on a large corpus of 3\u2019 untranslated regions (3\u2019UTRs) in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/utrbert/#variations","title":"Variations","text":""},{"location":"models/utrbert/#model-specification","title":"Model Specification","text":"Variants Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens UTRBERT-3mer 12 768 12 3072 86.14 22.36 11.17 512 UTRBERT-4mer 86.53 UTRBERT-5mer 88.45 UTRBERT-6mer 98.05"},{"location":"models/utrbert/#links","title":"Links","text":""},{"location":"models/utrbert/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/utrbert/#direct-use","title":"Direct Use","text":"

Note: The default transformers pipeline does not support k-mer tokenization.

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/utrbert-3mer\")\n>>> unmasker(\"gguc<mask><mask><mask>cugguuagaccagaucugagccu\")[1]\n\n[{'score': 0.40745577216148376,\n  'token': 47,\n  'token_str': 'CUC',\n  'sequence': '<cls> GGU GUC <mask> CUC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.40001827478408813,\n  'token': 32,\n  'token_str': 'CAC',\n  'sequence': '<cls> GGU GUC <mask> CAC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.14566268026828766,\n  'token': 37,\n  'token_str': 'CCC',\n  'sequence': '<cls> GGU GUC <mask> CCC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.04422207176685333,\n  'token': 42,\n  'token_str': 'CGC',\n  'sequence': '<cls> GGU GUC <mask> CGC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.0008025980787351727,\n  'token': 34,\n  'token_str': 'CAU',\n  'sequence': '<cls> GGU GUC <mask> CAU <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'}]\n
"},{"location":"models/utrbert/#downstream-use","title":"Downstream Use","text":""},{"location":"models/utrbert/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, UtrBertModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertModel.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/utrbert/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, UtrBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertForSequencePrediction.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrbert/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, UtrBertForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertForTokenPrediction.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrbert/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, UtrBertForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertForContactPrediction.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrbert/#training-details","title":"Training Details","text":"

3UTRBERT used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
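The masking step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the original training code: the helper name `mask_tokens`, the fixed seed, the choice to mask an exact fraction of positions, and the choice to always substitute the mask token are all hypothetical simplifications.

```python
import random
from typing import List, Tuple


def mask_tokens(
    tokens: List[str],
    mask_prob: float = 0.15,
    mask_token: str = '<mask>',
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    # Assumption: mask a fixed fraction of positions and always substitute
    # the mask token; the original recipe may differ in both respects.
    if not tokens:
        return [], []
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = mask_token
    return masked, positions


tokens = list('UAGCUUAUCAGACUGAUGUU')  # 20 single-nucleotide tokens
masked, positions = mask_tokens(tokens)
# The training labels are the original tokens at the masked positions.
labels = [tokens[pos] for pos in positions]
print(len(positions), masked, labels)
```

With 20 tokens and a masking rate of 15%, three positions are masked; the model is then trained to recover the original tokens at those positions.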

"},{"location":"models/utrbert/#training-data","title":"Training Data","text":"

The 3UTRBERT model was pre-trained on human mRNA transcript sequences from GENCODE. GENCODE aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. The GENCODE release 40 used by this work contains 61,544 genes, and 246,624 transcripts.

3UTRBERT collected the human mRNA transcript sequences from GENCODE, including 108,573 unique mRNA transcripts. Only the longest transcript of each gene was used in the pre-training process. 3UTRBERT only used the 3\u2019 untranslated regions (3\u2019UTRs) of the mRNA transcripts for pre-training, to avoid codon constraints in the CDS region and to reduce the complexity of modeling entire mRNA transcripts. The average length of the 3\u2019UTRs was 1,227 nucleotides, while the median length was 631 nucleotides. Each 3\u2019UTR sequence was cut into non-overlapping patches of 510 nucleotides. The remaining sequences were padded to the same length.

Note: RnaTokenizer will convert \u201cT\u201ds to \u201cU\u201ds for you. You may disable this behaviour by passing replace_T_with_U=False.
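The patching step can be sketched as follows. This is a minimal illustration rather than the original preprocessing code: the function name `split_into_patches` is hypothetical, and padding of the final, shorter patch is assumed to happen later at tokenization or batching time.

```python
from typing import List


def split_into_patches(sequence: str, patch_size: int = 510) -> List[str]:
    # Cut the sequence into consecutive, non-overlapping patches.
    # The last patch may be shorter than patch_size; padding it to a
    # common length is assumed to be handled downstream.
    if patch_size <= 0:
        raise ValueError('patch_size must be positive')
    return [sequence[i:i + patch_size] for i in range(0, len(sequence), patch_size)]


# A toy sequence of 1,227 nucleotides (the average length reported above).
utr = 'ACGU' * 306 + 'ACG'
patches = split_into_patches(utr)
print([len(p) for p in patches])  # [510, 510, 207]
```

A sequence of average length therefore yields two full patches and one shorter patch that needs padding.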

"},{"location":"models/utrbert/#training-procedure","title":"Training Procedure","text":""},{"location":"models/utrbert/#preprocessing","title":"Preprocessing","text":"

3UTRBERT used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:

Since 3UTRBERT uses a k-mer tokenizer, it masks entire k-mers instead of individual nucleotides to avoid information leakage.

For example, if k is 3, the sequence \"UAGCGUAU\" will be tokenized as [\"UAG\", \"AGC\", \"GCG\", \"CGU\", \"GUA\", \"UAU\"]. If the nucleotide \"C\" is masked, all tokens containing it will also be masked, resulting in [\"UAG\", \"<mask>\", \"<mask>\", \"<mask>\", \"GUA\", \"UAU\"].
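The example above can be reproduced with a short sketch. The helpers `kmer_tokenize` and `mask_nucleotide` are hypothetical names introduced here for illustration; they mirror the overlapping, stride-1 k-mer split described in this section and are not part of the multimolecule API.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 3) -> List[str]:
    # Overlapping k-mers with stride 1, as in the example above.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mask_nucleotide(
    sequence: str,
    position: int,
    k: int = 3,
    mask_token: str = '<mask>',
) -> List[str]:
    # Token j covers nucleotides j .. j + k - 1, so every token with
    # position - k + 1 <= j <= position contains the masked nucleotide.
    tokens = kmer_tokenize(sequence, k)
    first = max(0, position - k + 1)
    last = min(len(tokens) - 1, position)
    for j in range(first, last + 1):
        tokens[j] = mask_token
    return tokens


print(kmer_tokenize('UAGCGUAU'))
# ['UAG', 'AGC', 'GCG', 'CGU', 'GUA', 'UAU']
print(mask_nucleotide('UAGCGUAU', position=3))
# ['UAG', '<mask>', '<mask>', '<mask>', 'GUA', 'UAU']
```

Masking a single nucleotide hides up to k tokens, because with overlapping k-mers any unmasked neighbour that contains the nucleotide would reveal it.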

"},{"location":"models/utrbert/#pretraining","title":"PreTraining","text":"

The model was trained on 4 NVIDIA Quadro RTX 6000 GPUs, each with 24 GiB of memory.

"},{"location":"models/utrbert/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article {yang2023deciphering,\n    author = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Li, Xiangtao and Zhang, Zhaolei},\n    title = {Deciphering 3{\\textquoteright} UTR mediated gene regulation using interpretable deep representation learning},\n    elocation-id = {2023.09.08.556883},\n    year = {2023},\n    doi = {10.1101/2023.09.08.556883},\n    publisher = {Cold Spring Harbor Laboratory},\n    abstract = {The 3{\\textquoteright}untranslated regions (3{\\textquoteright}UTRs) of messenger RNAs contain many important cis-regulatory elements that are under functional and evolutionary constraints. We hypothesize that these constraints are similar to grammars and syntaxes in human languages and can be modeled by advanced natural language models such as Transformers, which has been very effective in modeling protein sequence and structures. Here we describe 3UTRBERT, which implements an attention-based language model, i.e., Bidirectional Encoder Representations from Transformers (BERT). 3UTRBERT was pre-trained on aggregated 3{\\textquoteright}UTR sequences of human mRNAs in a task-agnostic manner; the pre-trained model was then fine-tuned for specific downstream tasks such as predicting RBP binding sites, m6A RNA modification sites, and predicting RNA sub-cellular localizations. Benchmark results showed that 3UTRBERT generally outperformed other contemporary methods in each of these tasks. We also showed that the self-attention mechanism within 3UTRBERT allows direct visualization of the semantic relationship between sequence elements.Competing Interest StatementThe authors have declared no competing interest.},\n    URL = {https://www.biorxiv.org/content/early/2023/09/12/2023.09.08.556883},\n    eprint = {https://www.biorxiv.org/content/early/2023/09/12/2023.09.08.556883.full.pdf},\n    journal = {bioRxiv}\n}\n
"},{"location":"models/utrbert/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the 3UTRBERT paper for questions or comments on the paper/model.

"},{"location":"models/utrbert/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert","title":"multimolecule.models.utrbert","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

Name (Type, Default): Description

alphabet (Alphabet | str | List[str] | None, default None): alphabet to use for tokenization.

nmers (int, default 1): Size of kmer to tokenize.

codon (bool, default False): Whether to tokenize into codons.

replace_T_with_U (bool, default True): Whether to replace T with U.

do_upper_case (bool, default True): Whether to convert input to uppercase.

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig","title":"UtrBertConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a UtrBertModel. It is used to instantiate a 3UTRBERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the 3UTRBERT yangyn533/3UTRBERT architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

Name (Type, Default): Description

vocab_size (int | None, default None): Vocabulary size of the UTRBERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling [BertModel].

nmers (int | None, default None): kmer size of the UTRBERT model. Defines the vocabulary size of the model.

hidden_size (int, default 768): Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, default 12): Number of hidden layers in the Transformer encoder.

num_attention_heads (int, default 12): Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, default 3072): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.

hidden_act (str, default 'gelu'): The non-linear activation function (function or string) in the encoder and pooler. If string, \"gelu\", \"relu\", \"silu\" and \"gelu_new\" are supported.

hidden_dropout (float, default 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, default 0.1): The dropout ratio for the attention probabilities.

max_position_embeddings (int, default 512): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, default 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, default 1e-12): The epsilon used by the layer normalization layers.

position_embedding_type (str, default 'absolute'): Type of position embedding. Choose one of \"absolute\", \"relative_key\", \"relative_key_query\". For positional embeddings use \"absolute\". For more information on \"relative_key\", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, default False): Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, default True): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

Examples:

Python Console Session
>>> from multimolecule import UtrBertConfig, UtrBertModel\n>>> # Initializing a UtrBERT multimolecule/utrbert style configuration\n>>> configuration = UtrBertConfig(vocab_size=26, nmers=1)\n>>> # Initializing a model (with random weights) from the multimolecule/utrbert style configuration\n>>> model = UtrBertModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/utrbert/configuration_utrbert.py Python
class UtrBertConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`UtrBertModel`][multimolecule.models.UtrBertModel].\n    It is used to instantiate a 3UTRBERT model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the 3UTRBERT\n    [yangyn533/3UTRBERT](https://github.com/yangyn533/3UTRBERT) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the UTRBERT model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`BertModel`].\n        nmers:\n            kmer size of the UTRBERT model. Defines the vocabulary size of the model.\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_act:\n            The non-linear activation function (function or string) in the encoder and pooler. 
If string, `\"gelu\"`,\n            `\"relu\"`, `\"silu\"` and `\"gelu_new\"` are supported.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        position_embedding_type:\n            Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\"`. For\n            positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n            For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n        is_decoder:\n            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n        use_cache:\n            Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n            relevant if `config.is_decoder=True`.\n\n    Examples:\n        >>> from multimolecule import UtrBertConfig, UtrBertModel\n        >>> # Initializing a UtrBERT multimolecule/utrbert style configuration\n        >>> configuration = UtrBertConfig(vocab_size=26, nmers=1)\n        >>> # Initializing a model (with random weights) from the multimolecule/utrbert style configuration\n        >>> model = UtrBertModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"utrbert\"\n\n    def __init__(\n        self,\n        vocab_size: int | None = None,\n        nmers: int | None = None,\n        hidden_size: int = 768,\n        num_hidden_layers: int = 12,\n        num_attention_heads: int = 12,\n        intermediate_size: int = 3072,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 512,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"absolute\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.type_vocab_size = 2\n        self.nmers = nmers\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        
self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(nmers)","title":"nmers","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(hidden_act)","title":"hidden_act","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(position_embedding_type)","title":"position_embedding_type","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(is_decoder)","title":"is_decoder","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(use_cache)","title":"use_cache","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForContactPrediction","title":"UtrBertForCont
actPrediction","text":"

Bases: UtrBertPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import UtrBertConfig, UtrBertForContactPrediction, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=1)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertForContactPrediction(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
class UtrBertForContactPrediction(UtrBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrBertConfig, UtrBertForContactPrediction, RnaTokenizer\n        >>> tokenizer = RnaTokenizer(nmers=1)\n        >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n        >>> model = UtrBertForContactPrediction(config)\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: UtrBertConfig):\n        super().__init__(config)\n        self.utrbert = UtrBertModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrbert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForMaskedLM","title":"UtrBertForMaskedLM","text":"

Bases: UtrBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrBertConfig, UtrBertForMaskedLM, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=2)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertForMaskedLM(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 6, 31])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
class UtrBertForMaskedLM(UtrBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrBertConfig, UtrBertForMaskedLM, RnaTokenizer\n        >>> tokenizer = RnaTokenizer(nmers=2)\n        >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n        >>> model = UtrBertForMaskedLM(config)\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 6, 31])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: UtrBertConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `BertForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.utrbert = UtrBertModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.lm_head.decoder = new_embeddings\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    
) -> Tuple[Tensor, ...] | MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrbert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForSequencePrediction","title":"UtrBertForSequencePrediction","text":"

Bases: UtrBertPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import UtrBertConfig, UtrBertForSequencePrediction, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=4)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertForSequencePrediction(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
class UtrBertForSequencePrediction(UtrBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrBertConfig, UtrBertForSequencePrediction, RnaTokenizer\n        >>> tokenizer = RnaTokenizer(nmers=4)\n        >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n        >>> model = UtrBertForSequencePrediction(config)\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: UtrBertConfig):\n        super().__init__(config)\n        self.utrbert = UtrBertModel(config)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrbert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForTokenPrediction","title":"UtrBertForTokenPrediction","text":"

Bases: UtrBertPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import UtrBertConfig, UtrBertForTokenPrediction, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=2)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size, nmers=2)\n>>> model = UtrBertForTokenPrediction(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
class UtrBertForTokenPrediction(UtrBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrBertConfig, UtrBertForTokenPrediction, RnaTokenizer\n        >>> tokenizer = RnaTokenizer(nmers=2)\n        >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size, nmers=2)\n        >>> model = UtrBertForTokenPrediction(config)\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: UtrBertConfig):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.utrbert = UtrBertModel(config, add_pooling_layer=False)\n        self.token_head = TokenKMerHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrbert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel","title":"UtrBertModel","text":"

Bases: UtrBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrBertConfig, UtrBertModel, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=1)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertModel(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
class UtrBertModel(UtrBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrBertConfig, UtrBertModel, RnaTokenizer\n        >>> tokenizer = RnaTokenizer(nmers=1)\n        >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n        >>> model = UtrBertModel(config)\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 768])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 768])\n    \"\"\"\n\n    def __init__(self, config: UtrBertConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = UtrBertEmbeddings(config)\n        self.encoder = UtrBertEncoder(config)\n        self.pooler = UtrBertPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        
else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape 
[num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

Name Type Description Default

encoder_hidden_states Tensor | None

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

None

encoder_attention_mask Tensor | None

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]:

1 for tokens that are not masked,

0 for tokens that are masked.

None

past_key_values Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None

Tuple of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

None

use_cache bool | None

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

None

Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertPreTrainedModel","title":"UtrBertPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
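
The weight initialization handled by this class can be sketched in isolation. The snippet below is a minimal illustration that assumes only PyTorch; the standalone function name init_weights, the default initializer_range of 0.02, and the toy module are assumptions made for this example and are not part of the MultiMolecule API.

```python
import torch
from torch import nn


def init_weights(module: nn.Module, initializer_range: float = 0.02) -> None:
    # Hypothetical standalone helper; 0.02 is an assumed default, not read from UtrBertConfig.
    if isinstance(module, nn.Linear):
        # Linear weights are drawn from a normal distribution, biases are zeroed.
        module.weight.data.normal_(mean=0.0, std=initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.Embedding):
        # Embedding weights are drawn from the same distribution; the padding row is zeroed.
        module.weight.data.normal_(mean=0.0, std=initializer_range)
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
    elif isinstance(module, nn.LayerNorm):
        # LayerNorm starts as an identity transform: zero bias, unit weight.
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)


# Toy module used only to exercise the helper.
toy = nn.Sequential(nn.Embedding(10, 8, padding_idx=0), nn.Linear(8, 8), nn.LayerNorm(8))
toy.apply(init_weights)  # nn.Module.apply visits every submodule recursively
```

Calling apply on the top-level module is enough here because it walks all submodules, which is why a single per-module function can initialize a whole network.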

Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
class UtrBertPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = UtrBertConfig\n    base_model_prefix = \"utrbert\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"UtrBertLayer\", \"UtrBertEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/utrlm/","title":"UTR-LM","text":"

Pre-trained model on 5\u2019 untranslated region (5\u2019UTR) using masked language modeling (MLM), Secondary Structure (SS), and Minimum Free Energy (MFE) objectives.

"},{"location":"models/utrlm/#statement","title":"Statement","text":"

A 5\u2019 UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.

Machine learning has been at the forefront of the movement for free and open access to research.

We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.

The MultiMolecule team is committed to the principles of open access and open science.

We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.

Please consider signing the Statement on Nature Machine Intelligence.

"},{"location":"models/utrlm/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of A 5\u2019 UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions by Yanyi Chu, Dan Yu, et al.

The OFFICIAL repository of UTR-LM is at a96123155/UTR-LM.

Warning

The MultiMolecule team is unable to confirm that the provided model and checkpoints produce the same intermediate representations as the original implementation, because the proposed method is published in a Closed Access / Author-Fee journal.

The team releasing UTR-LM did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/utrlm/#model-details","title":"Model Details","text":"

UTR-LM is a BERT-style model pre-trained on a large corpus of 5\u2019 untranslated regions (5\u2019UTRs) in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/utrlm/#variations","title":"Variations","text":""},{"location":"models/utrlm/#model-specification","title":"Model Specification","text":"Variants Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens UTR-LM MRL 6 128 16 512 1.21 0.35 0.18 1022 UTR-LM TE_EL"},{"location":"models/utrlm/#links","title":"Links","text":""},{"location":"models/utrlm/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/utrlm/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/utrlm-te_el\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.07707168161869049,\n  'token': 23,\n  'token_str': '*',\n  'sequence': 'G G U C * C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07588472962379456,\n  'token': 5,\n  'token_str': '<null>',\n  'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07178673148155212,\n  'token': 9,\n  'token_str': 'U',\n  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.06414645165205002,\n  'token': 10,\n  'token_str': 'N',\n  'sequence': 'G G U C N C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.06385370343923569,\n  'token': 12,\n  'token_str': 'Y',\n  'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/utrlm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/utrlm/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, UtrLmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmModel.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/utrlm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, UtrLmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForSequencePrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrlm/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, UtrLmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForTokenPrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrlm/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, UtrLmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForContactPrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
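The example above uses random labels. In practice, contact labels are often derived from a known or predicted secondary structure. The following is a minimal sketch, assuming the structure is given in dot-bracket notation (the format ViennaRNA reports); `dot_bracket_to_contacts` is a hypothetical helper and not part of multimolecule. It ignores pseudoknots and other bracket types for simplicity.

```python
def dot_bracket_to_contacts(structure):
    # Hypothetical helper (not part of multimolecule): convert a dot-bracket
    # string into a symmetric binary contact matrix of shape (len, len).
    n = len(structure)
    contacts = [[0] * n for _ in range(n)]
    open_positions = []
    for j, symbol in enumerate(structure):
        if symbol == '(':
            open_positions.append(j)
        elif symbol == ')':
            if not open_positions:
                raise ValueError('unbalanced structure: unexpected closing bracket')
            i = open_positions.pop()
            # A base pair is a contact in both directions.
            contacts[i][j] = 1
            contacts[j][i] = 1
    if open_positions:
        raise ValueError('unbalanced structure: unclosed opening bracket')
    return contacts


contact_map = dot_bracket_to_contacts('((..))')
for row in contact_map:
    print(row)
```

The resulting nested list can be wrapped with `torch.tensor(...)` to obtain a label of the same (len(text), len(text)) shape as the random label used above.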
"},{"location":"models/utrlm/#training-details","title":"Training Details","text":"

UTR-LM used a mixed training strategy with one self-supervised task and two supervised tasks, where the labels of both supervised tasks are calculated using ViennaRNA.

  1. Masked Language Modeling (MLM): given a sequence, 15% of the input tokens are randomly masked; the entire masked sequence is then run through the model, which has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
  2. Secondary Structure (SS): predicting the secondary structure at the positions of the <mask> tokens from the MLM task.
  3. Minimum Free Energy (MFE): predicting the minimum free energy of the 5\u2019 UTR sequence.
"},{"location":"models/utrlm/#training-data","title":"Training Data","text":"

The UTR-LM model was pre-trained on 5\u2019 UTR sequences from three sources:

UTR-LM preprocessed the 5\u2019 UTR sequences in a 4-step pipeline:

  1. Removed all coding sequence (CDS) and non-5\u2019 UTR fragments from the raw sequences.
  2. Identified and removed duplicate sequences.
  3. Truncated the sequences to fit within a range of 30 to 1022 bp.
  4. Filtered out incorrect and low-quality sequences.
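Steps 2 to 4 of the pipeline above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: `preprocess_utrs` is a hypothetical name, sequences shorter than the minimum length are assumed to be dropped, over-long sequences are assumed to keep their 3\u2019-most nucleotides, and only sequences made of A, C, G, U are assumed to count as valid. Step 1 (removing CDS fragments) needs annotation data and is left out.

```python
VALID_NUCLEOTIDES = set('ACGU')


def preprocess_utrs(sequences, min_len=30, max_len=1022):
    # Hypothetical sketch of steps 2-4; the real criteria may differ.
    seen = set()
    kept = []
    for raw in sequences:
        seq = raw.strip().upper().replace('T', 'U')
        # Step 2: skip exact duplicates.
        if seq in seen:
            continue
        seen.add(seq)
        # Step 3: enforce the 30 to 1022 nt length range.
        # Assumption: too-short sequences are dropped, too-long ones
        # keep only their last max_len nucleotides.
        if len(seq) < min_len:
            continue
        seq = seq[-max_len:]
        # Step 4: stand-in quality filter, unambiguous nucleotides only.
        if set(seq) <= VALID_NUCLEOTIDES:
            kept.append(seq)
    return kept


examples = ['A' * 40, 'a' * 40, 'ACGT' * 300, 'ACGU' * 5, 'N' * 40]
cleaned = preprocess_utrs(examples)
print([len(seq) for seq in cleaned])
```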

Note: RnaTokenizer will convert \u201cT\u201ds to \u201cU\u201ds for you. You may disable this behaviour by passing replace_T_with_U=False.

"},{"location":"models/utrlm/#training-procedure","title":"Training Procedure","text":""},{"location":"models/utrlm/#preprocessing","title":"Preprocessing","text":"

UTR-LM used masked language modeling (MLM) as one of the pre-training objectives. The masking procedure is similar to the one used in BERT:
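As a rough illustration of BERT-style masking, the sketch below selects each non-special token with probability 15%, then replaces 80% of the selected tokens with the mask token, 10% with a random nucleotide, and leaves 10% unchanged. The 80/10/10 split is BERT's procedure and is only assumed to carry over to UTR-LM. `mask_tokens` is a hypothetical helper, and the token IDs follow the default RnaTokenizer vocabulary shown in the tokenizer examples of this documentation.

```python
import random

MASK_ID = 4                        # '<mask>'
SPECIAL_IDS = {0, 1, 2, 3, 4, 5}   # <pad>, <cls>, <eos>, <unk>, <mask>, <null>
NUCLEOTIDE_IDS = [6, 7, 8, 9]      # A, C, G, U
IGNORE_INDEX = -100                # label value ignored by PyTorch cross-entropy by default


def mask_tokens(input_ids, mask_prob=0.15, seed=None):
    # Hypothetical helper: returns (masked_ids, labels) for one sequence.
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        # Each non-special token is selected independently, so the
        # masked fraction is 15% in expectation rather than exactly.
        if token in SPECIAL_IDS or rng.random() >= mask_prob:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID
        elif roll < 0.9:
            # May occasionally pick the original nucleotide again.
            masked[i] = rng.choice(NUCLEOTIDE_IDS)
        # Otherwise the token is left unchanged but still predicted.
    return masked, labels


ids = [1, 6, 7, 8, 9, 6, 7, 8, 9, 2]
masked_ids, labels = mask_tokens(ids, seed=0)
print(masked_ids)
print(labels)
```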

"},{"location":"models/utrlm/#pretraining","title":"PreTraining","text":"

The model was trained on two clusters:

  1. 4 NVIDIA V100 GPUs with 16 GiB of memory each.
  2. 4 NVIDIA P100 GPUs with 32 GiB of memory each.
"},{"location":"models/utrlm/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article {chu2023a,\n    author = {Chu, Yanyi and Yu, Dan and Li, Yupeng and Huang, Kaixuan and Shen, Yue and Cong, Le and Zhang, Jason and Wang, Mengdi},\n    title = {A 5{\\textquoteright} UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions},\n    elocation-id = {2023.10.11.561938},\n    year = {2023},\n    doi = {10.1101/2023.10.11.561938},\n    publisher = {Cold Spring Harbor Laboratory},\n    abstract = {The 5{\\textquoteright} UTR, a regulatory region at the beginning of an mRNA molecule, plays a crucial role in regulating the translation process and impacts the protein expression level. Language models have showcased their effectiveness in decoding the functions of protein and genome sequences. Here, we introduced a language model for 5{\\textquoteright} UTR, which we refer to as the UTR-LM. The UTR-LM is pre-trained on endogenous 5{\\textquoteright} UTRs from multiple species and is further augmented with supervised information including secondary structure and minimum free energy. We fine-tuned the UTR-LM in a variety of downstream tasks. The model outperformed the best-known benchmark by up to 42\\% for predicting the Mean Ribosome Loading, and by up to 60\\% for predicting the Translation Efficiency and the mRNA Expression Level. The model also applies to identifying unannotated Internal Ribosome Entry Sites within the untranslated region and improves the AUPR from 0.37 to 0.52 compared to the best baseline. Further, we designed a library of 211 novel 5{\\textquoteright} UTRs with high predicted values of translation efficiency and evaluated them via a wet-lab assay. 
Experiment results confirmed that our top designs achieved a 32.5\\% increase in protein production level relative to well-established 5{\\textquoteright} UTR optimized for therapeutics. Competing Interest Statement: The authors have declared no competing interest.},\n    URL = {https://www.biorxiv.org/content/early/2023/10/14/2023.10.11.561938},\n    eprint = {https://www.biorxiv.org/content/early/2023/10/14/2023.10.11.561938.full.pdf},\n    journal = {bioRxiv}\n}\n
"},{"location":"models/utrlm/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the UTR-LM paper for questions or comments on the paper/model.

"},{"location":"models/utrlm/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm","title":"multimolecule.models.utrlm","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default: None): Alphabet to use for tokenization.

nmers (int, default: 1): Size of kmer to tokenize.

codon (bool, default: False): Whether to tokenize into codons.

replace_T_with_U (bool, default: True): Whether to replace T with U.

do_upper_case (bool, default: True): Whether to convert input to uppercase.

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
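The IDs in the examples above can be reproduced with a small stand-alone sketch. It assumes the default single-nucleotide vocabulary is the six special tokens followed by the 20 symbols from the first example, in that order, and that every sequence is wrapped in <cls> and <eos>. `encode` is a hypothetical stand-in for illustration, not the library implementation.

```python
SPECIAL_TOKENS = ['<pad>', '<cls>', '<eos>', '<unk>', '<mask>', '<null>']
VOCAB = SPECIAL_TOKENS + list('ACGUNRYSWKMBDHV.X*-I')
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode(sequence, replace_T_with_U=True):
    # Hypothetical stand-in for RnaTokenizer with nmers=1 and codon=False.
    text = sequence.upper()
    if replace_T_with_U:
        text = text.replace('T', 'U')
    unk_id = TOKEN_TO_ID['<unk>']
    # Characters outside the vocabulary fall back to <unk>.
    body = [TOKEN_TO_ID.get(char, unk_id) for char in text]
    return [TOKEN_TO_ID['<cls>']] + body + [TOKEN_TO_ID['<eos>']]


print(len(VOCAB))
print(encode('acgu'))
print(encode('acgt', replace_T_with_U=False))
```

Because each sequence gains two special tokens, a 5-nucleotide input yields 7 positions, which is why the masked language modeling examples in this documentation report logits of shape [1, 7, 26].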
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
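The two multi-nucleotide branches of `_tokenize` differ only in stride: k-mers use a sliding window that advances one nucleotide at a time, while codons use non-overlapping windows of three. The sketch below isolates that difference; `split_kmers` and `split_codons` are hypothetical names used for illustration.

```python
def split_kmers(text, k=3):
    # Overlapping windows: a sequence of length n gives n - k + 1 tokens.
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def split_codons(text):
    # Non-overlapping windows of three; the length must be a multiple of 3.
    if len(text) % 3 != 0:
        raise ValueError(
            f'length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}'
        )
    return [text[i:i + 3] for i in range(0, len(text), 3)]


print(split_kmers('UAGCUUAUC'))
print(split_codons('UAGCUUAUC'))
```

This matches the examples above, where the 9-nucleotide input produces 7 IDs with nmers=3 and 3 IDs with codon=True, and where the codon IDs (83, 49, 22) are the first, fourth and seventh of the k-mer IDs.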
"},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig","title":"UtrLmConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a UtrLmModel. It is used to instantiate a UTR-LM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the UTR-LM a96123155/UTR-LM architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int, default: 26): Vocabulary size of the UTR-LM model. Defines the number of different tokens that can be represented by the input_ids passed when calling UtrLmModel.

hidden_size (int, default: 128): Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, default: 6): Number of hidden layers in the Transformer encoder.

num_attention_heads (int, default: 16): Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, default: 512): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.

hidden_dropout (float, default: 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, default: 0.1): The dropout ratio for the attention probabilities.

max_position_embeddings (int, default: 1026): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, default: 1e-12): The epsilon used by the layer normalization layers.

position_embedding_type (str, default: 'rotary'): Type of position embedding. Choose one of \"absolute\", \"relative_key\", \"relative_key_query\", \"rotary\". For positional embeddings use \"absolute\". For more information on \"relative_key\", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, default: False): Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, default: True): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

emb_layer_norm_before (bool, default: False): Whether to apply layer normalization after embeddings but before the main stem of the network.

token_dropout (bool, default: False): When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
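To get a feel for what the default values imply, the sketch below estimates the encoder size from them. It assumes a standard Transformer encoder layout (four biased attention projections, a two-layer feed-forward block, and two LayerNorms per layer) and that rotary position embeddings add no learned weights; these are assumptions about the implementation, and `estimate_encoder_parameters` is a hypothetical helper. Prediction heads and any final normalization are not counted.

```python
def estimate_encoder_parameters(vocab_size=26, hidden_size=128, num_hidden_layers=6, intermediate_size=512):
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Feed-forward block: hidden -> intermediate -> hidden, with biases.
    feed_forward = (hidden_size * intermediate_size + intermediate_size) + (
        intermediate_size * hidden_size + hidden_size
    )
    # Two LayerNorms per layer, each with a weight and a bias vector.
    layer_norms = 2 * (2 * hidden_size)
    per_layer = attention + feed_forward + layer_norms
    # Token embedding table only; rotary embeddings are assumed parameter-free.
    embeddings = vocab_size * hidden_size
    return embeddings + num_hidden_layers * per_layer


head_dim = 128 // 16  # hidden_size // num_attention_heads
print(head_dim)                       # 8
print(estimate_encoder_parameters())  # 1192960
```

Under these assumptions the encoder alone comes to about 1.19 M parameters, in the same range as the 1.21 M listed for the 6-layer, 128-wide variant in the Model Specification; the remainder would come from components this estimate leaves out.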

Examples:

Python Console Session
>>> from multimolecule import UtrLmModel, UtrLmConfig\n>>> # Initializing a UTR-LM multimolecule/utrlm style configuration\n>>> configuration = UtrLmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/utrlm style configuration\n>>> model = UtrLmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/utrlm/configuration_utrlm.py Python
class UtrLmConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`UtrLmModel`][multimolecule.models.UtrLmModel].\n    It is used to instantiate a UTR-LM model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the UTR-LM\n    [a96123155/UTR-LM](https://github.com/a96123155/UTR-LM) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the UTR-LM model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`UtrLmModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        position_embedding_type:\n            Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n            For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n            For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n        is_decoder:\n            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n        use_cache:\n            Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n            relevant if `config.is_decoder=True`.\n        emb_layer_norm_before:\n            Whether to apply layer normalization after embeddings but before the main stem of the network.\n        token_dropout:\n            When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n    Examples:\n        >>> from multimolecule import UtrLmModel, UtrLmConfig\n        >>> # Initializing a UTR-LM multimolecule/utrlm style configuration\n        >>> configuration = UtrLmConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/utrlm style configuration\n        >>> model = UtrLmModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"utrlm\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        hidden_size: int = 128,\n        num_hidden_layers: int = 6,\n        num_attention_heads: int = 16,\n        intermediate_size: int = 512,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1026,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"rotary\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        emb_layer_norm_before: bool = False,\n        token_dropout: bool = False,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        ss_head: HeadConfig | None = None,\n        mfe_head: HeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = 
hidden_act\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.emb_layer_norm_before = emb_layer_norm_before\n        self.token_dropout = token_dropout\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n        self.ss_head = HeadConfig(**ss_head) if ss_head is not None else None\n        self.mfe_head = HeadConfig(**mfe_head) if mfe_head is not None else None\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(position_embedding_type)","title":"position_embedding_type","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(is_decoder)","title":"is_decoder","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(use_cache)","title":"use_cache","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(emb_layer_norm_before)","title":"emb_layer_norm_before","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(token_dropout)","title":"token_dropout","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForContactPrediction","title":"UtrLmForContactPrediction","text":"

Bases: UtrLmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import UtrLmConfig, UtrLmForContactPrediction, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmForContactPrediction(UtrLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrLmConfig, UtrLmForContactPrediction, RnaTokenizer\n        >>> config = UtrLmConfig()\n        >>> model = UtrLmForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: UtrLmConfig):\n        super().__init__(config)\n        self.utrlm = UtrLmModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrlm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForMaskedLM","title":"UtrLmForMaskedLM","text":"

Bases: UtrLmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrLmConfig, UtrLmForMaskedLM, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmForMaskedLM(UtrLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n        >>> config = UtrLmConfig()\n        >>> model = UtrLmForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: UtrLmConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `UtrLmForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.utrlm = UtrLmModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrlm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForPreTraining","title":"UtrLmForPreTraining","text":"

Bases: UtrLmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrLmConfig, RnaTokenizer\n>>> from multimolecule.models.utrlm import UtrLmForPreTraining\n>>> config = UtrLmConfig()\n>>> model = UtrLmForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"contact_map\"].shape\ntorch.Size([1, 5, 5, 1])\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmForPreTraining(UtrLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n        >>> config = UtrLmConfig()\n        >>> model = UtrLmForPreTraining(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<AddBackward0>)\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"contact_map\"].shape\n        torch.Size([1, 5, 5, 1])\n    \"\"\"\n\n    _tied_weights_keys = [\n        \"lm_head.decoder.weight\",\n        \"lm_head.decoder.bias\",\n        \"pretrain.predictions.decoder.weight\",\n        \"pretrain.predictions.decoder.bias\",\n        \"pretrain.predictions_ss.decoder.weight\",\n        \"pretrain.predictions_ss.decoder.bias\",\n    ]\n\n    def __init__(self, config: UtrLmConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `UtrLmForPreTraining` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.utrlm = UtrLmModel(config, add_pooling_layer=False)\n        self.pretrain = UtrLmPreTrainingHeads(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_output_embeddings(self):\n        return self.pretrain.predictions.decoder\n\n    def set_output_embeddings(self, embeddings):\n        self.pretrain.predictions.decoder = embeddings\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = 
None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels_mlm: Tensor | None = None,\n        labels_contact: Tensor | None = None,\n        labels_ss: Tensor | None = None,\n        labels_mfe: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | UtrLmForPreTrainingOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrlm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        total_loss, logits, contact_map, secondary_structure, minimum_free_energy = self.pretrain(\n            outputs,\n            attention_mask,\n            input_ids,\n            labels_mlm=labels_mlm,\n            labels_contact=labels_contact,\n            labels_ss=labels_ss,\n            labels_mfe=labels_mfe,\n        )\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return UtrLmForPreTrainingOutput(\n            loss=total_loss,\n            logits=logits,\n            contact_map=contact_map,\n            secondary_structure=secondary_structure,\n            minimum_free_energy=minimum_free_energy,\n           
 hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForSequencePrediction","title":"UtrLmForSequencePrediction","text":"

Bases: UtrLmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import UtrLmConfig, RnaTokenizer\n>>> from multimolecule.models.utrlm import UtrLmForSequencePrediction\n>>> config = UtrLmConfig()\n>>> model = UtrLmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmForSequencePrediction(UtrLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n        >>> config = UtrLmConfig()\n        >>> model = UtrLmForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: UtrLmConfig):\n        super().__init__(config)\n        self.utrlm = UtrLmModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrlm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForTokenPrediction","title":"UtrLmForTokenPrediction","text":"

Bases: UtrLmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import UtrLmConfig, RnaTokenizer\n>>> from multimolecule.models.utrlm import UtrLmForTokenPrediction\n>>> config = UtrLmConfig()\n>>> model = UtrLmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmForTokenPrediction(UtrLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n        >>> config = UtrLmConfig()\n        >>> model = UtrLmForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: UtrLmConfig):\n        super().__init__(config)\n        self.utrlm = UtrLmModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrlm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel","title":"UtrLmModel","text":"

Bases: UtrLmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 128])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 128])\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmModel(UtrLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n        >>> config = UtrLmConfig()\n        >>> model = UtrLmModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 128])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 128])\n    \"\"\"\n\n    def __init__(self, config: UtrLmConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = UtrLmEmbeddings(config)\n        self.encoder = UtrLmEncoder(config)\n        self.pooler = UtrLmPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        
else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape 
[num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

Name: encoder_hidden_states. Type: Tensor | None. Default: None.

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

Name: encoder_attention_mask. Type: Tensor | None. Default: None.

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values are selected in [0, 1]:

1 for tokens that are not masked,
0 for tokens that are masked.

Name: past_key_values. Type: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None. Default: None.

Tuple of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head).

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

Name: use_cache. Type: bool | None. Default: None.

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        attention_mask=attention_mask,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
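The default attention-mask handling in the forward method above can be illustrated with a small sketch in plain Python. It mirrors the convention stated in the docstring (1 for tokens that are kept, 0 for padding) and shows one common way such a mask becomes an additive bias on attention scores. The helper names `default_attention_mask` and `additive_bias`, the pad id `0`, and the use of negative infinity as the masked value are assumptions made for illustration; they are not taken from the library.

```python
def default_attention_mask(input_ids, pad_token_id):
    """Mark real tokens with 1 and padding with 0 (hypothetical helper)."""
    return [[int(token != pad_token_id) for token in row] for row in input_ids]


def additive_bias(mask, masked_value=float("-inf")):
    """Turn a 0/1 mask into a bias added to attention scores before softmax.

    Kept positions get 0.0; masked positions get a very negative value so
    that softmax assigns them (near) zero weight. The exact value used by a
    given library may differ; -inf is an assumption here.
    """
    return [[0.0 if keep else masked_value for keep in row] for row in mask]


# Two sequences padded to length 4, assuming pad_token_id == 0.
batch = [[5, 6, 7, 0], [8, 9, 0, 0]]
mask = default_attention_mask(batch, pad_token_id=0)
bias = additive_bias(mask)
print(mask)
print(bias)
```

Passing an explicit attention_mask skips this inference, which matters when the padding id can also occur as a legitimate token.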
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmPreTrainedModel","title":"UtrLmPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = UtrLmConfig\n    base_model_prefix = \"utrlm\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"UtrLmLayer\", \"UtrLmEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"module/","title":"module","text":"

module provides a collection of pre-defined modules for users to implement their own architectures.

MultiMolecule is built upon the Hugging Face ecosystem, embracing a similar design philosophy: Don\u2019t Repeat Yourself. We follow the single model file policy, where each model under the models package contains one and only one modeling.py file that describes the network design.

The module package is intended for simple, reusable modules that are consistent across multiple models. This approach minimizes code duplication and promotes clean, maintainable code.

"},{"location":"module/#key-features","title":"Key Features","text":""},{"location":"module/#modules","title":"Modules","text":""},{"location":"module/embeddings/","title":"embeddings","text":"

embeddings provide a collection of pre-defined positional embeddings.

"},{"location":"module/embeddings/#multimolecule.module.embeddings","title":"multimolecule.module.embeddings","text":""},{"location":"module/embeddings/#multimolecule.module.embeddings.RotaryEmbedding","title":"RotaryEmbedding","text":"

Bases: Module

Rotary position embeddings based on those in RoFormer.

Queries and keys are transformed by rotation matrices which depend on their relative positions.

Cache

The cosine and sine tables are cached and recomputed only when the sequence length changes or the device changes.

Sequence Length

Rotary embeddings do not depend on a fixed maximum sequence length and can be used for any sequence length.

Source code in multimolecule/module/embeddings/rotary.py Python
@PositionEmbeddingRegistry.register(\"rotary\")\n@PositionEmbeddingRegistryHF.register(\"rotary\")\nclass RotaryEmbedding(nn.Module):\n    \"\"\"\n    Rotary position embeddings based on those in\n    [RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer).\n\n    Query and keys are transformed by rotation\n    matrices which depend on their relative positions.\n\n    Tip: **Cache**\n        The inverse frequency buffer is cached and updated only when the sequence length changes or the device changes.\n\n    Success: **Sequence Length**\n        Rotary Embedding is irrespective of the sequence length and can be used for any sequence length.\n    \"\"\"\n\n    def __init__(self, embedding_dim: int):\n        super().__init__()\n        # Generate and save the inverse frequency buffer (non trainable)\n        inv_freq = 1.0 / (10000 ** (torch.arange(0, embedding_dim, 2, dtype=torch.int64).float() / embedding_dim))\n        self.register_buffer(\"inv_freq\", inv_freq)\n\n        self._seq_len_cached = None\n        self._cos_cached = None\n        self._sin_cached = None\n\n    def forward(self, q: Tensor, k: Tensor) -> Tuple[Tensor, Tensor]:\n        self._update_cos_sin_tables(k, seq_dimension=-2)\n\n        return (self.apply_rotary_pos_emb(q), self.apply_rotary_pos_emb(k))\n\n    def _update_cos_sin_tables(self, x, seq_dimension=2):\n        seq_len = x.shape[seq_dimension]\n\n        # Reset the tables if the sequence length has changed,\n        # or if we're on a new device (possibly due to tracing for instance)\n        if seq_len != self._seq_len_cached or self._cos_cached.device != x.device:\n            self._seq_len_cached = seq_len\n            t = torch.arange(x.shape[seq_dimension], device=x.device).type_as(self.inv_freq)\n            freqs = torch.outer(t, self.inv_freq)\n            emb = torch.cat((freqs, freqs), dim=-1).to(x.device)\n\n            self._cos_cached = emb.cos()[None, None, :, :]\n            self._sin_cached = 
emb.sin()[None, None, :, :]\n\n        return self._cos_cached, self._sin_cached\n\n    def apply_rotary_pos_emb(self, x):\n        cos = self._cos_cached[:, :, : x.shape[-2], :]\n        sin = self._sin_cached[:, :, : x.shape[-2], :]\n\n        return (x * cos) + (self.rotate_half(x) * sin)\n\n    @staticmethod\n    def rotate_half(x):\n        x1, x2 = x.chunk(2, dim=-1)\n        return torch.cat((-x2, x1), dim=-1)\n
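The relative-position property can be checked with a small standalone sketch. It is not the library code: it uses plain Python lists instead of tensors, and the names rotary_rotate and dot are hypothetical helpers introduced only for illustration. It assumes the same frequencies and the same rotate-half layout as the source above.

```python
import math


def rotary_rotate(x, position, base=10000.0):
    '''Rotate one vector for a given position using the rotate-half layout.'''
    dim = len(x)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        # Same frequencies as inv_freq = 1 / base ** (2i / dim)
        theta = position / (base ** (2 * i / dim))
        cos, sin = math.cos(theta), math.sin(theta)
        # x * cos + rotate_half(x) * sin, where rotate_half(x) = (-x2, x1)
        out[i] = x[i] * cos - x[i + half] * sin
        out[i + half] = x[i + half] * cos + x[i] * sin
    return out


def dot(a, b):
    '''Plain dot product of two equally sized lists.'''
    return sum(u * v for u, v in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Both pairs of positions are two steps apart, so the scores should agree.
score_near = dot(rotary_rotate(q, 3), rotary_rotate(k, 1))
score_far = dot(rotary_rotate(q, 10), rotary_rotate(k, 8))
print(abs(score_near - score_far) < 1e-9)
```

Because each coordinate pair is rotated by an angle proportional to the position, the query-key score depends only on the offset between the two positions.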
"},{"location":"module/embeddings/#multimolecule.module.embeddings.SinusoidalEmbedding","title":"SinusoidalEmbedding","text":"

Bases: Embedding

Sinusoidal positional embeddings for inputs with any length.

Freezing

The embeddings are frozen and cannot be trained. They will not be saved in the model\u2019s state_dict.

Padding Idx

Padding symbols are ignored if the padding_idx is specified.

Sequence Length

These embeddings are automatically extended in forward if more positions are needed.

Source code in multimolecule/module/embeddings/sinusoidal.py Python
@PositionEmbeddingRegistry.register(\"sinusoidal\")\n@PositionEmbeddingRegistryHF.register(\"sinusoidal\")\nclass SinusoidalEmbedding(nn.Embedding):\n    r\"\"\"\n    Sinusoidal positional embeddings for inputs with any length.\n\n    Note: **Freezing**\n        The embeddings are frozen and cannot be trained.\n        They will not be saved in the model's state_dict.\n\n    Tip: **Padding Idx**\n        Padding symbols are ignored if the padding_idx is specified.\n\n    Success: **Sequence Length**\n        These embeddings get automatically extended in forward if more positions is needed.\n    \"\"\"\n\n    _is_hf_initialized = True\n\n    def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: int | None = None, bias: int = 0):\n        weight = self.get_embedding(num_embeddings, embedding_dim, padding_idx)\n        super().__init__(num_embeddings, embedding_dim, padding_idx, _weight=weight.detach(), _freeze=True)\n        self.bias = bias\n\n    def update_weight(self, num_embeddings: int, embedding_dim: int, padding_idx: int | None = None):\n        weight = self.get_embedding(num_embeddings, embedding_dim, padding_idx).to(\n            dtype=self.weight.dtype, device=self.weight.device  # type: ignore[has-type]\n        )\n        self.weight = nn.Parameter(weight.detach(), requires_grad=False)\n\n    @staticmethod\n    def get_embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None) -> Tensor:\n        \"\"\"\n        Build sinusoidal embeddings.\n\n        This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of\n        \"Attention Is All You Need\".\n        \"\"\"\n        half_dim = embedding_dim // 2\n        emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -(math.log(10000) / (half_dim - 1)))\n        emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)\n        emb = torch.cat([torch.sin(emb), torch.cos(emb)], 
dim=1).view(num_embeddings, -1)\n        if embedding_dim % 2 == 1:\n            # zero pad\n            emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)\n        if padding_idx is not None:\n            emb[padding_idx, :] = 0\n        return emb\n\n    @staticmethod\n    def get_position_ids(tensor, padding_idx: int | None = None):\n        \"\"\"\n        Replace non-padding symbols with their position numbers.\n\n        Position numbers begin at padding_idx+1. Padding symbols are ignored.\n        \"\"\"\n        # The series of casts and type-conversions here are carefully\n        # balanced to both work with ONNX export and XLA. In particular XLA\n        # prefers ints, cumsum defaults to output longs, and ONNX doesn't know\n        # how to handle the dtype kwarg in cumsum.\n        if padding_idx is None:\n            return torch.cumsum(tensor.new_ones(tensor.size(1)).long(), dim=0) - 1\n        mask = tensor.ne(padding_idx).int()\n        return (torch.cumsum(mask, dim=1).type_as(mask) * mask).long() + padding_idx\n\n    def forward(self, input_ids: Tensor) -> Tensor:\n        _, seq_len = input_ids.shape[:2]\n        # expand embeddings if needed\n        max_pos = seq_len + self.bias + 1\n        if self.padding_idx is not None:\n            max_pos += self.padding_idx\n        if max_pos > self.weight.size(0):\n            self.update_weight(max_pos, self.embedding_dim, self.padding_idx)\n        # Need to shift the position ids by the padding index\n        position_ids = self.get_position_ids(input_ids, self.padding_idx) + self.bias\n        return super().forward(position_ids)\n\n    def state_dict(self, destination=None, prefix=\"\", keep_vars=False):\n        return {}\n\n    def load_state_dict(self, *args, state_dict, strict=True):\n        return\n\n    def _load_from_state_dict(\n        self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n    ):\n        return\n
"},{"location":"module/embeddings/#multimolecule.module.embeddings.SinusoidalEmbedding.get_embedding","title":"get_embedding staticmethod","text":"Python
get_embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None) -> Tensor\n

Build sinusoidal embeddings.

This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of \u201cAttention Is All You Need\u201d.

Source code in multimolecule/module/embeddings/sinusoidal.py Python
@staticmethod\ndef get_embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None) -> Tensor:\n    \"\"\"\n    Build sinusoidal embeddings.\n\n    This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of\n    \"Attention Is All You Need\".\n    \"\"\"\n    half_dim = embedding_dim // 2\n    emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -(math.log(10000) / (half_dim - 1)))\n    emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)\n    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)\n    if embedding_dim % 2 == 1:\n        # zero pad\n        emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)\n    if padding_idx is not None:\n        emb[padding_idx, :] = 0\n    return emb\n
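To make the table layout concrete, here is a minimal pure-Python sketch of the same construction. The function name sinusoidal_table is hypothetical and the code uses lists rather than tensors; it assumes embedding_dim is at least 4 so that half_dim - 1 is not zero.

```python
import math


def sinusoidal_table(num_embeddings, embedding_dim, padding_idx=None):
    '''Build the table row by row: all sines first, then all cosines.'''
    half_dim = embedding_dim // 2
    # Geometric progression of frequencies; requires half_dim >= 2.
    scale = math.log(10000) / (half_dim - 1)
    freqs = [math.exp(-i * scale) for i in range(half_dim)]
    table = []
    for pos in range(num_embeddings):
        row = [math.sin(pos * f) for f in freqs] + [math.cos(pos * f) for f in freqs]
        if embedding_dim % 2 == 1:
            row.append(0.0)  # zero pad odd dimensions
        table.append(row)
    if padding_idx is not None:
        # The padding position maps to an all-zero vector.
        table[padding_idx] = [0.0] * embedding_dim
    return table


table = sinusoidal_table(num_embeddings=4, embedding_dim=5, padding_idx=1)
print(table[0])  # position 0: sines are 0, cosines are 1, then the zero pad
print(table[1])  # the padding row is all zeros
```

Note that the sine and cosine halves are concatenated rather than interleaved, which is the difference from the description in the paper that the docstring mentions.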
"},{"location":"module/embeddings/#multimolecule.module.embeddings.SinusoidalEmbedding.get_position_ids","title":"get_position_ids staticmethod","text":"Python
get_position_ids(tensor, padding_idx: int | None = None)\n

Replace non-padding symbols with their position numbers.

Position numbers begin at padding_idx+1. Padding symbols are ignored.

Source code in multimolecule/module/embeddings/sinusoidal.py Python
@staticmethod\ndef get_position_ids(tensor, padding_idx: int | None = None):\n    \"\"\"\n    Replace non-padding symbols with their position numbers.\n\n    Position numbers begin at padding_idx+1. Padding symbols are ignored.\n    \"\"\"\n    # The series of casts and type-conversions here are carefully\n    # balanced to both work with ONNX export and XLA. In particular XLA\n    # prefers ints, cumsum defaults to output longs, and ONNX doesn't know\n    # how to handle the dtype kwarg in cumsum.\n    if padding_idx is None:\n        return torch.cumsum(tensor.new_ones(tensor.size(1)).long(), dim=0) - 1\n    mask = tensor.ne(padding_idx).int()\n    return (torch.cumsum(mask, dim=1).type_as(mask) * mask).long() + padding_idx\n
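The cumulative-sum trick is easier to follow on a tiny batch. The sketch below mirrors the logic with plain lists; position_ids is a hypothetical stand-in for the static method, not the library function itself.

```python
def position_ids(rows, padding_idx=None):
    '''Number the non-padding tokens of each row, starting at padding_idx + 1.'''
    if padding_idx is None:
        # Without padding, positions are simply 0 .. seq_len - 1.
        return list(range(len(rows[0])))
    result = []
    for row in rows:
        ids, count = [], 0
        for token in row:
            keep = 1 if token != padding_idx else 0
            count += keep  # running count of real tokens (the cumsum)
            # Padding tokens collapse to padding_idx; real tokens are shifted by it.
            ids.append(count * keep + padding_idx)
        result.append(ids)
    return result


batch = [[5, 6, 7, 1, 1], [8, 9, 1, 1, 1]]
ids = position_ids(batch, padding_idx=1)
print(ids)  # [[2, 3, 4, 1, 1], [2, 3, 1, 1, 1]]
```

Padding tokens all receive padding_idx as their position, which is the row that get_embedding zeroes out.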
"},{"location":"module/heads/","title":"heads","text":"

heads provide a collection of pre-defined prediction heads.

heads take in either a ModelOutput, a dict, or a tuple as input. They automatically look for the model output required for prediction and process it accordingly.

Some prediction heads, such as ContactPredictionHead, may require additional information like the attention_mask or the input_ids. These can be passed in as positional or keyword arguments.

Note that heads use the same ModelOutput conventions as Transformers. If the model output is a tuple, we consider the first element as the last_hidden_state, the second element as the pooler_output, and the last element as the attention_map. It is the user\u2019s responsibility to ensure that the model output is correctly formatted.

If the model output is a ModelOutput or a dict, the heads look up HeadConfig.output_name in the model output. You can specify the output_name in the HeadConfig to ensure that the heads can correctly locate the required tensor.
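The lookup can be sketched in a few lines. This is an illustration only: select_output is a hypothetical helper, the strings stand in for tensors, and the tuple indices follow the forward methods shown in the source code on this page (index 0 for token-level heads, index 1 for sequence-level heads).

```python
from collections.abc import Mapping


def select_output(outputs, output_name, tuple_index):
    '''Pick the tensor a head needs from a dict-like or tuple model output.'''
    if isinstance(outputs, Mapping):
        return outputs[output_name]
    if isinstance(outputs, tuple):
        return outputs[tuple_index]
    raise ValueError(f'Unsupported type for outputs: {type(outputs)}')


# Placeholder strings are used instead of real tensors.
as_dict = {'last_hidden_state': 'per-token states', 'pooler_output': 'pooled vector'}
as_tuple = ('per-token states', 'pooled vector')

token_input = select_output(as_tuple, 'last_hidden_state', tuple_index=0)
sequence_input = select_output(as_tuple, 'pooler_output', tuple_index=1)
named_input = select_output(as_dict, 'pooler_output', tuple_index=1)
print(token_input, sequence_input, named_input, sep=' | ')
```

With a dict-like output the name decides which tensor is used, so a custom output_name only has an effect for ModelOutput or dict inputs.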

"},{"location":"module/heads/#multimolecule.module.heads.config","title":"multimolecule.module.heads.config","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig","title":"HeadConfig","text":"

Bases: BaseHeadConfig

Configuration class for a prediction head.

Parameters:

num_labels (Optional[int], default None): Number of labels to use in the last layer added to the model, typically for a classification task. If None, the head should fall back to Config.num_labels.

problem_type (Optional[str], default None): Problem type for XxxForYyyPrediction models. Can be one of \"binary\", \"regression\", \"multiclass\" or \"multilabel\". If None, the head should fall back to Config.problem_type.

hidden_size (Optional[int], default None): Dimensionality of the encoder layers and the pooler layer. If None, the head should fall back to Config.hidden_size.

dropout (float, default 0.0): The dropout ratio for the hidden states.

transform (Optional[str], default None): The transform operation applied to hidden states.

transform_act (Optional[str], default \"gelu\"): The activation function of transform applied to hidden states.

bias (bool, default True): Whether to apply bias to the final prediction layer.

act (Optional[str], default None): The activation function of the final prediction output.

layer_norm_eps (float, default 1e-12): The epsilon used by the layer normalization layers.

output_name (Optional[str], default None): The name of the tensor required in model outputs. If None, the default output name of the corresponding head is used.

type (Optional[str], default None): The type of the head in the model. This is used by MultiMoleculeModel to construct heads.

Source code in multimolecule/module/heads/config.py Python
class HeadConfig(BaseHeadConfig):\n    r\"\"\"\n    Configuration class for a prediction head.\n\n    Args:\n        num_labels:\n            Number of labels to use in the last layer added to the model, typically for a classification task.\n\n            Head should look for [`Config.num_labels`][multimolecule.PreTrainedConfig] if is `None`.\n        problem_type:\n            Problem type for `XxxForYyyPrediction` models. Can be one of `\"binary\"`, `\"regression\"`,\n            `\"multiclass\"` or `\"multilabel\"`.\n\n            Head should look for [`Config.problem_type`][multimolecule.PreTrainedConfig] if is `None`.\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n\n            Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n        dropout:\n            The dropout ratio for the hidden states.\n        transform:\n            The transform operation applied to hidden states.\n        transform_act:\n            The activation function of transform applied to hidden states.\n        bias:\n            Whether to apply bias to the final prediction layer.\n        act:\n            The activation function of the final prediction output.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        output_name:\n            The name of the tensor required in model outputs.\n\n            If is `None`, will use the default output name of the corresponding head.\n        type:\n            The type of the head in the model.\n\n            This is used by [`MultiMoleculeModel`][multimolecule.MultiMoleculeModel] to construct heads.\n    \"\"\"\n\n    num_labels: Optional[int] = None\n    problem_type: Optional[str] = None\n    hidden_size: Optional[int] = None\n    dropout: float = 0.0\n    transform: Optional[str] = None\n    transform_act: Optional[str] = \"gelu\"\n    bias: bool = True\n    act: Optional[str] = None\n    layer_norm_eps: 
float = 1e-12\n    output_name: Optional[str] = None\n    type: Optional[str] = None\n
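The fallback rule described for num_labels, problem_type and hidden_size can be sketched as follows. This is an assumption-laden illustration, not the library implementation: SketchModelConfig, SketchHeadConfig and resolve_head_field are hypothetical names, and the real heads may resolve these fields differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchModelConfig:
    '''Hypothetical stand-in for the model-level configuration.'''
    hidden_size: int = 768
    num_labels: int = 2
    problem_type: Optional[str] = None


@dataclass
class SketchHeadConfig:
    '''Hypothetical stand-in for the head configuration; None means inherit.'''
    num_labels: Optional[int] = None
    problem_type: Optional[str] = None
    hidden_size: Optional[int] = None


def resolve_head_field(head_config, model_config, name):
    '''Use the head value when it is set, otherwise fall back to the model config.'''
    value = getattr(head_config, name)
    return value if value is not None else getattr(model_config, name)


model_config = SketchModelConfig()
head_config = SketchHeadConfig(num_labels=5)

num_labels = resolve_head_field(head_config, model_config, 'num_labels')
hidden_size = resolve_head_field(head_config, model_config, 'hidden_size')
print(num_labels, hidden_size)  # 5 768
```

Leaving a field as None keeps the head in sync with the backbone, while an explicit value overrides it for that head only.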
"},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(num_labels)","title":"num_labels","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(problem_type)","title":"problem_type","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(hidden_size)","title":"hidden_size","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(dropout)","title":"dropout","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(transform)","title":"transform","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(transform_act)","title":"transform_act","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(bias)","title":"bias","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(act)","title":"act","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(type)","title":"type","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig","title":"MaskedLMHeadConfig","text":"

Bases: BaseHeadConfig

Configuration class for a Masked Language Modeling head.

Parameters:

hidden_size (Optional[int], default None): Dimensionality of the encoder layers and the pooler layer. If None, the head should fall back to Config.hidden_size.

dropout (float, default 0.0): The dropout ratio for the hidden states.

transform (Optional[str], default \"nonlinear\"): The transform operation applied to hidden states.

transform_act (Optional[str], default \"gelu\"): The activation function of transform applied to hidden states.

bias (bool, default True): Whether to apply bias to the final prediction layer.

act (Optional[str], default None): The activation function of the final prediction output.

layer_norm_eps (float, default 1e-12): The epsilon used by the layer normalization layers.

output_name (Optional[str], default None): The name of the tensor required in model outputs. If None, the default output name of the corresponding head is used.

Source code in multimolecule/module/heads/config.py Python
class MaskedLMHeadConfig(BaseHeadConfig):\n    r\"\"\"\n    Configuration class for a Masked Language Modeling head.\n\n    Args:\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n\n            Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n        dropout:\n            The dropout ratio for the hidden states.\n        transform:\n            The transform operation applied to hidden states.\n        transform_act:\n            The activation function of transform applied to hidden states.\n        bias:\n            Whether to apply bias to the final prediction layer.\n        act:\n            The activation function of the final prediction output.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        output_name:\n            The name of the tensor required in model outputs.\n\n            If is `None`, will use the default output name of the corresponding head.\n    \"\"\"\n\n    hidden_size: Optional[int] = None\n    dropout: float = 0.0\n    transform: Optional[str] = \"nonlinear\"\n    transform_act: Optional[str] = \"gelu\"\n    bias: bool = True\n    act: Optional[str] = None\n    layer_norm_eps: float = 1e-12\n    output_name: Optional[str] = None\n
"},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(hidden_size)","title":"hidden_size","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(dropout)","title":"dropout","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(transform)","title":"transform","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(transform_act)","title":"transform_act","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(bias)","title":"bias","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(act)","title":"act","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence","title":"multimolecule.module.heads.sequence","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead","title":"SequencePredictionHead","text":"

Bases: PredictionHead

Head for sequence-level tasks.

Parameters:

config (PreTrainedConfig, required): The configuration object for the model.

head_config (HeadConfig | None, default None): The configuration object for the head. If None, the head uses the configuration from config.

Source code in multimolecule/module/heads/sequence.py Python
@HeadRegistry.register(\"sequence\")\nclass SequencePredictionHead(PredictionHead):\n    r\"\"\"\n    Head for tasks in sequence-level.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    output_name: str = \"pooler_output\"\n    r\"\"\"The default output to use for the head.\"\"\"\n\n    def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n        super().__init__(config, head_config)\n        if head_config is not None and head_config.output_name is not None:\n            self.output_name = head_config.output_name\n\n    def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n        self,\n        outputs: ModelOutput | Tuple[Tensor, ...],\n        labels: Tensor | None = None,\n        output_name: str | None = None,\n        **kwargs,\n    ) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the SequencePredictionHead.\n\n        Args:\n            outputs: The outputs of the model.\n            labels: The labels for the head.\n            output_name: The name of the output to use.\n                Defaults to `self.output_name`.\n        \"\"\"\n        if isinstance(outputs, (Mapping, ModelOutput)):\n            output = outputs[output_name or self.output_name]\n        elif isinstance(outputs, tuple):\n            output = outputs[1]\n        else:\n            # Fail early instead of hitting an unbound `output` below\n            raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n        return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.output_name","title":"output_name class-attribute instance-attribute","text":"Python
output_name: str = 'pooler_output'\n

The default output to use for the head.

"},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward","title":"forward","text":"Python
forward(outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n

Forward pass of the SequencePredictionHead.

Parameters:

outputs (ModelOutput | Tuple[Tensor, ...], required): The outputs of the model.

labels (Tensor | None, default None): The labels for the head.

output_name (str | None, default None): The name of the output to use. Defaults to self.output_name.

Source code in multimolecule/module/heads/sequence.py Python
def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n    self,\n    outputs: ModelOutput | Tuple[Tensor, ...],\n    labels: Tensor | None = None,\n    output_name: str | None = None,\n    **kwargs,\n) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the SequencePredictionHead.\n\n    Args:\n        outputs: The outputs of the model.\n        labels: The labels for the head.\n        output_name: The name of the output to use.\n            Defaults to `self.output_name`.\n    \"\"\"\n    if isinstance(outputs, (Mapping, ModelOutput)):\n        output = outputs[output_name or self.output_name]\n    elif isinstance(outputs, tuple):\n        output = outputs[1]\n    else:\n        # Fail early instead of hitting an unbound `output` below\n        raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n    return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward(outputs)","title":"outputs","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.token","title":"multimolecule.module.heads.token","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead","title":"TokenPredictionHead","text":"

Bases: PredictionHead

Head for token-level tasks.

Parameters:

config (PreTrainedConfig, required): The configuration object for the model.

head_config (HeadConfig | None, default None): The configuration object for the head. If None, the head uses the configuration from config.

Source code in multimolecule/module/heads/token.py Python
@HeadRegistry.token.register(\"single\", default=True)\n@TokenHeadRegistryHF.register(\"single\", default=True)\nclass TokenPredictionHead(PredictionHead):\n    r\"\"\"\n    Head for tasks in token-level.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    output_name: str = \"last_hidden_state\"\n    r\"\"\"The default output to use for the head.\"\"\"\n\n    def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n        super().__init__(config, head_config)\n        if head_config is not None and head_config.output_name is not None:\n            self.output_name = head_config.output_name\n\n    def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n        self,\n        outputs: ModelOutput | Tuple[Tensor, ...],\n        attention_mask: Tensor | None = None,\n        input_ids: NestedTensor | Tensor | None = None,\n        labels: Tensor | None = None,\n        output_name: str | None = None,\n        **kwargs,\n    ) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the TokenPredictionHead.\n\n        Args:\n            outputs: The outputs of the model.\n            attention_mask: The attention mask for the inputs.\n            input_ids: The input ids for the inputs.\n            labels: The labels for the head.\n            output_name: The name of the output to use.\n                Defaults to `self.output_name`.\n        \"\"\"\n        if isinstance(outputs, (Mapping, ModelOutput)):\n            output = outputs[output_name or self.output_name]\n        elif isinstance(outputs, tuple):\n            output = outputs[0]\n        else:\n            raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n        if attention_mask is None:\n            attention_mask = self._get_attention_mask(input_ids)\n        output = 
output * attention_mask.unsqueeze(-1)\n        output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n        return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.output_name","title":"output_name class-attribute instance-attribute","text":"Python
output_name: str = 'last_hidden_state'\n

The default output to use for the head.

"},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward","title":"forward","text":"Python
forward(outputs: ModelOutput | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n

Forward pass of the TokenPredictionHead.

Parameters:

outputs (ModelOutput | Tuple[Tensor, ...], required): The outputs of the model.

attention_mask (Tensor | None, default None): The attention mask for the inputs.

input_ids (NestedTensor | Tensor | None, default None): The input ids for the inputs.

labels (Tensor | None, default None): The labels for the head.

output_name (str | None, default None): The name of the output to use. Defaults to self.output_name.

Source code in multimolecule/module/heads/token.py Python
def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n    self,\n    outputs: ModelOutput | Tuple[Tensor, ...],\n    attention_mask: Tensor | None = None,\n    input_ids: NestedTensor | Tensor | None = None,\n    labels: Tensor | None = None,\n    output_name: str | None = None,\n    **kwargs,\n) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the TokenPredictionHead.\n\n    Args:\n        outputs: The outputs of the model.\n        attention_mask: The attention mask for the inputs.\n        input_ids: The input ids for the inputs.\n        labels: The labels for the head.\n        output_name: The name of the output to use.\n            Defaults to `self.output_name`.\n    \"\"\"\n    if isinstance(outputs, (Mapping, ModelOutput)):\n        output = outputs[output_name or self.output_name]\n    elif isinstance(outputs, tuple):\n        output = outputs[0]\n    else:\n        raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n    if attention_mask is None:\n        attention_mask = self._get_attention_mask(input_ids)\n    output = output * attention_mask.unsqueeze(-1)\n    output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n    return super().forward(output, labels, **kwargs)\n
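The masking step, output * attention_mask.unsqueeze(-1), zeroes the hidden vectors of padded positions before the prediction layer sees them. A minimal sketch with nested lists in place of tensors (mask_hidden_states is a hypothetical helper, not part of the library):

```python
def mask_hidden_states(hidden, attention_mask):
    '''Multiply every hidden vector by its 0/1 mask value.'''
    return [
        [[value * keep for value in vector] for vector, keep in zip(seq, mask)]
        for seq, mask in zip(hidden, attention_mask)
    ]


# One sequence of three tokens with hidden size 2; the last token is padding.
hidden = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
mask = [[1, 1, 0]]

masked = mask_hidden_states(hidden, mask)
print(masked)  # [[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]]
```

In the tensor version, unsqueeze(-1) adds a trailing dimension to the mask so that it broadcasts across the hidden dimension in the same way.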
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(outputs)","title":"outputs","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(attention_mask)","title":"attention_mask","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(input_ids)","title":"input_ids","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead","title":"TokenKMerHead","text":"

Bases: PredictionHead

Head for token-level tasks with k-mer inputs.

Parameters:

config (PreTrainedConfig, required): The configuration object for the model.

head_config (HeadConfig | None, default None): The configuration object for the head. If None, the head uses the configuration from config.

Source code in multimolecule/module/heads/token.py Python
@HeadRegistry.register(\"token.kmer\")\n@TokenHeadRegistryHF.register(\"kmer\")\nclass TokenKMerHead(PredictionHead):\n    r\"\"\"\n    Head for tasks in token-level with kmer inputs.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    output_name: str = \"last_hidden_state\"\n    r\"\"\"The default output to use for the head.\"\"\"\n\n    def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n        super().__init__(config, head_config)\n        self.nmers = config.nmers\n        if head_config is not None and head_config.output_name is not None:\n            self.output_name = head_config.output_name\n        # Do not pass bos_token_id and eos_token_id to unfold_kmer_embeddings\n        # As they will be removed in preprocess\n        self.unfold_kmer_embeddings = partial(unfold_kmer_embeddings, nmers=self.nmers)\n\n    def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n        self,\n        outputs: ModelOutput | Tuple[Tensor, ...],\n        attention_mask: Tensor | None = None,\n        input_ids: NestedTensor | Tensor | None = None,\n        labels: Tensor | None = None,\n        output_name: str | None = None,\n        **kwargs,\n    ) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the TokenKMerHead.\n\n        Args:\n            outputs: The outputs of the model.\n            attention_mask: The attention mask for the inputs.\n            input_ids: The input ids for the inputs.\n            labels: The labels for the head.\n            output_name: The name of the output to use.\n                Defaults to `self.output_name`.\n        \"\"\"\n        if isinstance(outputs, (Mapping, ModelOutput)):\n            output = outputs[output_name or self.output_name]\n        elif isinstance(outputs, tuple):\n            output = 
outputs[0]\n        else:\n            raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n        if attention_mask is None:\n            attention_mask = self._get_attention_mask(input_ids)\n        output = output * attention_mask.unsqueeze(-1)\n        output, attention_mask, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n        output = self.unfold_kmer_embeddings(output, attention_mask)\n        return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.output_name","title":"output_name class-attribute instance-attribute","text":"Python
output_name: str = 'last_hidden_state'\n

The default output to use for the head.

"},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward","title":"forward","text":"Python
forward(outputs: ModelOutput | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n

Forward pass of the TokenKMerHead.

Parameters:

outputs (ModelOutput | Tuple[Tensor, ...]): The outputs of the model. Required.

attention_mask (Tensor | None): The attention mask for the inputs. Default: None.

input_ids (NestedTensor | Tensor | None): The input ids for the inputs. Default: None.

labels (Tensor | None): The labels for the head. Default: None.

output_name (str | None): The name of the output to use; falls back to self.output_name when None. Default: None.

Source code in multimolecule/module/heads/token.py Python
def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n    self,\n    outputs: ModelOutput | Tuple[Tensor, ...],\n    attention_mask: Tensor | None = None,\n    input_ids: NestedTensor | Tensor | None = None,\n    labels: Tensor | None = None,\n    output_name: str | None = None,\n    **kwargs,\n) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the TokenKMerHead.\n\n    Args:\n        outputs: The outputs of the model.\n        attention_mask: The attention mask for the inputs.\n        input_ids: The input ids for the inputs.\n        labels: The labels for the head.\n        output_name: The name of the output to use.\n            Defaults to `self.output_name`.\n    \"\"\"\n    if isinstance(outputs, (Mapping, ModelOutput)):\n        output = outputs[output_name or self.output_name]\n    elif isinstance(outputs, tuple):\n        output = outputs[0]\n    else:\n        raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n    if attention_mask is None:\n        attention_mask = self._get_attention_mask(input_ids)\n    output = output * attention_mask.unsqueeze(-1)\n    output, attention_mask, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n    output = self.unfold_kmer_embeddings(output, attention_mask)\n    return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(outputs)","title":"outputs","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(attention_mask)","title":"attention_mask","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(input_ids)","title":"input_ids","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.contact","title":"multimolecule.module.heads.contact","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead","title":"ContactPredictionHead","text":"

Bases: PredictionHead

Head for contact-level tasks.

Performs symmetrization and average product correction (APC) on the attention maps.

Parameters:

config (PreTrainedConfig): The configuration object for the model. Required.

head_config (HeadConfig | None): The configuration object for the head. If None, the head configuration is taken from config. Default: None.

Source code in multimolecule/module/heads/contact.py Python
@HeadRegistry.contact.register(\"attention\")\nclass ContactPredictionHead(PredictionHead):\n    r\"\"\"\n    Head for tasks in contact-level.\n\n    Performs symmetrization, and average product correct.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    output_name: str = \"attentions\"\n    r\"\"\"The default output to use for the head.\"\"\"\n\n    requires_attention: bool = True\n\n    def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n        super().__init__(config, head_config)\n        self.config.hidden_size = config.num_hidden_layers * config.num_attention_heads\n        num_layers = self.config.get(\"num_layers\", 16)\n        num_channels = self.config.get(\"num_channels\", self.config.hidden_size // 10)  # type: ignore[operator]\n        block = self.config.get(\"block\", \"auto\")\n        self.decoder = ResNet(\n            num_layers=num_layers,\n            hidden_size=self.config.hidden_size,  # type: ignore[arg-type]\n            block=block,\n            num_channels=num_channels,\n            num_labels=self.num_labels,\n        )\n        if head_config is not None and head_config.output_name is not None:\n            self.output_name = head_config.output_name\n\n    def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n        self,\n        outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n        attention_mask: Tensor | None = None,\n        input_ids: NestedTensor | Tensor | None = None,\n        labels: Tensor | None = None,\n        output_name: str | None = None,\n        **kwargs,\n    ) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the ContactPredictionHead.\n\n        Args:\n            outputs: The outputs of the model.\n            attention_mask: The attention mask for the inputs.\n            
input_ids: The input ids for the inputs.\n            labels: The labels for the head.\n            output_name: The name of the output to use.\n                Defaults to `self.output_name`.\n        \"\"\"\n\n        if isinstance(outputs, (Mapping, ModelOutput)):\n            output = outputs[output_name or self.output_name]\n        elif isinstance(outputs, tuple):\n            output = outputs[-1]\n        attentions = torch.stack(output, 1)\n\n        # In the original model, attentions for padding tokens are completely zeroed out.\n        # This makes no difference most of the time because the other tokens won't attend to them,\n        # but it does for the contact prediction task, which takes attentions as input,\n        # so we have to mimic that here.\n        if attention_mask is None:\n            attention_mask = self._get_attention_mask(input_ids)\n        attention_mask = attention_mask.unsqueeze(1) * attention_mask.unsqueeze(2)\n        attentions = attentions * attention_mask[:, None, None, :, :]\n\n        # remove cls token attentions\n        if self.bos_token_id is not None:\n            attentions = attentions[..., 1:, 1:]\n            attention_mask = attention_mask[..., 1:]\n            if input_ids is not None:\n                input_ids = input_ids[..., 1:]\n        # remove eos token attentions\n        if self.eos_token_id is not None:\n            if input_ids is not None:\n                eos_mask = input_ids.ne(self.eos_token_id).to(attentions)\n            else:\n                last_valid_indices = attention_mask.sum(dim=-1)\n                seq_length = attention_mask.size(-1)\n                eos_mask = torch.arange(seq_length, device=attentions.device).unsqueeze(0) == last_valid_indices\n            eos_mask = eos_mask.unsqueeze(1) * eos_mask.unsqueeze(2)\n            attentions = attentions * eos_mask[:, None, None, :, :]\n            attentions = attentions[..., :-1, :-1]\n\n        # features: batch x channels x input_ids 
x input_ids (symmetric)\n        batch_size, layers, heads, seqlen, _ = attentions.size()\n        attentions = attentions.view(batch_size, layers * heads, seqlen, seqlen)\n        attentions = attentions.to(self.decoder.proj.weight.device)\n        attentions = average_product_correct(symmetrize(attentions))\n        attentions = attentions.permute(0, 2, 3, 1).squeeze(3)\n\n        return super().forward(attentions, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.output_name","title":"output_name class-attribute instance-attribute","text":"Python
output_name: str = 'attentions'\n

The default output to use for the head.

"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward","title":"forward","text":"Python
forward(outputs: ModelOutput | Mapping | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n

Forward pass of the ContactPredictionHead.

Parameters:

outputs (ModelOutput | Mapping | Tuple[Tensor, ...]): The outputs of the model. Required.

attention_mask (Tensor | None): The attention mask for the inputs. Default: None.

input_ids (NestedTensor | Tensor | None): The input ids for the inputs. Default: None.

labels (Tensor | None): The labels for the head. Default: None.

output_name (str | None): The name of the output to use; falls back to self.output_name when None. Default: None.

Source code in multimolecule/module/heads/contact.py Python
def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n    self,\n    outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n    attention_mask: Tensor | None = None,\n    input_ids: NestedTensor | Tensor | None = None,\n    labels: Tensor | None = None,\n    output_name: str | None = None,\n    **kwargs,\n) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the ContactPredictionHead.\n\n    Args:\n        outputs: The outputs of the model.\n        attention_mask: The attention mask for the inputs.\n        input_ids: The input ids for the inputs.\n        labels: The labels for the head.\n        output_name: The name of the output to use.\n            Defaults to `self.output_name`.\n    \"\"\"\n\n    if isinstance(outputs, (Mapping, ModelOutput)):\n        output = outputs[output_name or self.output_name]\n    elif isinstance(outputs, tuple):\n        output = outputs[-1]\n    attentions = torch.stack(output, 1)\n\n    # In the original model, attentions for padding tokens are completely zeroed out.\n    # This makes no difference most of the time because the other tokens won't attend to them,\n    # but it does for the contact prediction task, which takes attentions as input,\n    # so we have to mimic that here.\n    if attention_mask is None:\n        attention_mask = self._get_attention_mask(input_ids)\n    attention_mask = attention_mask.unsqueeze(1) * attention_mask.unsqueeze(2)\n    attentions = attentions * attention_mask[:, None, None, :, :]\n\n    # remove cls token attentions\n    if self.bos_token_id is not None:\n        attentions = attentions[..., 1:, 1:]\n        attention_mask = attention_mask[..., 1:]\n        if input_ids is not None:\n            input_ids = input_ids[..., 1:]\n    # remove eos token attentions\n    if self.eos_token_id is not None:\n        if input_ids is not None:\n            eos_mask = input_ids.ne(self.eos_token_id).to(attentions)\n        else:\n            last_valid_indices = 
attention_mask.sum(dim=-1)\n            seq_length = attention_mask.size(-1)\n            eos_mask = torch.arange(seq_length, device=attentions.device).unsqueeze(0) == last_valid_indices\n        eos_mask = eos_mask.unsqueeze(1) * eos_mask.unsqueeze(2)\n        attentions = attentions * eos_mask[:, None, None, :, :]\n        attentions = attentions[..., :-1, :-1]\n\n    # features: batch x channels x input_ids x input_ids (symmetric)\n    batch_size, layers, heads, seqlen, _ = attentions.size()\n    attentions = attentions.view(batch_size, layers * heads, seqlen, seqlen)\n    attentions = attentions.to(self.decoder.proj.weight.device)\n    attentions = average_product_correct(symmetrize(attentions))\n    attentions = attentions.permute(0, 2, 3, 1).squeeze(3)\n\n    return super().forward(attentions, labels, **kwargs)\n
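
The padding step in forward turns the per-token mask into a pairwise mask, so an attention entry survives only when both its row token and its column token are real. Below is a minimal dependency-free sketch of that outer product for a single sequence, using plain lists instead of tensors. The helper name pairwise_mask is hypothetical and not part of multimolecule.

```python
def pairwise_mask(mask):
    # mask: list of 0/1 flags, one per token (1 = real token, 0 = padding).
    # Entry [i][j] is 1 only when both token i and token j are real,
    # mirroring attention_mask.unsqueeze(1) * attention_mask.unsqueeze(2)
    # for one sequence of the batch.
    return [[mi * mj for mj in mask] for mi in mask]


mask = [1, 1, 0]  # two real tokens followed by one padding token
pair = pairwise_mask(mask)
for row in pair:
    print(row)
```

Every row and column that belongs to the padding token is zeroed, which is what the source comment describes as mimicking the original model.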
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(outputs)","title":"outputs","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(attention_mask)","title":"attention_mask","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(input_ids)","title":"input_ids","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead","title":"ContactLogitsHead","text":"

Bases: PredictionHead

Head for contact-level tasks.

Builds a symmetric contact map from the pairwise product of token representations.

Parameters:

config (PreTrainedConfig): The configuration object for the model. Required.

head_config (HeadConfig | None): The configuration object for the head. If None, the head configuration is taken from config. Default: None.

Source code in multimolecule/module/heads/contact.py Python
@HeadRegistry.contact.register(\"logits\")\nclass ContactLogitsHead(PredictionHead):\n    r\"\"\"\n    Head for tasks in contact-level.\n\n    Performs symmetrization, and average product correct.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    output_name: str = \"last_hidden_state\"\n    r\"\"\"The default output to use for the head.\"\"\"\n\n    requires_attention: bool = False\n\n    def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n        super().__init__(config, head_config)\n        num_layers = self.config.get(\"num_layers\", 16)\n        num_channels = self.config.get(\"num_channels\", self.config.hidden_size // 10)  # type: ignore[operator]\n        block = self.config.get(\"block\", \"auto\")\n        self.decoder = ResNet(\n            num_layers=num_layers,\n            hidden_size=self.config.hidden_size,  # type: ignore[arg-type]\n            block=block,\n            num_channels=num_channels,\n            num_labels=self.num_labels,\n        )\n        if head_config is not None and head_config.output_name is not None:\n            self.output_name = head_config.output_name\n\n    def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n        self,\n        outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n        attention_mask: Tensor | None = None,\n        input_ids: NestedTensor | Tensor | None = None,\n        labels: Tensor | None = None,\n        output_name: str | None = None,\n        **kwargs,\n    ) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the ContactPredictionHead.\n\n        Args:\n            outputs: The outputs of the model.\n            attention_mask: The attention mask for the inputs.\n            input_ids: The input ids for the inputs.\n            labels: The labels for the head.\n     
       output_name: The name of the output to use.\n                Defaults to `self.output_name`.\n        \"\"\"\n        if isinstance(outputs, (Mapping, ModelOutput)):\n            output = outputs[output_name or self.output_name]\n        elif isinstance(outputs, tuple):\n            output = outputs[0]\n        else:\n            raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n        if attention_mask is None:\n            attention_mask = self._get_attention_mask(input_ids)\n        output = output * attention_mask.unsqueeze(-1)\n        output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n        # make symmetric contact map\n        contact_map = output.unsqueeze(1) * output.unsqueeze(2)\n\n        return super().forward(contact_map, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.output_name","title":"output_name class-attribute instance-attribute","text":"Python
output_name: str = 'last_hidden_state'\n

The default output to use for the head.

"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward","title":"forward","text":"Python
forward(outputs: ModelOutput | Mapping | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n

Forward pass of the ContactLogitsHead.

Parameters:

outputs (ModelOutput | Mapping | Tuple[Tensor, ...]): The outputs of the model. Required.

attention_mask (Tensor | None): The attention mask for the inputs. Default: None.

input_ids (NestedTensor | Tensor | None): The input ids for the inputs. Default: None.

labels (Tensor | None): The labels for the head. Default: None.

output_name (str | None): The name of the output to use; falls back to self.output_name when None. Default: None.

Source code in multimolecule/module/heads/contact.py Python
def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n    self,\n    outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n    attention_mask: Tensor | None = None,\n    input_ids: NestedTensor | Tensor | None = None,\n    labels: Tensor | None = None,\n    output_name: str | None = None,\n    **kwargs,\n) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the ContactPredictionHead.\n\n    Args:\n        outputs: The outputs of the model.\n        attention_mask: The attention mask for the inputs.\n        input_ids: The input ids for the inputs.\n        labels: The labels for the head.\n        output_name: The name of the output to use.\n            Defaults to `self.output_name`.\n    \"\"\"\n    if isinstance(outputs, (Mapping, ModelOutput)):\n        output = outputs[output_name or self.output_name]\n    elif isinstance(outputs, tuple):\n        output = outputs[0]\n    else:\n        raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n    if attention_mask is None:\n        attention_mask = self._get_attention_mask(input_ids)\n    output = output * attention_mask.unsqueeze(-1)\n    output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n    # make symmetric contact map\n    contact_map = output.unsqueeze(1) * output.unsqueeze(2)\n\n    return super().forward(contact_map, labels, **kwargs)\n
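
The contact map in this forward pass is the elementwise product of every pair of token representations, which makes it symmetric in the two sequence axes. Below is a minimal plain-list sketch of that step for one sequence. The function name contact_features and the sample values are illustrative assumptions, not part of multimolecule.

```python
def contact_features(embeddings):
    # embeddings: list of L per-token feature vectors, each of size H.
    # Returns an L x L x H nested list where entry [i][j][h] equals
    # embeddings[i][h] * embeddings[j][h], mirroring
    # output.unsqueeze(1) * output.unsqueeze(2) for one sequence.
    return [
        [[a * b for a, b in zip(ei, ej)] for ej in embeddings]
        for ei in embeddings
    ]


emb = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # made-up values, L = 3 and H = 2
feat = contact_features(emb)
print(feat[0][1])  # pair (0, 1)
print(feat[1][0])  # pair (1, 0), the same features
```

Because multiplication commutes, swapping the two token indices leaves the features unchanged, so no separate symmetrization step is needed here.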
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(outputs)","title":"outputs","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(attention_mask)","title":"attention_mask","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(input_ids)","title":"input_ids","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.symmetrize","title":"symmetrize","text":"Python
symmetrize(x)\n

Make the input symmetric in its final two dimensions; used for contact prediction.

Source code in multimolecule/module/heads/contact.py Python
def symmetrize(x):\n    \"Make layer symmetric in final two dimensions, used for contact prediction.\"\n    return x + x.transpose(-1, -2)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.average_product_correct","title":"average_product_correct","text":"Python
average_product_correct(x)\n

Perform average product correction (APC); used for contact prediction.

Source code in multimolecule/module/heads/contact.py Python
def average_product_correct(x):\n    \"Perform average product correct, used for contact prediction.\"\n    a1 = x.sum(-1, keepdims=True)\n    a2 = x.sum(-2, keepdims=True)\n    a12 = x.sum((-1, -2), keepdims=True)\n\n    avg = a1 * a2\n    avg.div_(a12)  # in-place to reduce memory\n    normalized = x - avg\n    return normalized\n
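
As a worked illustration, the correction subtracts (row sum * column sum) / total sum from every entry of a symmetrized map. The sketch below re-implements both steps on a single square matrix with plain lists. The names symmetrize_2d and apc_2d are hypothetical stand-ins and are not the tensor functions listed above.

```python
def symmetrize_2d(x):
    # x + transpose(x) for a square matrix given as a list of rows.
    n = len(x)
    return [[x[i][j] + x[j][i] for j in range(n)] for i in range(n)]


def apc_2d(x):
    # Subtract (row sum * column sum) / total sum from every entry.
    n = len(x)
    row = [sum(x[i]) for i in range(n)]
    col = [sum(x[i][j] for i in range(n)) for j in range(n)]
    total = sum(row)
    return [[x[i][j] - row[i] * col[j] / total for j in range(n)] for i in range(n)]


sym = symmetrize_2d([[0.0, 1.0], [2.0, 3.0]])
apc = apc_2d(sym)
print(sym)  # [[0.0, 3.0], [3.0, 6.0]]
print(apc)  # [[-0.75, 0.75], [0.75, -0.75]]
```

Each row of the corrected matrix sums to zero, since the subtracted term removes the part of every entry that is explained by its row and column totals alone.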
"},{"location":"module/heads/#multimolecule.module.heads.pretrain","title":"multimolecule.module.heads.pretrain","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead","title":"MaskedLMHead","text":"

Bases: Module

Head for masked language modeling.

Parameters:

config (PreTrainedConfig): The configuration object for the model. Required.

head_config (MaskedLMHeadConfig | None): The configuration object for the head. If None, the head configuration is taken from config. Default: None.

Source code in multimolecule/module/heads/pretrain.py Python
@HeadRegistry.register(\"masked_lm\")\nclass MaskedLMHead(nn.Module):\n    r\"\"\"\n    Head for masked language modeling.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    output_name: str = \"last_hidden_state\"\n    r\"\"\"The default output to use for the head.\"\"\"\n\n    def __init__(\n        self, config: PreTrainedConfig, weight: Tensor | None = None, head_config: MaskedLMHeadConfig | None = None\n    ):\n        super().__init__()\n        if head_config is None:\n            head_config = (config.lm_head if hasattr(config, \"lm_head\") else config.head) or MaskedLMHeadConfig()\n        self.config: MaskedLMHeadConfig = head_config\n        if self.config.hidden_size is None:\n            self.config.hidden_size = config.hidden_size\n        self.num_labels = config.vocab_size\n        self.dropout = nn.Dropout(self.config.dropout)\n        self.transform = HeadTransformRegistryHF.build(self.config)\n        self.decoder = nn.Linear(self.config.hidden_size, self.num_labels, bias=False)\n        if weight is not None:\n            self.decoder.weight = weight\n        if self.config.bias:\n            self.bias = nn.Parameter(torch.zeros(self.num_labels))\n            self.decoder.bias = self.bias\n        self.activation = ACT2FN[self.config.act] if self.config.act is not None else None\n        if head_config is not None and head_config.output_name is not None:\n            self.output_name = head_config.output_name\n\n    def forward(\n        self, outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None\n    ) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the MaskedLMHead.\n\n        Args:\n            outputs: The outputs of the model.\n            labels: The labels for the head.\n            output_name: The name of the output to 
use.\n                Defaults to `self.output_name`.\n        \"\"\"\n        if isinstance(outputs, (Mapping, ModelOutput)):\n            output = outputs[output_name or self.output_name]\n        elif isinstance(outputs, tuple):\n            output = outputs[0]\n        else:\n            raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n        output = self.dropout(output)\n        output = self.transform(output)\n        output = self.decoder(output)\n        if self.activation is not None:\n            output = self.activation(output)\n        if labels is not None:\n            if isinstance(labels, NestedTensor):\n                if isinstance(output, Tensor):\n                    output = labels.nested_like(output, strict=False)\n                return HeadOutput(output, F.cross_entropy(output.concat, labels.concat))\n            return HeadOutput(output, F.cross_entropy(output.view(-1, self.num_labels), labels.view(-1)))\n        return HeadOutput(output)\n
"},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.output_name","title":"output_name class-attribute instance-attribute","text":"Python
output_name: str = 'last_hidden_state'\n

The default output to use for the head.

"},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward","title":"forward","text":"Python
forward(outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None) -> HeadOutput\n

Forward pass of the MaskedLMHead.

Parameters:

outputs (ModelOutput | Tuple[Tensor, ...]): The outputs of the model. Required.

labels (Tensor | None): The labels for the head. Default: None.

output_name (str | None): The name of the output to use; falls back to self.output_name when None. Default: None.

Source code in multimolecule/module/heads/pretrain.py Python
def forward(\n    self, outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None\n) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the MaskedLMHead.\n\n    Args:\n        outputs: The outputs of the model.\n        labels: The labels for the head.\n        output_name: The name of the output to use.\n            Defaults to `self.output_name`.\n    \"\"\"\n    if isinstance(outputs, (Mapping, ModelOutput)):\n        output = outputs[output_name or self.output_name]\n    elif isinstance(outputs, tuple):\n        output = outputs[0]\n    else:\n        raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n    output = self.dropout(output)\n    output = self.transform(output)\n    output = self.decoder(output)\n    if self.activation is not None:\n        output = self.activation(output)\n    if labels is not None:\n        if isinstance(labels, NestedTensor):\n            if isinstance(output, Tensor):\n                output = labels.nested_like(output, strict=False)\n            return HeadOutput(output, F.cross_entropy(output.concat, labels.concat))\n        return HeadOutput(output, F.cross_entropy(output.view(-1, self.num_labels), labels.view(-1)))\n    return HeadOutput(output)\n
"},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward(outputs)","title":"outputs","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.generic","title":"multimolecule.module.heads.generic","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead","title":"PredictionHead","text":"

Bases: Module

Head for tasks at all levels.

Parameters:

config (PreTrainedConfig): The configuration object for the model. Required.

head_config (HeadConfig | None): The configuration object for the head. If None, the head configuration is taken from config. Default: None.

Source code in multimolecule/module/heads/generic.py Python
class PredictionHead(nn.Module):\n    r\"\"\"\n    Head for all-level of tasks.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    num_labels: int\n    requires_attention: bool = False\n\n    def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n        super().__init__()\n        if head_config is None:\n            head_config = config.head or HeadConfig(num_labels=config.num_labels)\n        elif head_config.num_labels is None:\n            head_config.num_labels = config.num_labels\n        self.config = head_config\n        if self.config.hidden_size is None:\n            self.config.hidden_size = config.hidden_size\n        if self.config.problem_type is None:\n            self.config.problem_type = config.problem_type\n        self.bos_token_id = config.bos_token_id\n        self.eos_token_id = config.eos_token_id\n        self.pad_token_id = config.pad_token_id\n        self.num_labels = self.config.num_labels  # type: ignore[assignment]\n        self.dropout = nn.Dropout(self.config.dropout)\n        self.transform = HeadTransformRegistryHF.build(self.config)\n        self.decoder = nn.Linear(self.config.hidden_size, self.num_labels, bias=self.config.bias)\n        self.activation = ACT2FN[self.config.act] if self.config.act is not None else None\n        self.criterion = CriterionRegistry.build(self.config)\n\n    def forward(self, embeddings: Tensor, labels: Tensor | None, **kwargs) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the PredictionHead.\n\n        Args:\n            embeddings: The embeddings to be passed through the head.\n            labels: The labels for the head.\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"The following arguments are not applicable to {self.__class__.__name__}\"\n                f\"and 
will be ignored: {kwargs.keys()}\"\n            )\n        output = self.dropout(embeddings)\n        output = self.transform(output)\n        output = self.decoder(output)\n        if self.activation is not None:\n            output = self.activation(output)\n        if labels is not None:\n            if isinstance(labels, NestedTensor):\n                if isinstance(output, Tensor):\n                    output = labels.nested_like(output, strict=False)\n                return HeadOutput(output, self.criterion(output.concat, labels.concat))\n            return HeadOutput(output, self.criterion(output, labels))\n        return HeadOutput(output)\n\n    def _get_attention_mask(self, input_ids: NestedTensor | Tensor) -> Tensor:\n        if isinstance(input_ids, NestedTensor):\n            return input_ids.mask\n        if input_ids is None:\n            raise ValueError(\n                f\"Either attention_mask or input_ids must be provided for {self.__class__.__name__} to work.\"\n            )\n        if self.pad_token_id is None:\n            raise ValueError(\n                f\"pad_token_id must be provided when attention_mask is not passed to {self.__class__.__name__}.\"\n            )\n        return input_ids.ne(self.pad_token_id)\n\n    def _remove_special_tokens(\n        self, output: Tensor, attention_mask: Tensor, input_ids: Tensor | None\n    ) -> Tuple[Tensor, Tensor, Tensor]:\n        # remove cls token embeddings\n        if self.bos_token_id is not None:\n            output = output[..., 1:, :]\n            attention_mask = attention_mask[..., 1:]\n            if input_ids is not None:\n                input_ids = input_ids[..., 1:]\n        # remove eos token embeddings\n        if self.eos_token_id is not None:\n            if input_ids is not None:\n                eos_mask = input_ids.ne(self.eos_token_id).to(output)\n                input_ids = input_ids[..., :-1]\n            else:\n                last_valid_indices = 
attention_mask.sum(dim=-1)\n                seq_length = attention_mask.size(-1)\n                eos_mask = torch.arange(seq_length, device=output.device) == last_valid_indices.unsqueeze(1)\n            output = output * eos_mask[:, :, None]\n            output = output[..., :-1, :]\n            attention_mask = attention_mask[..., 1:]\n        return output, attention_mask, input_ids\n
"},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead.forward","title":"forward","text":"Python
forward(embeddings: Tensor, labels: Tensor | None, **kwargs) -> HeadOutput\n

Forward pass of the PredictionHead.

Parameters:

embeddings (Tensor): The embeddings to be passed through the head. Required.

labels (Tensor | None): The labels for the head. Required.

Source code in multimolecule/module/heads/generic.py (Python):
def forward(self, embeddings: Tensor, labels: Tensor | None, **kwargs) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the PredictionHead.\n\n    Args:\n        embeddings: The embeddings to be passed through the head.\n        labels: The labels for the head.\n    \"\"\"\n    if kwargs:\n        # note the leading space below: the two f-strings are concatenated\n        warn(\n            f\"The following arguments are not applicable to {self.__class__.__name__}\"\n            f\" and will be ignored: {kwargs.keys()}\"\n        )\n    output = self.dropout(embeddings)\n    output = self.transform(output)\n    output = self.decoder(output)\n    if self.activation is not None:\n        output = self.activation(output)\n    if labels is not None:\n        if isinstance(labels, NestedTensor):\n            if isinstance(output, Tensor):\n                output = labels.nested_like(output, strict=False)\n            return HeadOutput(output, self.criterion(output.concat, labels.concat))\n        return HeadOutput(output, self.criterion(output, labels))\n    return HeadOutput(output)\n
"},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead.forward(embeddings)","title":"embeddings","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.output","title":"multimolecule.module.heads.output","text":""},{"location":"module/heads/#multimolecule.module.heads.output.HeadOutput","title":"HeadOutput dataclass","text":"

Bases: ModelOutput

Output of a prediction head.

Parameters:

logits (FloatTensor): The prediction logits from the head. Required.

loss (FloatTensor | None): The loss from the head. Default: None.

Source code in multimolecule/module/heads/output.py (Python):
@dataclass\nclass HeadOutput(ModelOutput):\n    r\"\"\"\n    Output of a prediction head.\n\n    Args:\n        logits: The prediction logits from the head.\n        loss: The loss from the head.\n            Defaults to None.\n    \"\"\"\n\n    logits: FloatTensor\n    loss: FloatTensor | None = None\n
"},{"location":"module/heads/#multimolecule.module.heads.output.HeadOutput(logits)","title":"logits","text":""},{"location":"module/heads/#multimolecule.module.heads.output.HeadOutput(loss)","title":"loss","text":""},{"location":"tokenisers/","title":"tokenisers","text":"

tokenisers provides a collection of predefined tokenizers.

A tokenizer is a class that converts a sequence of nucleotides or amino acids into a sequence of indices. It is used to pre-process the input sequence before feeding it into a model.

Please refer to Tokenizer for more details.
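
As a rough illustration of the idea (not MultiMolecule's implementation), the mapping from characters to indices can be sketched in plain Python. The special-token order and the five-letter alphabet are assumptions chosen to mirror the doctest outputs shown on the tokenizer pages; build_vocab and encode are hypothetical helpers.

```python
# Minimal sketch of character-level tokenization (assumed layout, not the library code).
SPECIAL_TOKENS = ['<pad>', '<cls>', '<eos>', '<unk>', '<mask>', '<null>']


def build_vocab(alphabet):
    # Special tokens take the first indices, followed by the alphabet symbols.
    tokens = SPECIAL_TOKENS + list(alphabet)
    return {token: index for index, token in enumerate(tokens)}


def encode(sequence, vocab):
    # Upper-case the input, map unknown symbols to <unk>, and wrap in <cls> ... <eos>.
    unk = vocab['<unk>']
    ids = [vocab.get(char, unk) for char in sequence.upper()]
    return [vocab['<cls>']] + ids + [vocab['<eos>']]


vocab = build_vocab('ACGUN')
print(encode('acgu', vocab))  # [1, 6, 7, 8, 9, 2]
print(encode('acgt', vocab))  # [1, 6, 7, 8, 3, 2]
```

In the second call, T is not part of the assumed alphabet and no T-to-U replacement is applied, so it falls back to the unknown token.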

"},{"location":"tokenisers/#available-tokenizers","title":"Available Tokenizers","text":""},{"location":"tokenisers/dna/","title":"DnaTokenizer","text":"

DnaTokenizer tokenizes raw DNA nucleotides into tokens regardless of whether the input is uppercase or lowercase, uses T (Thymine) or U (Uracil), or includes special tokens. It also supports tokenization into nmers and codons, so you don\u2019t have to write complex code to preprocess your data.

By default, DnaTokenizer uses the standard alphabet. If nmers is greater than 1, or codon is set to True, it will instead use the streamline alphabet.
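
The difference between nmers and codon tokenization can be sketched as follows. This is a simplified stand-in that mirrors the windowing logic in the source code shown further down this page. The vocabulary layout (six special tokens followed by every 3-mer over ACGTN in product order) is an assumption, although it happens to reproduce the ids in the doctest examples.

```python
from itertools import product

ALPHABET = 'ACGTN'  # assumed streamline DNA alphabet
NUM_SPECIAL = 6     # assumed number of special tokens placed before the alphabet


def kmers(text, k):
    # Overlapping windows with stride 1, as in the nmers branch of _tokenize.
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def codons(text):
    # Non-overlapping windows of 3; the length must be a multiple of 3.
    if len(text) % 3 != 0:
        raise ValueError(f'length must be a multiple of 3, but got {len(text)}')
    return [text[i:i + 3] for i in range(0, len(text), 3)]


# Assumed vocabulary: every 3-mer over the alphabet, in product order.
VOCAB = {''.join(p): i + NUM_SPECIAL for i, p in enumerate(product(ALPHABET, repeat=3))}

print(kmers('TATAAAGTA', 3))   # ['TAT', 'ATA', 'TAA', 'AAA', 'AAG', 'AGT', 'GTA']
print(codons('TATAAAGTA'))     # ['TAT', 'AAA', 'GTA']
print([VOCAB[c] for c in codons('TATAAAGTA')])  # [84, 6, 71]
```

A sequence of length L yields L - k + 1 overlapping nmers but only L / 3 codons, which is why the codon output in the examples is shorter.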

MultiMolecule provides a set of predefined alphabets for tokenization.

"},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer","title":"multimolecule.tokenisers.DnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for DNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None): Alphabet to use for tokenization. Default: None.

nmers (int): Size of kmer to tokenize. Default: 1.

codon (bool): Whether to tokenize into codons. Default: False.

replace_U_with_T (bool): Whether to replace U with T. Default: True.

do_upper_case (bool): Whether to convert input to uppercase. Default: True.

Examples:

Python Console Session
>>> from multimolecule import DnaTokenizer\n>>> tokenizer = DnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHV.X*-')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = DnaTokenizer(replace_U_with_T=False)\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = DnaTokenizer(nmers=3)\n>>> tokenizer('tataaagta')[\"input_ids\"]\n[1, 84, 21, 81, 6, 8, 19, 71, 2]\n>>> tokenizer = DnaTokenizer(codon=True)\n>>> tokenizer('tataaagta')[\"input_ids\"]\n[1, 84, 6, 71, 2]\n>>> tokenizer('tataaagtaa')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/dna/tokenization_dna.py Python
class DnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for DNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `iupac`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_U_with_T: Whether to replace U with T.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import DnaTokenizer\n        >>> tokenizer = DnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHV.X*-')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = DnaTokenizer(replace_U_with_T=False)\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = DnaTokenizer(nmers=3)\n        >>> tokenizer('tataaagta')[\"input_ids\"]\n        [1, 84, 21, 81, 6, 8, 19, 71, 2]\n        >>> tokenizer = DnaTokenizer(codon=True)\n        >>> tokenizer('tataaagta')[\"input_ids\"]\n        [1, 84, 6, 71, 2]\n        >>> tokenizer('tataaagtaa')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n        
nmers: int = 1,\n        codon: bool = False,\n        replace_U_with_T: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_U_with_T=replace_U_with_T,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_U_with_T = replace_U_with_T\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_U_with_T:\n            text = text.replace(\"U\", \"T\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(nmers)","title":"nmers","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(codon)","title":"codon","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(replace_U_with_T)","title":"replace_U_with_T","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"tokenisers/dna/#standard-alphabet","title":"Standard Alphabet","text":"

The standard alphabet is an extended version of the IUPAC alphabet. It adds two symbols to the IUPAC alphabet, X and *.

gap

Note that we use . to represent a gap in the sequence.

While - exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.

Code Represents A Adenine C Cytosine G Guanine T Thymine N Unknown R A or G Y C or T S C or G W A or T K G or T M A or C B C, G, or T D A, G, or T H A, C, or T V A, C, or G . Gap X Any * Not Used - Not Used"},{"location":"tokenisers/dna/#iupac-alphabet","title":"IUPAC Alphabet","text":"

IUPAC nucleotide code is a standard nucleotide code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent DNA sequences.

In addition to the symbols of the streamline alphabet, it includes 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap.

Code Represents A Adenine C Cytosine G Guanine T Thymine R A or G Y C or T S C or G W A or T K G or T M A or C B C, G, or T D A, G, or T H A, C, or T V A, C, or G N A, C, G, or T . Gap

Note that we use . to represent a gap in the sequence.
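
To make the ambiguity codes concrete, here is a small sketch that expands a sequence containing IUPAC codes into every concrete DNA sequence it can stand for. The mapping follows the table above. The expand helper is hypothetical and not part of MultiMolecule, and the gap symbol is deliberately left out.

```python
from itertools import product

# IUPAC nucleotide codes for DNA, following the table above (gap symbol omitted).
IUPAC_DNA = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG',
    'N': 'ACGT',
}


def expand(sequence):
    # Enumerate every concrete sequence compatible with the ambiguity codes.
    choices = [IUPAC_DNA[code] for code in sequence.upper()]
    return [''.join(combo) for combo in product(*choices)]


print(expand('AR'))        # ['AA', 'AG']
print(len(expand('NNN')))  # 64
```

The number of expansions grows multiplicatively with each ambiguous position, so this is only practical for short sequences.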

"},{"location":"tokenisers/dna/#streamline-alphabet","title":"Streamline Alphabet","text":"

The streamline alphabet adds one symbol to the nucleobase alphabet: N, which represents an unknown nucleobase.

Code Nucleotide A Adenine C Cytosine G Guanine T Thymine N Unknown"},{"location":"tokenisers/dna/#nucleobase-alphabet","title":"Nucleobase Alphabet","text":"

The nucleobase alphabet is a minimal version of the DNA alphabet that includes only the four canonical nucleotides A, C, G, and T.

Code Nucleotide A Adenine C Cytosine G Guanine T Thymine"},{"location":"tokenisers/dot_bracket/","title":"DotBracketTokenizer","text":"

DotBracketTokenizer provides a simple way to tokenize secondary structure in dot-bracket notation. It also supports tokenization into nmers and codons, so you don\u2019t have to write complex code to preprocess your data.

By default, DotBracketTokenizer uses the standard alphabet. If nmers is greater than 1, or codon is set to True, it will instead use the streamline alphabet.

MultiMolecule provides a set of predefined alphabets for tokenization.

"},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer","title":"multimolecule.tokenisers.DotBracketTokenizer","text":"

Bases: Tokenizer

Tokenizer for Secondary Structure sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None): Alphabet to use for tokenization. Default: None.

nmers (int): Size of kmer to tokenize. Default: 1.

codon (bool): Whether to tokenize into codons. Default: False.

Examples:

Python Console Session
>>> from multimolecule import DotBracketTokenizer\n>>> tokenizer = DotBracketTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]\n>>> tokenizer('(.)')[\"input_ids\"]\n[1, 7, 6, 8, 2]\n>>> tokenizer('+(.)')[\"input_ids\"]\n[1, 9, 7, 6, 8, 2]\n>>> tokenizer = DotBracketTokenizer(nmers=3)\n>>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n[1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]\n>>> tokenizer = DotBracketTokenizer(codon=True)\n>>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n[1, 27, 29, 6, 6, 6, 16, 48, 2]\n>>> tokenizer('(((((+...........)))))')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22\n
Source code in multimolecule/tokenisers/dot_bracket/tokenization_db.py Python
class DotBracketTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for Secondary Structure sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard Secondary Structure alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `iupac`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n\n    Examples:\n        >>> from multimolecule import DotBracketTokenizer\n        >>> tokenizer = DotBracketTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]\n        >>> tokenizer('(.)')[\"input_ids\"]\n        [1, 7, 6, 8, 2]\n        >>> tokenizer('+(.)')[\"input_ids\"]\n        [1, 9, 7, 6, 8, 2]\n        >>> tokenizer = DotBracketTokenizer(nmers=3)\n        >>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n        [1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]\n        >>> tokenizer = DotBracketTokenizer(codon=True)\n        >>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n        [1, 27, 29, 6, 6, 6, 16, 48, 2]\n        >>> tokenizer('(((((+...........)))))')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n        nmers: int = 1,\n        codon: bool = False,\n        additional_special_tokens: List | Tuple | 
None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer(alphabet)","title":"alphabet","text":""},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer(nmers)","title":"nmers","text":""},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer(codon)","title":"codon","text":""},{"location":"tokenisers/dot_bracket/#standard-alphabet","title":"Standard Alphabet","text":"

The standard alphabet is an extended version of the Extended Dot-Bracket Notation. This extension includes most symbols from the WUSS notation for better compatibility with existing tools.

Code Represents . unpaired ( internal helices of all terminal stems ) internal helices of all terminal stems + nick between strand , unpaired in multibranch loops [ internal helices that includes at least one annotated () stem ] internal helices that includes at least one annotated () stem { all internal helices of deeper multifurcations } all internal helices of deeper multifurcations | mostly paired < simple terminal stems > simple terminal stems - bulges and interior loops _ unpaired : single stranded in the exterior loop ~ local structural alignment left regions of target and query unaligned $ Not Used @ Not Used ^ Not Used % Not Used * Not Used"},{"location":"tokenisers/dot_bracket/#extended-alphabet","title":"Extended Alphabet","text":"

Extended Dot-Bracket Notation is a more generalized version of the original Dot-Bracket notation. It may use additional pairs of brackets to annotate pseudoknots, since different pairs of brackets are not required to be nested.

Code Represents . unpaired ( internal helices of all terminal stems ) internal helices of all terminal stems + nick between strand , unpaired in multibranch loops [ internal helices that includes at least one annotated () stem ] internal helices that includes at least one annotated () stem { all internal helices of deeper multifurcations } all internal helices of deeper multifurcations | mostly paired < simple terminal stems > simple terminal stems

Note that we use . to represent an unpaired position in the structure.
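
Because different bracket families need not nest with each other, base pairs can be recovered with one stack per family. The sketch below assumes that (), [], {} and <> are each matched independently and that every other symbol is unpaired. The base_pairs helper is hypothetical and not part of MultiMolecule.

```python
PAIRS = {')': '(', ']': '[', '}': '{', '>': '<'}
OPENERS = set(PAIRS.values())


def base_pairs(structure):
    # One stack per bracket family, so different families may cross (pseudoknots).
    stacks = {opener: [] for opener in OPENERS}
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol in OPENERS:
            stacks[symbol].append(index)
        elif symbol in PAIRS:
            stack = stacks[PAIRS[symbol]]
            if not stack:
                raise ValueError(f'unmatched {symbol} at position {index}')
            pairs.append((stack.pop(), index))
        # Any other symbol (for example . or +) is treated as unpaired here.
    if any(stacks.values()):
        raise ValueError('unmatched opening bracket')
    return sorted(pairs)


print(base_pairs('((..))'))   # [(0, 5), (1, 4)]
print(base_pairs('(.[.).]'))  # [(0, 4), (2, 6)]
```

In the second call the two pairs cross, which a single shared stack would reject but separate stacks accept.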

"},{"location":"tokenisers/dot_bracket/#streamline-alphabet","title":"Streamline Alphabet","text":"

The streamline alphabet is a minimal version of the dot-bracket alphabet that includes only four symbols: ., (, ), and +.

Code Represents . unpaired ( internal helices of all terminal stems ) internal helices of all terminal stems + nick between strand"},{"location":"tokenisers/protein/","title":"ProteinTokenizer","text":"

ProteinTokenizer tokenizes raw amino acids into tokens regardless of whether the input is uppercase or lowercase, or includes special tokens.

By default, ProteinTokenizer uses the standard alphabet.

MultiMolecule provides a set of predefined alphabets for tokenization.

"},{"location":"tokenisers/protein/#multimolecule.tokenisers.ProteinTokenizer","title":"multimolecule.tokenisers.ProteinTokenizer","text":"

Bases: Tokenizer

Tokenizer for Protein sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None): Alphabet to use for tokenization. Default: None.

do_upper_case (bool): Whether to convert input to uppercase. Default: True.

Examples:

Python Console Session
>>> from multimolecule import ProteinTokenizer\n>>> tokenizer = ProteinTokenizer()\n>>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')[\"input_ids\"]\n[1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]\n>>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]\n>>> tokenizer('manlgcwmlv')[\"input_ids\"]\n[1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]\n
Source code in multimolecule/tokenisers/protein/tokenization_protein.py Python
class ProteinTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for Protein sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If it is `None`, the standard protein alphabet will be used.\n            - If it is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `iupac`\n                + `streamline`\n            - If it is an alphabet or a list of characters, that specific alphabet will be used.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import ProteinTokenizer\n        >>> tokenizer = ProteinTokenizer()\n        >>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]\n        >>> tokenizer('manlgcwmlv')[\"input_ids\"]\n        [1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet)\n        super().__init__(\n            alphabet=alphabet,\n            additional_special_tokens=additional_special_tokens,\n            do_upper_case=do_upper_case,\n            **kwargs,\n        )\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        return list(text)\n
"},{"location":"tokenisers/protein/#multimolecule.tokenisers.ProteinTokenizer(alphabet)","title":"alphabet","text":""},{"location":"tokenisers/protein/#multimolecule.tokenisers.ProteinTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"tokenisers/protein/#standard-alphabet","title":"Standard Alphabet","text":"

The standard alphabet is an extended version of the IUPAC alphabet. It adds six symbols to the IUPAC alphabet: J, U, O, ., -, and *.

Amino Acid Code Three letter Code Amino Acid A Ala Alanine C Cys Cysteine D Asp Aspartic Acid E Glu Glutamic Acid F Phe Phenylalanine G Gly Glycine H His Histidine I Ile Isoleucine K Lys Lysine L Leu Leucine M Met Methionine N Asn Asparagine P Pro Proline Q Gln Glutamine R Arg Arginine S Ser Serine T Thr Threonine V Val Valine W Trp Tryptophan Y Tyr Tyrosine X Xaa Any amino acid Z Glx Glutamine (Q) or Glutamic acid (E) B Asx Aspartic acid (D) or Asparagine (N) J Xle Leucine (L) or Isoleucine (I) U Sec Selenocysteine O Pyl Pyrrolysine . \u2026 Not Used * *** Not Used - \u2014 Not Used"},{"location":"tokenisers/protein/#iupac-alphabet","title":"IUPAC Alphabet","text":"

IUPAC amino acid code is a standard amino acid code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent Protein sequences.

The IUPAC amino acid code adds two symbols to the streamline alphabet, B and Z. X is already part of the streamline alphabet.

Amino Acid Code Three letter Code Amino Acid A Ala Alanine B Asx Aspartic acid (D) or Asparagine (N) C Cys Cysteine D Asp Aspartic Acid E Glu Glutamic Acid F Phe Phenylalanine G Gly Glycine H His Histidine I Ile Isoleucine K Lys Lysine L Leu Leucine M Met Methionine N Asn Asparagine P Pro Proline Q Gln Glutamine R Arg Arginine S Ser Serine T Thr Threonine V Val Valine W Trp Tryptophan Y Tyr Tyrosine X Xaa Any amino acid Z Glx Glutamine (Q) or Glutamic acid (E)"},{"location":"tokenisers/protein/#streamline-alphabet","title":"Streamline Alphabet","text":"

The streamline alphabet is a simplified version of the standard alphabet.

Amino Acid Code Three letter Code Amino Acid A Ala Alanine C Cys Cysteine D Asp Aspartic Acid E Glu Glutamic Acid F Phe Phenylalanine G Gly Glycine H His Histidine I Ile Isoleucine K Lys Lysine L Leu Leucine M Met Methionine N Asn Asparagine P Pro Proline Q Gln Glutamine R Arg Arginine S Ser Serine T Thr Threonine V Val Valine W Trp Tryptophan Y Tyr Tyrosine X Xaa Any amino acid"},{"location":"tokenisers/rna/","title":"RnaTokenizer","text":"

RnaTokenizer tokenizes raw RNA nucleotides into tokens regardless of whether the input is uppercase or lowercase, uses U (Uracil) or T (Thymine), or includes special tokens. It also supports tokenization into nmers and codons, so you don\u2019t have to write complex code to preprocess your data.

By default, RnaTokenizer uses the standard alphabet. If nmers is greater than 1, or codon is set to True, it will instead use the streamline alphabet.
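
The input normalization that makes the tokenizer indifferent to case and to T versus U can be sketched like this. It follows the order of operations in the _tokenize source shown further down this page (upper-casing first, then replacement). The normalize helper and its molecule argument are hypothetical and exist only for this illustration.

```python
def normalize(text, molecule='rna', do_upper_case=True, replace=True):
    # Mirrors the first steps of _tokenize: optional upper-casing, then T/U replacement.
    if do_upper_case:
        text = text.upper()
    if replace:
        # Only upper-case letters are replaced, exactly as in the source shown on this page.
        text = text.replace('T', 'U') if molecule == 'rna' else text.replace('U', 'T')
    return text


print(normalize('acgt'))                       # ACGU
print(normalize('acgu', molecule='dna'))       # ACGT
print(normalize('acgt', do_upper_case=False))  # acgt
```

The last call shows a consequence of this ordering: with upper-casing disabled, a lowercase t is left untouched because only the upper-case letter is replaced.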

MultiMolecule provides a set of predefined alphabets for tokenization.

"},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer","title":"multimolecule.tokenisers.RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None): Alphabet to use for tokenization. Default: None.

nmers (int): Size of kmer to tokenize. Default: 1.

codon (bool): Whether to tokenize into codons. Default: False.

replace_T_with_U (bool): Whether to replace T with U. Default: True.

do_upper_case (bool): Whether to convert input to uppercase. Default: True.

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(codon)","title":"codon","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"tokenisers/rna/#standard-alphabet","title":"Standard Alphabet","text":"

The standard alphabet is an extended version of the IUPAC alphabet. It adds three symbols to the IUPAC alphabet: I, X, and *.

gap

Note that we use . to represent a gap in the sequence.

While - exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.

Code Represents A Adenine C Cytosine G Guanine U Uracil N Unknown R A or G Y C or U S C or G W A or U K G or U M A or C B C, G, or U D A, G, or U H A, C, or U V A, C, or G . Gap X Any * Not Used - Not Used I Inosine"},{"location":"tokenisers/rna/#iupac-alphabet","title":"IUPAC Alphabet","text":"

The IUPAC nucleotide code is a standard nucleotide code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent RNA sequences.

In addition to the symbols of the streamline alphabet, it includes 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap.

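| Code | Represents |
|------|------------|
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| U | Uracil |
| R | A or G |
| Y | C or U |
| S | G or C |
| W | A or U |
| K | G or U |
| M | A or C |
| B | C, G, or U |
| D | A, G, or U |
| H | A, C, or U |
| V | A, C, or G |
| N | A, C, G, or U |
| . | Gap |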

Note that we use . to represent a gap in the sequence.
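As a rough illustration of how the ambiguity codes in the IUPAC table can be read, the sketch below expands an ambiguous sequence into every concrete sequence it may stand for. The `IUPAC_RNA` mapping is transcribed from the table; the `expand` helper is hypothetical and is not part of MultiMolecule.

```python
from itertools import product

# Ambiguity codes transcribed from the IUPAC alphabet table.
IUPAC_RNA = {
    'A': 'A', 'C': 'C', 'G': 'G', 'U': 'U',
    'R': 'AG', 'Y': 'CU', 'S': 'GC', 'W': 'AU', 'K': 'GU', 'M': 'AC',
    'B': 'CGU', 'D': 'AGU', 'H': 'ACU', 'V': 'ACG', 'N': 'ACGU',
}


def expand(sequence):
    # Hypothetical helper: enumerate the concrete sequences an ambiguous
    # sequence can represent. Symbols missing from the mapping, such as
    # the gap symbol '.', are kept unchanged.
    choices = [IUPAC_RNA.get(code, code) for code in sequence.upper()]
    return [''.join(combo) for combo in product(*choices)]


expanded = expand('AR.Y')
```

The number of expansions is the product of the option counts per position, so this enumeration is only practical for short sequences with few ambiguous positions.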

"},{"location":"tokenisers/rna/#streamline-alphabet","title":"Streamline Alphabet","text":"

The streamline alphabet adds one symbol to the nucleobase alphabet: N, which represents an unknown nucleobase.

| Code | Nucleotide |
|------|------------|
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| U | Uracil |
| N | Unknown |
"},{"location":"tokenisers/rna/#nucleobase-alphabet","title":"Nucleobase Alphabet","text":"

The nucleobase alphabet is a minimal version of the RNA alphabet that includes only the four canonical nucleotides A, C, G, and U.

| Code | Nucleotide |
|------|------------|
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| U | Uracil |
"},{"location":"zh/","title":"MultiMolecule","text":"

Accelerate Molecular Biology Research with Machine Learning

"},{"location":"zh/#_1","title":"Introduction","text":"

Welcome to MultiMolecule (\u6d66\u539f), a foundational library designed to accelerate scientific research in molecular biology through machine learning. MultiMolecule provides a comprehensive yet flexible set of tools that help researchers leverage AI with ease, focusing primarily on biomolecular data (RNA, DNA, and protein).

"},{"location":"zh/#_2","title":"Overview","text":"

MultiMolecule is designed with flexibility and ease of use at its core. Its modular design lets you use only the components you need and integrates seamlessly into existing workflows without adding unnecessary complexity.

"},{"location":"zh/#_3","title":"Installation","text":"

Install the most recent stable version from PyPI:

Bash
pip install multimolecule\n

Install the latest version from source:

Bash
pip install git+https://github.com/DLS5-Omics/MultiMolecule\n
"},{"location":"zh/#_4","title":"Citation","text":"

If you use MultiMolecule in your research, please cite us as follows:

BibTeX
@software{chen_2024_12638419,\n  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},\n  title     = {MultiMolecule},\n  doi       = {10.5281/zenodo.12638419},\n  publisher = {Zenodo},\n  url       = {https://doi.org/10.5281/zenodo.12638419},\n  year      = 2024,\n  month     = may,\n  day       = 4\n}\n

Caution

The MultiMolecule project is licensed under the GNU Affero General Public License. Research papers are considered derivative works and must therefore be licensed under the same terms.

You may publish research papers that use MultiMolecule only in fully open-access journals, conferences, or preprint servers that are free to publish in and free to read. To publish such papers in closed-access or author-fee journals, conferences, or preprint servers, you must obtain a waiver from the authors.

You may receive an automatic waiver if you submit to one of the following non-profit journals:

See the License FAQ for details.

"},{"location":"zh/#_5","title":"License","text":"

We believe openness is the foundation of research.

MultiMolecule is licensed under the GNU Affero General Public License.

Please join us in building an open research community.

SPDX-License-Identifier: AGPL-3.0-or-later

"},{"location":"zh/about/","title":"About","text":"

Developed by DanLing on Earth

We are a community of developers, designers, and others dedicated to making deep learning technology more open.

We are a community of individuals dedicated to pushing the boundaries of what is possible with deep learning.

We are passionate about deep learning and the people who use it.

We are DanLing.

"},{"location":"zh/about/license-faq/","title":"License FAQ","text":"

Translation

This content is a translated version provided for the convenience of users. We have made every effort to ensure the accuracy of the translation. Please note, however, that the translation may contain errors and is for reference only. The English original prevails.

For compliance and enforcement purposes, any inaccuracies or ambiguities in the translated document are not binding and have no legal effect.

"},{"location":"zh/about/license-faq/#_1","title":"License FAQ","text":"

This License FAQ explains the terms and conditions under which you may use the data, models, code, configuration, documentation, and weights provided by the DanLing Team (also known as DanLing) (\u201cwe\u201d or \u201cour\u201d). It serves as an addendum to our License.

"},{"location":"zh/about/license-faq/#0","title":"0. Summary of Key Points","text":"

This summary lists the key points of the FAQ. For more detail on any of them, follow the link after each key point or use the table of contents to find the section you are looking for.

What constitutes the \u201csource code\u201d in MultiMolecule?

We consider everything in our repositories to be source code, including data, models, code, configuration, and documentation.

Can I publish research papers using MultiMolecule?

It depends.

You may publish research papers in fully open-access journals, conferences, or preprint servers under the terms of the License.

To publish research papers in closed-access journals and conferences, you must obtain a separate license from us.

Can I use MultiMolecule for commercial purposes?

Yes, you may use MultiMolecule for commercial purposes under the terms of the License.

Do people affiliated with certain organizations have specific license terms?

Yes, people affiliated with certain organizations have specific license terms.

"},{"location":"zh/about/license-faq/#1-multimolecule","title":"1. What constitutes the \u201csource code\u201d in MultiMolecule?","text":"

We consider everything in our repositories to be source code.

Training a machine learning model is regarded as analogous to compiling traditional software. The model, code, configuration, documentation, and training data are therefore considered part of the source code, while the trained model weights are considered part of the object code.

We also consider research papers and manuscripts to be a special form of documentation, which is likewise part of the source code.

"},{"location":"zh/about/license-faq/#2-multimolecule","title":"2. Can I publish research papers using MultiMolecule?","text":"

Because research papers are considered a form of source code, a publisher that publishes a paper using MultiMolecule would have to open-source all materials on its servers to comply with the License. This is impractical for most publishers.

As a special exemption under Section 7 of the License, we permit research papers that use MultiMolecule to be published in fully open-access journals, conferences, or preprint servers that charge authors no fees, provided that every published manuscript is made available under the GNU Free Documentation License (GFDL), a Creative Commons license, or an OSI-approved license that permits the sharing of manuscripts.

As a special exemption under Section 7 of the License, we also permit research papers that use MultiMolecule to be published in certain non-profit journals, conferences, or preprint servers. Currently, the non-profit journals, conferences, or preprint servers we allow include:

To publish a paper in a closed-access journal or conference, you must obtain a separate license from us. This typically involves co-authorship, a fee to support the project, or both. Contact us at multimolecule@zyc.ai for more information.

Although it is not mandatory, we recommend citing the MultiMolecule project in your research papers.

"},{"location":"zh/about/license-faq/#3-multimolecule","title":"3. Can I use MultiMolecule for commercial purposes?","text":"

Yes, you may use MultiMolecule for commercial purposes under the License. However, you must open-source any modifications to the source code and make them available under the License.

If you wish to use MultiMolecule for commercial purposes without open-sourcing your modifications, you must obtain a separate license from us. This typically involves a fee to support the project. Contact us at multimolecule@zyc.ai for more details.

"},{"location":"zh/about/license-faq/#4","title":"4. Do people affiliated with certain organizations have specific license terms?","text":"

Yes!

If you are affiliated with an organization that has a separate license agreement with us, you may be subject to different license terms. Consult your organization's legal department to determine whether a separate license agreement applies to you.

Members of the following organizations automatically receive a non-transferable, non-sublicensable, non-distributable MIT License to use MultiMolecule:

This special license is considered an additional term under Section 7 of the License. It is not redistributable, and you are prohibited from creating any independent derivative work. Any modification or derivative work based on this license is automatically considered a derivative work of MultiMolecule and must comply with all terms of the License. This ensures that third parties cannot bypass the license terms or create separate licenses from derivative works.

"},{"location":"zh/about/license-faq/#5-agpl-multimolecule","title":"5. How can I use MultiMolecule if my organization forbids the use of code under the AGPL License?","text":"

Some organizations (such as Google) have policies that prohibit the use of code under the AGPL License.

If you are affiliated with an organization that prohibits the use of AGPL-licensed code, you must obtain a separate license from us. Contact us at multimolecule@zyc.ai for more details.

"},{"location":"zh/about/license-faq/#6-multimolecule","title":"6. Can I use MultiMolecule if I am a federal employee of the United States Government?","text":"

No.

Under 17 U.S. Code \u00a7 105, code written by federal employees of the United States Government is not protected by copyright.

Federal employees of the United States Government therefore cannot comply with the terms of the License.

"},{"location":"zh/about/license-faq/#7","title":"7. Do we make updates to this FAQ?","text":"

In Short

Yes, we will update this FAQ as necessary to stay consistent with relevant laws.

We may update this License FAQ from time to time. The updated version will be indicated by an updated \u201cLast Revised Time\u201d at the bottom of this page. If we make any material changes, we will notify you by posting the new License FAQ on this page. Because we do not collect any contact information from you, we cannot notify you directly. We encourage you to review this License FAQ frequently to stay informed of how you may use our data, models, code, configuration, documentation, and weights.

"},{"location":"zh/data/","title":"data","text":"

data provides a collection of utilities for handling data.

Although datasets is a powerful library for managing datasets, it is a general-purpose tool and may not cover every feature specific to scientific applications.

The data package is designed to complement datasets by providing data processing utilities commonly used in scientific tasks.

"},{"location":"zh/data/#_1","title":"Usage","text":""},{"location":"zh/data/#_2","title":"Load from local data file","text":"Python
from multimolecule.data import Dataset\n\ndata = Dataset(\"data/rna/5utr.csv\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"zh/data/#datasets","title":"Load from datasets","text":"Python
from multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"zh/datasets/","title":"datasets","text":"

datasets provides a collection of widely used datasets.

"},{"location":"zh/datasets/#_1","title":"Available Datasets","text":""},{"location":"zh/datasets/#dna","title":"Deoxyribonucleic Acid (DNA)","text":""},{"location":"zh/datasets/#rna","title":"Ribonucleic Acid (RNA)","text":""},{"location":"zh/datasets/#_2","title":"Usage","text":""},{"location":"zh/datasets/#multimolecule","title":"Load with MultiMolecule","text":"Python
from multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"zh/models/","title":"models","text":"

models provides a collection of pre-trained models.

"},{"location":"zh/models/#_1","title":"Model Class","text":"

In the transformers library, model class names can sometimes be misleading. Although these classes support both regression and classification tasks, their names usually include xxxForSequenceClassification, which may suggest that they can only be used for classification.

To avoid this ambiguity, MultiMolecule provides a set of model classes with clear, intuitive names that reflect their intended use:

Each of these models supports both regression and classification tasks, offering flexibility and precision for a wide range of applications.

"},{"location":"zh/models/#_2","title":"Contact Prediction","text":"

Contact prediction assigns a label to each pair of tokens in a sequence. One of the most common contact prediction tasks is protein distance map prediction, which attempts to find the distance between every possible pair of amino acid residues in a three-dimensional protein structure.

"},{"location":"zh/models/#_3","title":"Nucleotide Prediction","text":"

Similar to Token Classification, but the <bos> and <eos> tokens are removed if they are defined in the model config.

<bos> and <eos> tokens

In the tokenizers provided by MultiMolecule, the <bos> token points to the <cls> token, and the <sep> token points to the <eos> token.
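
A minimal sketch of the special-token trimming described in this section, assuming one value per token, at most one `<bos>` at the start, and at most one `<eos>` at the end. The helper `strip_special_tokens` and its parameters are hypothetical; the sketch ignores batching and padding and is not the library's implementation.

```python
def strip_special_tokens(values, bos_token_id=None, eos_token_id=None):
    # values: one entry per input token, e.g. [<bos>, n1, ..., nk, <eos>].
    # A token id of None means the model config does not define that token,
    # so nothing is removed on that side.
    start = 1 if bos_token_id is not None else 0
    stop = len(values) - 1 if eos_token_id is not None else len(values)
    return values[start:stop]


per_token = ['cls', 'U', 'A', 'G', 'sep']
trimmed = strip_special_tokens(per_token, bos_token_id=1, eos_token_id=2)
untouched = strip_special_tokens(per_token)
```

With both tokens defined, the output has one entry per nucleotide, which is what distinguishes nucleotide-level prediction from plain token classification in the description above.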

"},{"location":"zh/models/#_4","title":"Usage","text":""},{"location":"zh/models/#multimoleculeautomodel","title":"Build with multimolecule.AutoModel","text":"Python
from transformers import AutoTokenizer\n\nfrom multimolecule import AutoModelForSequencePrediction\n\nmodel = AutoModelForSequencePrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"zh/models/#_5","title":"\u76f4\u63a5\u200b\u8bbf\u95ee","text":"

All models can be loaded directly with the from_pretrained method.

Python
from multimolecule.models import RnaFmForTokenPrediction, RnaTokenizer\n\nmodel = RnaFmForTokenPrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"zh/models/#transformersautomodel","title":"\u4f7f\u7528\u200b transformers.AutoModel \u200b\u6784\u5efa","text":"

Although we use a different naming convention for model classes, the models are still registered with the corresponding transformers.AutoModel.

Python
from transformers import AutoModelForSequenceClassification, AutoTokenizer\n\nimport multimolecule  # noqa: F401\n\nmodel = AutoModelForSequenceClassification.from_pretrained(\"multimolecule/mrnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/mrnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n

Import multimolecule before use

Note that you must import multimolecule before building a model with transformers.AutoModel. Models are registered in the multimolecule package and are not available in the transformers package.

If multimolecule has not been imported before transformers.AutoModel is used, the following error is raised:

Python
ValueError: The checkpoint you are trying to load has model type `rnafm` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.\n
"},{"location":"zh/models/#_6","title":"\u521d\u59cb\u5316\u200b\u4e00\u4e2a\u200b\u9999\u8349\u200b\u6a21\u578b","text":"

You can also initialize a base model using the model class.

Python
from multimolecule.models import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n\nconfig = RnaFmConfig()\nmodel = RnaFmForTokenPrediction(config)\ntokenizer = RnaTokenizer()\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"zh/models/#_7","title":"\u53ef\u7528\u200b\u6a21\u578b","text":""},{"location":"zh/models/#dna","title":"\u8131\u6c27\u6838\u7cd6\u6838\u9178\u200b\uff08DNA\uff09","text":""},{"location":"zh/models/#rna","title":"\u6838\u7cd6\u6838\u9178\u200b\uff08RNA\uff09","text":""},{"location":"zh/module/","title":"module","text":"

module provides a collection of predefined modules for users to implement their own architectures.

MultiMolecule is built on the ecosystem, embracing a similar design philosophy: Don't Repeat Yourself. We follow the single model file policy, under which each model in the models package contains one and only one modeling.py file that describes the network design.

The module package is intended to provide simple, reusable modules that are consistent across multiple models. This approach minimizes code duplication and promotes clean, maintainable code.

"},{"location":"zh/module/#_1","title":"\u6838\u5fc3\u200b\u7279\u6027","text":""},{"location":"zh/module/#modules","title":"Modules","text":""},{"location":"zh/module/embeddings/","title":"embeddings","text":"

embeddings provides a collection of predefined positional encodings.

"},{"location":"zh/module/heads/","title":"heads","text":"

heads provides a collection of model prediction heads that handle different tasks.

heads accept a ModelOutput, a dict, or a tuple as input. A head automatically looks up the model output it needs for prediction and processes it accordingly.

Some prediction heads, such as ContactPredictionHead, may require additional information, for example attention_mask or input_ids. These can be passed in as positional or keyword arguments.

Note that heads use the same ModelOutput conventions as Transformers. If the model output is a tuple, we treat the first element as the pooler_output, the second element as the last_hidden_state, and the last element as the attention_map. It is the user's responsibility to ensure that the model output is correctly formatted.

If the model output is a ModelOutput or a dict, heads look for HeadConfig.output_name in the model output. You can specify output_name in HeadConfig to make sure that heads can locate the required tensor.
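
The lookup rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `select_output` and `TUPLE_POSITIONS` are hypothetical names, and strings stand in for tensors.

```python
# Tuple convention as described in the text above:
# first = pooler_output, second = last_hidden_state, last = attention_map.
TUPLE_POSITIONS = {'pooler_output': 0, 'last_hidden_state': 1, 'attention_map': -1}


def select_output(outputs, output_name='pooler_output'):
    # Hypothetical helper: pick the value a head needs from a model output.
    if isinstance(outputs, tuple):
        return outputs[TUPLE_POSITIONS[output_name]]
    # dict-like outputs are looked up by name.
    return outputs[output_name]


as_tuple = ('pooled', 'hidden', 'attention')
as_dict = {'pooler_output': 'pooled', 'last_hidden_state': 'hidden'}
from_tuple = select_output(as_tuple, 'last_hidden_state')
from_dict = select_output(as_dict, 'pooler_output')
```

In this sketch a tuple in a different order would silently return the wrong element, which is why the text leaves correct formatting of tuple outputs to the user.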

"},{"location":"zh/tokenisers/","title":"tokenisers","text":"

tokenisers provides a collection of predefined tokenizers.

A tokenizer is a class that converts a sequence of nucleotides or amino acids into a sequence of indices. It is used to preprocess input sequences before they are fed to the model.

See Tokenizer for more details.
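
To make the idea of converting a sequence into indices concrete, here is a toy sketch. The vocabulary, its index values, and the `encode` function are assumptions for illustration only; they do not reflect the actual vocabulary or API of the tokenizers shipped with MultiMolecule.

```python
# Hypothetical vocabulary; real tokenizers define their own tokens and indices.
VOCAB = {'<cls>': 0, '<eos>': 1, '<unk>': 2, 'A': 3, 'C': 4, 'G': 5, 'U': 6}


def encode(sequence):
    # Wrap the sequence in <cls> ... <eos> and map each base to its index.
    ids = [VOCAB['<cls>']]
    ids += [VOCAB.get(base, VOCAB['<unk>']) for base in sequence.upper()]
    ids.append(VOCAB['<eos>'])
    return ids


input_ids = encode('UAGCGU')
```

Characters outside the vocabulary fall back to the `<unk>` index in this sketch, so an unexpected symbol does not raise an error.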

"},{"location":"zh/tokenisers/#_1","title":"\u53ef\u7528\u200b\u4ee4\u724c\u200b\u5668","text":""}]} \ No newline at end of file +{"config":{"lang":["en","zh"],"separator":"[\\s\\u200b\\-]","pipeline":["stemmer"]},"docs":[{"location":"","title":"MultiMolecule","text":"

Accelerate Molecular Biology Research with Machine Learning

"},{"location":"#introduction","title":"Introduction","text":"

Welcome to MultiMolecule (\u200b\u6d66\u539f\u200b), a foundational library designed to accelerate scientific research in molecular biology through machine learning. MultiMolecule provides a comprehensive yet flexible set of tools for researchers aiming to leverage AI with ease, focusing on biomolecular data (RNA, DNA, and protein).

"},{"location":"#overview","title":"Overview","text":"

MultiMolecule is built with flexibility and ease of use in mind. Its modular design allows you to utilize only the components you need, integrating seamlessly into your existing workflows without adding unnecessary complexity.

"},{"location":"#installation","title":"Installation","text":"

Install the latest stable release from PyPI:

Bash
pip install multimolecule\n

Install the latest development version from source:

Bash
pip install git+https://github.com/DLS5-Omics/MultiMolecule\n
"},{"location":"#citation","title":"Citation","text":"

If you use MultiMolecule in your research, please cite us as follows:

BibTeX
@software{chen_2024_12638419,\n  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},\n  title     = {MultiMolecule},\n  doi       = {10.5281/zenodo.12638419},\n  publisher = {Zenodo},\n  url       = {https://doi.org/10.5281/zenodo.12638419},\n  year      = 2024,\n  month     = may,\n  day       = 4\n}\n
"},{"location":"#license","title":"License","text":"

We believe openness is the Foundation of Research.

MultiMolecule is licensed under the GNU Affero General Public License.

Please join us in building an open research community.

SPDX-License-Identifier: AGPL-3.0-or-later

"},{"location":"about/","title":"About","text":"

Developed by DanLing on Earth

We are a community of developers, designers, and others from around the world who are working together to make deep learning more accessible.

We are a community of individuals who seek to push the boundaries of what is possible with deep learning.

We are passionate about Deep Learning and the people who use it.

We are DanLing.

"},{"location":"about/license-faq/","title":"License FAQ","text":"

This License FAQ explains the terms and conditions under which you may use the data, models, code, configuration, documentation, and weights provided by the DanLing Team (also known as DanLing) (\u2018we\u2019, \u2018us\u2019, or \u2018our\u2019). It serves as an addendum to our License.

"},{"location":"about/license-faq/#0-summary-of-key-points","title":"0. Summary of Key Points","text":"

This summary provides key points from our license, but you can find out more details about any of these topics by clicking the link following each key point and by reading the full license.

What constitutes the \u2018source code\u2019 in MultiMolecule?

We consider everything in our repositories to be source code, including data, models, code, configuration, and documentation.

What constitutes the \u2018source code\u2019 in MultiMolecule?

Can I publish research papers using MultiMolecule?

It depends.

You can publish research papers on fully open access journals and conferences or preprint servers following the terms of the License.

You must obtain a separate license from us to publish research papers on closed access journals and conferences.

Can I publish research papers using MultiMolecule?

Can I use MultiMolecule for commercial purposes?

Yes, you can use MultiMolecule for commercial purposes under the terms of the License.

Can I use MultiMolecule for commercial purposes?

Do people affiliated with certain organizations have specific license terms?

Yes, people affiliated with certain organizations have specific license terms.

Do people affiliated with certain organizations have specific license terms?

"},{"location":"about/license-faq/#1-what-constitutes-the-source-code-in-multimolecule","title":"1. What constitutes the \u201csource code\u201d in MultiMolecule?","text":"

We consider everything in our repositories to be source code.

The training process of machine learning models is viewed similarly to the compilation process of traditional software. As such, the model, code, configuration, documentation, and data used for training are all part of the source code, while the trained model weights are part of the object code.

We also consider research papers and manuscripts a special form of documentation, which are also part of the source code.

"},{"location":"about/license-faq/#2-can-i-publish-research-papers-using-multimolecule","title":"2. Can I publish research papers using MultiMolecule?","text":"

Since research papers are considered a form of source code, publishers are legally required to open-source all materials on their server to comply with the License if they publish papers using MultiMolecule. This is generally impractical for most publishers.

As a special exemption under section 7 of the License, we grant permission to publish research papers using MultiMolecule in fully open access journals, conferences, or preprint servers that do not charge any fee from authors, provided all published manuscripts are made available under the GNU Free Documentation License (GFDL), or a Creative Commons license, or an OSI-approved license that permits the sharing of manuscripts.

As a special exemption under section 7 of the License, we grant permission to publish research papers using MultiMolecule in certain non-profit journals, conferences, or preprint servers. Currently, the non-profit journals, conferences, or preprint servers we allow include:

For publishing in closed access journals or conferences, you must obtain a separate license from us. This typically involves co-authorship, a fee to support the project, or both. Contact us at multimolecule@zyc.ai for more information.

While not mandatory, we recommend citing the MultiMolecule project in your research papers.

"},{"location":"about/license-faq/#3-can-i-use-multimolecule-for-commercial-purposes","title":"3. Can I use MultiMolecule for commercial purposes?","text":"

Yes, MultiMolecule can be used for commercial purposes under the License. However, you must open-source any modifications to the source code and make them available under the License.

If you prefer to use MultiMolecule for commercial purposes without open-sourcing your modifications, you must obtain a separate license from us. This typically involves a fee to support the project. Contact us at multimolecule@zyc.ai for further details.

"},{"location":"about/license-faq/#4-do-people-affiliated-with-certain-organizations-have-specific-license-terms","title":"4. Do people affiliated with certain organizations have specific license terms?","text":"

YES!

If you are affiliated with an organization that has a separate license agreement with us, you may be subject to different license terms. Please consult your organization\u2019s legal department to determine if you are subject to a separate license agreement.

Members of the following organizations automatically receive a non-transferable, non-sublicensable, and non-distributable MIT License to use MultiMolecule:

This special license is considered an additional term under section 7 of the License. It is not redistributable, and you are prohibited from creating any independent derivative works. Any modifications or derivative works based on this license are automatically considered derivative works of MultiMolecule and must comply with all the terms of the License. This ensures that third parties cannot bypass the license terms or create separate licenses from derivative works.

"},{"location":"about/license-faq/#5-how-can-i-use-multimolecule-if-my-organization-forbids-the-use-of-code-under-the-agpl-license","title":"5. How can I use MultiMolecule if my organization forbids the use of code under the AGPL License?","text":"

Some organizations, such as Google, have policies that prohibit the use of code under the AGPL License.

If you are affiliated with an organization that forbids the use of AGPL-licensed code, you must obtain a separate license from us. Contact us at multimolecule@zyc.ai for more information.

"},{"location":"about/license-faq/#6-can-i-use-multimolecule-if-i-am-a-federal-employee-of-the-united-states-government","title":"6. Can I use MultiMolecule if I am a federal employee of the United States Government?","text":"

No.

Code written by federal employees of the United States Government is not protected by copyright under 17 U.S. Code \u00a7 105.

As a result, federal employees of the United States Government cannot comply with the terms of the License.

"},{"location":"about/license-faq/#7-do-we-make-updates-to-this-faq","title":"7. Do we make updates to this FAQ?","text":"

In Short

Yes, we will update this FAQ as necessary to stay compliant with relevant laws.

We may update this license FAQ from time to time. The updated version will be indicated by an updated \u2018Last Revised Time\u2019 at the bottom of this license FAQ. If we make any material changes, we will notify you by posting the new license FAQ on this page. We are unable to notify you directly as we do not collect any contact information from you. We encourage you to review this license FAQ frequently to stay informed of how you can use our data, models, code, configuration, documentation, and weights.

"},{"location":"about/license/","title":"GNU AFFERO GENERAL PUBLIC LICENSE","text":"

Version 3, 19 November 2007

Copyright (C) 2007 Free Software Foundation, Inc. https://fsf.org/

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

"},{"location":"about/license/#preamble","title":"Preamble","text":"

The GNU Affero General Public License is a free, copyleft license for software and other kinds of works, specifically designed to ensure cooperation with the community in the case of network server software.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, our General Public Licenses are intended to guarantee your freedom to share and change all versions of a program\u2013to make sure it remains free software for all its users.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

Developers that use our General Public Licenses protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License which gives you legal permission to copy, distribute and/or modify the software.

A secondary benefit of defending all users\u2019 freedom is that improvements made in alternate versions of the program, if they receive widespread use, become available for other developers to incorporate. Many developers of free software are heartened and encouraged by the resulting cooperation. However, in the case of software used on network servers, this result may fail to come about. The GNU General Public License permits making a modified version and letting the public access it on a server without ever releasing its source code to the public.

The GNU Affero General Public License is designed specifically to ensure that, in such cases, the modified source code becomes available to the community. It requires the operator of a network server to provide the source code of the modified version running there to the users of that server. Therefore, public use of a modified version, on a publicly accessible server, gives the public access to the source code of the modified version.

An older license, called the Affero General Public License and published by Affero, was designed to accomplish similar goals. This is a different license, not a version of the Affero GPL, but Affero has released a new version of the Affero GPL which permits relicensing under this license.

The precise terms and conditions for copying, distribution and modification follow.

"},{"location":"about/license/#terms-and-conditions","title":"TERMS AND CONDITIONS","text":""},{"location":"about/license/#0-definitions","title":"0. Definitions.","text":"

\u201cThis License\u201d refers to version 3 of the GNU Affero General Public License.

\u201cCopyright\u201d also means copyright-like laws that apply to other kinds of works, such as semiconductor masks.

\u201cThe Program\u201d refers to any copyrightable work licensed under this License. Each licensee is addressed as \u201cyou\u201d. \u201cLicensees\u201d and \u201crecipients\u201d may be individuals or organizations.

To \u201cmodify\u201d a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a \u201cmodified version\u201d of the earlier work or a work \u201cbased on\u201d the earlier work.

A \u201ccovered work\u201d means either the unmodified Program or a work based on the Program.

To \u201cpropagate\u201d a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well.

To \u201cconvey\u201d a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying.

An interactive user interface displays \u201cAppropriate Legal Notices\u201d to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion.

"},{"location":"about/license/#1-source-code","title":"1. Source Code.","text":"

The \u201csource code\u201d for a work means the preferred form of the work for making modifications to it. \u201cObject code\u201d means any non-source form of a work.

A \u201cStandard Interface\u201d means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language.

The \u201cSystem Libraries\u201d of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A \u201cMajor Component\u201d, in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it.

The \u201cCorresponding Source\u201d for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work\u2019s System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work.

The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source.

The Corresponding Source for a work in source code form is that same work.

"},{"location":"about/license/#2-basic-permissions","title":"2. Basic Permissions.","text":"

All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law.

You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you.

Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary.

"},{"location":"about/license/#3-protecting-users-legal-rights-from-anti-circumvention-law","title":"3. Protecting Users\u2019 Legal Rights From Anti-Circumvention Law.","text":"

No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures.

When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work\u2019s users, your or third parties\u2019 legal rights to forbid circumvention of technological measures.

"},{"location":"about/license/#4-conveying-verbatim-copies","title":"4. Conveying Verbatim Copies.","text":"

You may convey verbatim copies of the Program\u2019s source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program.

You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee.

"},{"location":"about/license/#5-conveying-modified-source-versions","title":"5. Conveying Modified Source Versions.","text":"

You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:

A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an \u201caggregate\u201d if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation\u2019s users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate.

"},{"location":"about/license/#6-conveying-non-source-forms","title":"6. Conveying Non-Source Forms.","text":"

You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways:

A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work.

A \u201cUser Product\u201d is either (1) a \u201cconsumer product\u201d, which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, \u201cnormally used\u201d refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product.

\u201cInstallation Information\u201d for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made.

If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM).

The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network.

Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying.

"},{"location":"about/license/#7-additional-terms","title":"7. Additional Terms.","text":"

\u201cAdditional permissions\u201d are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions.

When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission.

Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms:

All other non-permissive additional terms are considered \u201cfurther restrictions\u201d within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying.

If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms.

Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way.

"},{"location":"about/license/#8-termination","title":"8. Termination.","text":"

You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11).

However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation.

Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice.

Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10.

"},{"location":"about/license/#9-acceptance-not-required-for-having-copies","title":"9. Acceptance Not Required for Having Copies.","text":"

You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so.

"},{"location":"about/license/#10-automatic-licensing-of-downstream-recipients","title":"10. Automatic Licensing of Downstream Recipients.","text":"

Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License.

An \u201centity transaction\u201d is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party\u2019s predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts.

You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it.

"},{"location":"about/license/#11-patents","title":"11. Patents.","text":"

A \u201ccontributor\u201d is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor\u2019s \u201ccontributor version\u201d.

A contributor\u2019s \u201cessential patent claims\u201d are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, \u201ccontrol\u201d includes the right to grant patent sublicenses in a manner consistent with the requirements of this License.

Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor\u2019s essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version.

In the following three paragraphs, a \u201cpatent license\u201d is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To \u201cgrant\u201d such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party.

If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. \u201cKnowingly relying\u201d means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient\u2019s use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid.

If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it.

A patent license is \u201cdiscriminatory\u201d if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007.

Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law.

"},{"location":"about/license/#12-no-surrender-of-others-freedom","title":"12. No Surrender of Others\u2019 Freedom.","text":"

If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program.

"},{"location":"about/license/#13-remote-network-interaction-use-with-the-gnu-general-public-license","title":"13. Remote Network Interaction; Use with the GNU General Public License.","text":"

Notwithstanding any other provision of this License, if you modify the Program, your modified version must prominently offer all users interacting with it remotely through a computer network (if your version supports such interaction) an opportunity to receive the Corresponding Source of your version by providing access to the Corresponding Source from a network server at no charge, through some standard or customary means of facilitating copying of software. This Corresponding Source shall include the Corresponding Source for any work covered by version 3 of the GNU General Public License that is incorporated pursuant to the following paragraph.

Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the work with which it is combined will remain governed by version 3 of the GNU General Public License.

"},{"location":"about/license/#14-revised-versions-of-this-license","title":"14. Revised Versions of this License.","text":"

The Free Software Foundation may publish revised and/or new versions of the GNU Affero General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU Affero General Public License \u201cor any later version\u201d applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU Affero General Public License, you may choose any version ever published by the Free Software Foundation.

If the Program specifies that a proxy can decide which future versions of the GNU Affero General Public License can be used, that proxy\u2019s public statement of acceptance of a version permanently authorizes you to choose that version for the Program.

Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version.

"},{"location":"about/license/#15-disclaimer-of-warranty","title":"15. Disclaimer of Warranty.","text":"

THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM \u201cAS IS\u201d WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

"},{"location":"about/license/#16-limitation-of-liability","title":"16. Limitation of Liability.","text":"

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

"},{"location":"about/license/#17-interpretation-of-sections-15-and-16","title":"17. Interpretation of Sections 15 and 16.","text":"

If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee.

END OF TERMS AND CONDITIONS

"},{"location":"about/license/#how-to-apply-these-terms-to-your-new-programs","title":"How to Apply These Terms to Your New Programs","text":"

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the \u201ccopyright\u201d line and a pointer to where the full notice is found.

Text Only
    <one line to give the program's name and a brief idea of what it does.>\n    Copyright (C) <year>  <name of author>\n\n    This program is free software: you can redistribute it and/or modify\n    it under the terms of the GNU Affero General Public License as\n    published by the Free Software Foundation, either version 3 of the\n    License, or (at your option) any later version.\n\n    This program is distributed in the hope that it will be useful,\n    but WITHOUT ANY WARRANTY; without even the implied warranty of\n    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n    GNU Affero General Public License for more details.\n\n    You should have received a copy of the GNU Affero General Public License\n    along with this program.  If not, see <https://www.gnu.org/licenses/>.\n

Also add information on how to contact you by electronic and paper mail.

If your software can interact with users remotely through a computer network, you should also make sure that it provides a way for users to get its source. For example, if your program is a web application, its interface could display a \u201cSource\u201d link that leads users to an archive of the code. There are many ways you could offer source, and different solutions will be better for different programs; see section 13 for the specific requirements.

You should also get your employer (if you work as a programmer) or school, if any, to sign a \u201ccopyright disclaimer\u201d for the program, if necessary. For more information on this, and how to apply and follow the GNU AGPL, see https://www.gnu.org/licenses/.

"},{"location":"data/","title":"data","text":"

data provides a collection of utilities for processing data.

While datasets is a powerful library for managing datasets, it is a general-purpose tool and may not cover every need of scientific applications.

The data package is designed to complement datasets by offering additional data processing utilities that are commonly used in scientific tasks.

"},{"location":"data/#usage","title":"Usage","text":""},{"location":"data/#load-from-local-data-file","title":"Load from local data file","text":"Python
from multimolecule.data import Dataset\n\ndata = Dataset(\"data/rna/5utr.csv\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"data/#load-from-datasets","title":"Load from datasets","text":"Python
from multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
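Both usage examples pass a string as the first argument. In the build_table source listed on the Dataset page, a string is first handed to datasets.load_dataset and, if that raises FileNotFoundError, read as a local file. The block below is a minimal sketch of that fallback pattern only; resolve_source, fake_remote, and fake_local are hypothetical names introduced for illustration and are not part of multimolecule.

```python
# Hypothetical sketch, not the library's code: resolve a string source by
# trying a remote loader first and falling back to a local file loader.

def resolve_source(source, load_remote, load_local):
    # load_remote and load_local are caller-supplied callables (assumptions
    # standing in for a Hub loader and a local file reader).
    try:
        return 'remote', load_remote(source)
    except FileNotFoundError:
        # The remote lookup found nothing, so treat the string as a path.
        return 'local', load_local(source)


def fake_remote(name):
    # Pretend only one dataset tag exists remotely.
    if name != 'multimolecule/bprna-spot':
        raise FileNotFoundError(name)
    return [{'sequence': 'ACGU'}]


def fake_local(path):
    # Pretend every path can be read from disk.
    return [{'sequence': 'GGCC', 'path': path}]


kind_a, rows_a = resolve_source('multimolecule/bprna-spot', fake_remote, fake_local)
kind_b, rows_b = resolve_source('data/rna/5utr.csv', fake_remote, fake_local)
print(kind_a, kind_b)
```

Passing the loaders in as arguments keeps the sketch runnable without network access; the real class calls its loaders directly.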
"},{"location":"data/dataset/","title":"Dataset","text":""},{"location":"data/dataset/#multimolecule.data.Dataset","title":"multimolecule.data.Dataset","text":"

Bases: Dataset

The base class for all datasets.

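Dataset is a subclass of datasets.Dataset that provides additional functionality for handling structured data. It has three main features: column identification (identifying the special sequence and structure columns in the dataset), tokenization (tokenizing the sequence columns using a pretrained tokenizer), and task inference (inferring the task type and level of each label column).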
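The column renaming described under Parameters (auto_rename_sequence_col, auto_rename_label_col, column_names_map) can be sketched roughly as follows, based on the post method in the source listing. This is an illustrative approximation rather than the library's implementation: build_column_names_map is a hypothetical helper, and the target names 'sequence' and 'labels' are assumed stand-ins for multimolecule.defaults.SEQUENCE_COL_NAME and multimolecule.defaults.LABEL_COL_NAME, whose actual values are not shown here.

```python
# Hypothetical sketch of the auto-rename rules, not the library's code.
# 'sequence' and 'labels' are assumed stand-ins for
# multimolecule.defaults.SEQUENCE_COL_NAME and LABEL_COL_NAME.

def build_column_names_map(sequence_cols, label_cols,
                           auto_rename_sequence_col=True,
                           auto_rename_label_col=False,
                           column_names_map=None,
                           sequence_name='sequence',
                           label_name='labels'):
    # Copy so the caller's mapping is not mutated.
    mapping = dict(column_names_map or {})
    if auto_rename_sequence_col:
        # Automatic renaming is only defined for a single sequence column.
        if len(sequence_cols) != 1:
            raise ValueError('exactly one sequence column is required')
        mapping[sequence_cols[0]] = sequence_name
    if auto_rename_label_col:
        # Likewise, exactly one label column must be present.
        if len(label_cols) != 1:
            raise ValueError('exactly one label column is required')
        mapping[label_cols[0]] = label_name
    return mapping


renames = build_column_names_map(['utr'], ['rl'], auto_rename_label_col=True)
print(renames)
```

The sketch copies the incoming mapping before adding entries, whereas the listed source writes into the mapping it is given; with more than one sequence column, pass auto_rename_sequence_col=False and supply column_names_map explicitly.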

Attributes:

tasks (NestedDict): A nested dictionary of the inferred tasks for each label column in the dataset.

tokenizer (PreTrainedTokenizerBase): The pretrained tokenizer to use for tokenization.

truncation (bool): Whether to truncate sequences that exceed the maximum length of the tokenizer.

max_seq_length (int): The maximum length of the input sequences.

data_cols (List): The names of all columns in the dataset.

feature_cols (List): The names of the feature columns in the dataset.

label_cols (List): The names of the label columns in the dataset.

sequence_cols (List): The names of the sequence columns in the dataset.

column_names_map (Mapping[str, str] | None): A mapping of column names to new column names.

preprocess (bool): Whether to preprocess the dataset.

Parameters:

data (Table | DataFrame | dict | list | str, required): The dataset. This can be a path to a file, a tag on the Hugging Face Hub, a pyarrow.Table, a dict, a list, or a pandas.DataFrame.

split (NamedSplit, required): The split of the dataset.

tokenizer (PreTrainedTokenizerBase | None, default None): A pretrained tokenizer to use for tokenization. Either tokenizer or pretrained must be specified.

pretrained (str | None, default None): The name of a pretrained tokenizer to use for tokenization. Either tokenizer or pretrained must be specified.

feature_cols (List | None, default None): The names of the feature columns in the dataset. Will be inferred automatically if not specified.

label_cols (List | None, default None): The names of the label columns in the dataset. Will be inferred automatically if not specified.

id_cols (List | None, default None): The names of the ID columns in the dataset. Will be inferred automatically if not specified.

preprocess (bool | None, default None): Whether to preprocess the dataset. Preprocessing involves pre-tokenizing the sequences using the tokenizer. Defaults to True.

auto_rename_sequence_col (bool | None, default None): Whether to automatically rename the sequence column to the standard name. Only works when there is exactly one sequence column. You can control the naming through multimolecule.defaults.SEQUENCE_COL_NAME. For more refined control, use column_names_map.

auto_rename_label_col (bool | None, default None): Whether to automatically rename the label column to the standard name. Only works when there is exactly one label column. You can control the naming through multimolecule.defaults.LABEL_COL_NAME. For more refined control, use column_names_map.

column_names_map (Mapping[str, str] | None, default None): A mapping of column names to new column names. This is useful for renaming columns to inputs that are expected by a model. Defaults to None.

truncation (bool | None, default None): Whether to truncate sequences that exceed the maximum length of the tokenizer. Defaults to False.

max_seq_length (int | None, default None): The maximum length of the input sequences. Defaults to the model_max_length of the tokenizer.

tasks (Mapping[str, Task] | None, default None): A mapping of column names to tasks. Will be inferred automatically if not specified.

discrete_map (Mapping[str, int] | None, default None): A mapping of column names to discrete mappings. This is useful for mapping the raw value to nominal value in classification tasks. Will be inferred automatically if not specified.

nan_process (str, default 'ignore'): How to handle NaN and inf values in the dataset. Can be \u201cignore\u201d, \u201cerror\u201d, \u201cdrop\u201d, or \u201cfill\u201d. Defaults to \u201cignore\u201d.

fill_value (str | int | float, default 0): The value to fill NaN and inf values with. Defaults to 0.

info (DatasetInfo | None, default None): The dataset info.

indices_table (Table | None, default None): The indices table.

fingerprint (str | None, default None): The fingerprint of the dataset.

Source code in multimolecule/data/dataset.py Python
class Dataset(datasets.Dataset):\n    r\"\"\"\n    The base class for all datasets.\n\n    Dataset is a subclass of [`datasets.Dataset`][] that provides additional functionality for handling structured data.\n    It has three main features:\n\n    - column identification: identify the special columns (sequence and structure columns) in the dataset.\n    - tokenization: tokenize the sequence columns in the dataset using a pretrained tokenizer.\n    - task inference: infer the task type and level of each label column in the dataset.\n\n    Attributes:\n        tasks: A nested dictionary of the inferred tasks for each label column in the dataset.\n        tokenizer: The pretrained tokenizer to use for tokenization.\n        truncation: Whether to truncate sequences that exceed the maximum length of the tokenizer.\n        max_seq_length: The maximum length of the input sequences.\n        data_cols: The names of all columns in the dataset.\n        feature_cols: The names of the feature columns in the dataset.\n        label_cols: The names of the label columns in the dataset.\n        sequence_cols: The names of the sequence columns in the dataset.\n        column_names_map: A mapping of column names to new column names.\n        preprocess: Whether to preprocess the dataset.\n\n    Args:\n        data: The dataset. 
This can be a path to a file, a tag on the Hugging Face Hub, a pyarrow.Table,\n            a [dict][], a [list][], or a [pandas.DataFrame][].\n        split: The split of the dataset.\n        tokenizer: A pretrained tokenizer to use for tokenization.\n            Either `tokenizer` or `pretrained` must be specified.\n        pretrained: The name of a pretrained tokenizer to use for tokenization.\n            Either `tokenizer` or `pretrained` must be specified.\n        feature_cols: The names of the feature columns in the dataset.\n            Will be inferred automatically if not specified.\n        label_cols: The names of the label columns in the dataset.\n            Will be inferred automatically if not specified.\n        id_cols: The names of the ID columns in the dataset.\n            Will be inferred automatically if not specified.\n        preprocess: Whether to preprocess the dataset.\n            Preprocessing involves pre-tokenizing the sequences using the tokenizer.\n            Defaults to `True`.\n        auto_rename_sequence_col: Whether to automatically rename sequence columns to standard name.\n            Only works when there is exactly one sequence column\n            You can control the naming through `multimolecule.defaults.SEQUENCE_COL_NAME`.\n            For more refined control, use `column_names_map`.\n        auto_rename_label_cols: Whether to automatically rename label column to standard name.\n            Only works when there is exactly one label column.\n            You can control the naming through `multimolecule.defaults.LABEL_COL_NAME`.\n            For more refined control, use `column_names_map`.\n        column_names_map: A mapping of column names to new column names.\n            This is useful for renaming columns to inputs that are expected by a model.\n            Defaults to `None`.\n        truncation: Whether to truncate sequences that exceed the maximum length of the tokenizer.\n            Defaults to `False`.\n    
    max_seq_length: The maximum length of the input sequences.\n            Defaults to the `model_max_length` of the tokenizer.\n        tasks: A mapping of column names to tasks.\n            Will be inferred automatically if not specified.\n        discrete_map: A mapping of column names to discrete mappings.\n            This is useful for mapping the raw value to nominal value in classification tasks.\n            Will be inferred automatically if not specified.\n        nan_process: How to handle NaN and inf values in the dataset.\n            Can be \"ignore\", \"error\", \"drop\", or \"fill\". Defaults to \"ignore\".\n        fill_value: The value to fill NaN and inf values with.\n            Defaults to 0.\n        info: The dataset info.\n        indices_table: The indices table.\n        fingerprint: The fingerprint of the dataset.\n    \"\"\"\n\n    tokenizer: PreTrainedTokenizerBase\n    truncation: bool = False\n    max_seq_length: int\n    seq_length_offset: int = 0\n\n    _id_cols: List\n    _feature_cols: List\n    _label_cols: List\n\n    _sequence_cols: List\n    _secondary_structure_cols: List\n\n    _tasks: NestedDict[str, Task]\n    _discrete_map: Mapping\n\n    preprocess: bool = True\n    auto_rename_sequence_col: bool = True\n    auto_rename_label_col: bool = False\n    column_names_map: Mapping[str, str] | None = None\n    ignored_cols: List[str] = []\n\n    def __init__(\n        self,\n        data: Table | DataFrame | dict | list | str,\n        split: datasets.NamedSplit,\n        tokenizer: PreTrainedTokenizerBase | None = None,\n        pretrained: str | None = None,\n        feature_cols: List | None = None,\n        label_cols: List | None = None,\n        id_cols: List | None = None,\n        preprocess: bool | None = None,\n        auto_rename_sequence_col: bool | None = None,\n        auto_rename_label_col: bool | None = None,\n        column_names_map: Mapping[str, str] | None = None,\n        truncation: bool | None = None,\n  
      max_seq_length: int | None = None,\n        tasks: Mapping[str, Task] | None = None,\n        discrete_map: Mapping[str, int] | None = None,\n        nan_process: str = \"ignore\",\n        fill_value: str | int | float = 0,\n        info: datasets.DatasetInfo | None = None,\n        indices_table: Table | None = None,\n        fingerprint: str | None = None,\n        ignored_cols: List[str] | None = None,\n    ):\n        self._tasks = NestedDict()\n        if tasks is not None:\n            self.tasks = tasks\n        if discrete_map is not None:\n            self._discrete_map = discrete_map\n        arrow_table = self.build_table(\n            data, split, feature_cols, label_cols, nan_process=nan_process, fill_value=fill_value\n        )\n        super().__init__(\n            arrow_table=arrow_table, split=split, info=info, indices_table=indices_table, fingerprint=fingerprint\n        )\n        self.identify_special_cols(feature_cols=feature_cols, label_cols=label_cols, id_cols=id_cols)\n        self.post(\n            tokenizer=tokenizer,\n            pretrained=pretrained,\n            preprocess=preprocess,\n            truncation=truncation,\n            max_seq_length=max_seq_length,\n            auto_rename_sequence_col=auto_rename_sequence_col,\n            auto_rename_label_col=auto_rename_label_col,\n            column_names_map=column_names_map,\n        )\n        self.ignored_cols = ignored_cols or self.id_cols\n        self.train = split == datasets.Split.TRAIN\n\n    def build_table(\n        self,\n        data: Table | DataFrame | dict | str,\n        split: datasets.NamedSplit,\n        feature_cols: List | None = None,\n        label_cols: List | None = None,\n        nan_process: str | None = \"ignore\",\n        fill_value: str | int | float = 0,\n    ) -> datasets.table.Table:\n        if isinstance(data, str):\n            try:\n                data = datasets.load_dataset(data, split=split).data\n            except 
FileNotFoundError:\n                data = dl.load_pandas(data)\n                if isinstance(data, DataFrame):\n                    data = data.loc[:, ~data.columns.str.contains(\"^Unnamed\")]\n                    data = pa.Table.from_pandas(data, preserve_index=False)\n        elif isinstance(data, dict):\n            data = pa.Table.from_pydict(data)\n        elif isinstance(data, list):\n            data = pa.Table.from_pylist(data)\n        elif isinstance(data, DataFrame):\n            data = pa.Table.from_pandas(data, preserve_index=False)\n        if feature_cols is not None and label_cols is not None:\n            data = data.select(feature_cols + label_cols)\n        data = self.process_nan(data, nan_process=nan_process, fill_value=fill_value)\n        return data\n\n    def post(\n        self,\n        tokenizer: PreTrainedTokenizerBase | None = None,\n        pretrained: str | None = None,\n        max_seq_length: int | None = None,\n        truncation: bool | None = None,\n        preprocess: bool | None = None,\n        auto_rename_sequence_col: bool | None = None,\n        auto_rename_label_col: bool | None = None,\n        column_names_map: Mapping[str, str] | None = None,\n    ) -> None:\n        r\"\"\"\n        Perform pre-processing steps after initialization.\n\n        It first identifies the special columns (sequence and structure columns) in the dataset.\n        Then it sets the feature and label columns based on the input arguments.\n        If `auto_rename_sequence_col` is `True`, it will automatically rename the sequence column.\n        If `auto_rename_label_col` is `True`, it will automatically rename the label column.\n        Finally, it sets the [`transform`][datasets.Dataset.set_transform] function based on the `preprocess` flag.\n        \"\"\"\n        if tokenizer is None:\n            if pretrained is None:\n                raise ValueError(\"tokenizer and pretrained can not be both None.\")\n            tokenizer = 
AutoTokenizer.from_pretrained(pretrained)\n        if max_seq_length is None:\n            max_seq_length = tokenizer.model_max_length\n        else:\n            tokenizer.model_max_length = max_seq_length\n        self.tokenizer = tokenizer\n        self.max_seq_length = max_seq_length\n        if truncation is not None:\n            self.truncation = truncation\n        if self.tokenizer.cls_token is not None:\n            self.seq_length_offset += 1\n        if self.tokenizer.sep_token is not None and self.tokenizer.sep_token != self.tokenizer.eos_token:\n            self.seq_length_offset += 1\n        if self.tokenizer.eos_token is not None:\n            self.seq_length_offset += 1\n        if preprocess is not None:\n            self.preprocess = preprocess\n        if auto_rename_sequence_col is not None:\n            self.auto_rename_sequence_col = auto_rename_sequence_col\n        if auto_rename_label_col is not None:\n            self.auto_rename_label_col = auto_rename_label_col\n        if column_names_map is None:\n            column_names_map = {}\n        if self.auto_rename_sequence_col:\n            if len(self.sequence_cols) != 1:\n                raise ValueError(\"auto_rename_sequence_col can only be used when there is exactly one sequence column.\")\n            column_names_map[self.sequence_cols[0]] = defaults.SEQUENCE_COL_NAME  # type: ignore[index]\n        if self.auto_rename_label_col:\n            if len(self.label_cols) != 1:\n                raise ValueError(\"auto_rename_label_col can only be used when there is exactly one label column.\")\n            column_names_map[self.label_cols[0]] = defaults.LABEL_COL_NAME  # type: ignore[index]\n        self.column_names_map = column_names_map\n        if self.column_names_map:\n            self.rename_columns(self.column_names_map)\n        self.infer_tasks()\n\n        if self.preprocess:\n            self.update(self.map(self.tokenization))\n            if self.secondary_structure_cols:\n 
               self.update(self.map(self.convert_secondary_structure))\n            if self.discrete_map:\n                self.update(self.map(self.map_discrete))\n            fn_kwargs = {\n                \"columns\": [name for name, task in self.tasks.items() if task.level in [\"token\", \"contact\"]],\n                \"max_seq_length\": self.max_seq_length - self.seq_length_offset,\n            }\n            if self.truncation and 0 < self.max_seq_length < 2**32:\n                self.update(self.map(self.truncate, fn_kwargs=fn_kwargs))\n        self.set_transform(self.transform)\n\n    def transform(self, batch: Mapping) -> Mapping:\n        r\"\"\"\n        Default [`transform`][datasets.Dataset.set_transform].\n\n        See Also:\n            [`collate`][multimolecule.Dataset.collate]\n        \"\"\"\n        return {k: self.collate(k, v) for k, v in batch.items() if k not in self.ignored_cols}\n\n    def collate(self, col: str, data: Any) -> Tensor | NestedTensor | None:\n        r\"\"\"\n        Collate the data for a column.\n\n        If the column is a sequence column, it will tokenize the data if `tokenize` is `True`.\n        Otherwise, it will return a tensor or nested tensor.\n        \"\"\"\n        if col in self.sequence_cols:\n            if isinstance(data[0], str):\n                data = self.tokenize(data)\n            return NestedTensor(data)\n        if not self.preprocess:\n            if col in self.discrete_map:\n                data = map_value(data, self.discrete_map[col])\n            if col in self.tasks:\n                data = truncate_value(data, self.max_seq_length - self.seq_length_offset, self.tasks[col].level)\n        if isinstance(data[0], str):\n            return data\n        try:\n            return torch.tensor(data)\n        except ValueError:\n            return NestedTensor(data)\n\n    def infer_tasks(self, sequence_col: str | None = None) -> NestedDict:\n        for col in self.label_cols:\n            if col 
in self.tasks:\n                continue\n            if col in self.secondary_structure_cols:\n                task = Task(TaskType.Binary, level=TaskLevel.Contact, num_labels=1)\n                self.tasks[col] = task  # type: ignore[index]\n                warn(\n                    f\"Secondary structure columns are assumed to be {task}. \"\n                    \"Please explicitly specify the task if this is not the case.\"\n                )\n            else:\n                try:\n                    self.tasks[col] = self.infer_task(col, sequence_col)  # type: ignore[index]\n                except ValueError:\n                    raise ValueError(f\"Unable to infer task for column {col}.\")\n        return self.tasks\n\n    def infer_task(self, label_col: str, sequence_col: str | None = None) -> Task:\n        if sequence_col is None:\n            if len(self.sequence_cols) != 1:\n                raise ValueError(\"sequence_col must be specified if there are multiple sequence columns.\")\n            sequence_col = self.sequence_cols[0]\n        sequence = self._data.column(sequence_col)\n        column = self._data.column(label_col)\n        return infer_task(\n            sequence,\n            column,\n            truncation=self.truncation,\n            max_seq_length=self.max_seq_length,\n            seq_length_offset=self.seq_length_offset,\n        )\n\n    def infer_discrete_map(self, discrete_map: Mapping | None = None):\n        self._discrete_map = discrete_map or NestedDict()\n        ignored_cols = set(self.discrete_map.keys()) | set(self.sequence_cols) | set(self.secondary_structure_cols)\n        data_cols = [i for i in self.data_cols if i not in ignored_cols]\n        for col in data_cols:\n            discrete_map = infer_discrete_map(self._data.column(col))\n            if discrete_map:\n                self._discrete_map[col] = discrete_map  # type: ignore[index]\n        return self._discrete_map\n\n    def __getitems__(self, keys: int | 
slice | Iterable[int]) -> Any:\n        return self.__getitem__(keys)\n\n    def identify_special_cols(\n        self, feature_cols: List | None = None, label_cols: List | None = None, id_cols: List | None = None\n    ) -> Sequence:\n        all_cols = self.data.column_names\n        self._id_cols = id_cols or [i for i in all_cols if i in defaults.ID_COL_NAMES]\n\n        string_cols: list[str] = [k for k, v in self.features.items() if k not in self.id_cols and v.dtype == \"string\"]\n        self._sequence_cols = [i for i in string_cols if i.lower() in defaults.SEQUENCE_COL_NAMES]\n        self._secondary_structure_cols = [i for i in string_cols if i in defaults.SECONDARY_STRUCTURE_COL_NAMES]\n\n        data_cols = [i for i in all_cols if i not in self.id_cols]\n        if label_cols is None:\n            if feature_cols is None:\n                feature_cols = [i for i in data_cols if i in defaults.SEQUENCE_COL_NAMES]\n            label_cols = [i for i in data_cols if i not in feature_cols]\n        self._label_cols = label_cols\n        if feature_cols is None:\n            feature_cols = [i for i in data_cols if i not in self.label_cols]\n        self._feature_cols = feature_cols\n        missing_feature_cols = set(self.feature_cols).difference(data_cols)\n        if missing_feature_cols:\n            raise ValueError(f\"{missing_feature_cols} are specified in feature_cols, but not found in dataset.\")\n        missing_label_cols = set(self.label_cols).difference(data_cols)\n        if missing_label_cols:\n            raise ValueError(f\"{missing_label_cols} are specified in label_cols, but not found in dataset.\")\n        return string_cols\n\n    def tokenize(self, string: str) -> Tensor:\n        return self.tokenizer(string, return_attention_mask=False, truncation=self.truncation)[\"input_ids\"]\n\n    def tokenization(self, data: Mapping[str, str]) -> Mapping[str, Tensor]:\n        return {col: self.tokenize(data[col]) for col in self.sequence_cols}\n\n   
 def convert_secondary_structure(self, data: Mapping) -> Mapping:\n        return {col: dot_bracket_to_contact_map(data[col]) for col in self.secondary_structure_cols}\n\n    def map_discrete(self, data: Mapping) -> Mapping:\n        return {name: map_value(data[name], mapping) for name, mapping in self.discrete_map.items()}\n\n    def truncate(self, data: Mapping, columns: List[str], max_seq_length: int) -> Mapping:\n        return {name: truncate_value(data[name], max_seq_length, self.tasks[name].level) for name in columns}\n\n    def update(self, dataset: datasets.Dataset):\n        r\"\"\"\n        Perform an in-place update of the dataset.\n\n        This method is used to update the dataset after changes have been made to the underlying data.\n        It updates the format columns, data, info, and fingerprint of the dataset.\n        \"\"\"\n        # pylint: disable=W0212\n        # Why datasets won't support in-place changes?\n        # It's just impossible to extend.\n        self._format_columns = dataset._format_columns\n        self._data = dataset._data\n        self._info = dataset._info\n        self._fingerprint = dataset._fingerprint\n\n    def rename_columns(self, column_mapping: Mapping[str, str], new_fingerprint: str | None = None) -> datasets.Dataset:\n        self.update(super().rename_columns(column_mapping, new_fingerprint=new_fingerprint))\n        self._id_cols = [column_mapping.get(i, i) for i in self.id_cols]\n        self._feature_cols = [column_mapping.get(i, i) for i in self.feature_cols]\n        self._label_cols = [column_mapping.get(i, i) for i in self.label_cols]\n        self._sequence_cols = [column_mapping.get(i, i) for i in self.sequence_cols]\n        self._secondary_structure_cols = [column_mapping.get(i, i) for i in self.secondary_structure_cols]\n        self.tasks = {column_mapping.get(k, k): v for k, v in self.tasks.items()}\n        return self\n\n    def rename_column(\n        self, original_column_name: str, 
new_column_name: str, new_fingerprint: str | None = None\n    ) -> datasets.Dataset:\n        self.update(super().rename_column(original_column_name, new_column_name, new_fingerprint))\n        self._id_cols = [new_column_name if i == original_column_name else i for i in self.id_cols]\n        self._feature_cols = [new_column_name if i == original_column_name else i for i in self.feature_cols]\n        self._label_cols = [new_column_name if i == original_column_name else i for i in self.label_cols]\n        self._sequence_cols = [new_column_name if i == original_column_name else i for i in self.sequence_cols]\n        self._secondary_structure_cols = [\n            new_column_name if i == original_column_name else i for i in self.secondary_structure_cols\n        ]\n        self.tasks = {new_column_name if k == original_column_name else k: v for k, v in self.tasks.items()}\n        return self\n\n    def process_nan(self, data: Table, nan_process: str | None, fill_value: str | int | float = 0) -> Table:\n        if nan_process == \"ignore\":\n            return data\n        data = data.to_pandas()\n        data = data.replace([float(\"inf\"), -float(\"inf\")], float(\"nan\"))\n        if data.isnull().values.any():\n            if nan_process is None or nan_process == \"error\":\n                raise ValueError(\"NaN / inf values have been found in the dataset.\")\n            warn(\n                \"NaN / inf values have been found in the dataset.\\n\"\n                \"While we can handle them, the data type of the corresponding column may be set to float, \"\n                \"which can and very likely will disrupt the auto task recognition.\\n\"\n                \"It is recommended to address these values before loading the dataset.\"\n            )\n            if nan_process == \"drop\":\n                data = data.dropna()\n            elif nan_process == \"fill\":\n                data = data.fillna(fill_value)\n            else:\n                raise 
ValueError(f\"Invalid nan_process: {nan_process}\")\n        return pa.Table.from_pandas(data, preserve_index=False)\n\n    @property\n    def id_cols(self) -> List:\n        return self._id_cols\n\n    @property\n    def data_cols(self) -> List:\n        return self.feature_cols + self.label_cols\n\n    @property\n    def feature_cols(self) -> List:\n        return self._feature_cols\n\n    @property\n    def label_cols(self) -> List:\n        return self._label_cols\n\n    @property\n    def sequence_cols(self) -> List:\n        return self._sequence_cols\n\n    @property\n    def secondary_structure_cols(self) -> List:\n        return self._secondary_structure_cols\n\n    @property\n    def tasks(self) -> NestedDict:\n        if not hasattr(self, \"_tasks\"):\n            self._tasks = NestedDict()\n            return self.infer_tasks()\n        return self._tasks\n\n    @tasks.setter\n    def tasks(self, tasks: Mapping):\n        self._tasks = NestedDict()\n        for name, task in tasks.items():\n            if not isinstance(task, Task):\n                task = Task(**task)\n            self._tasks[name] = task\n\n    @property\n    def discrete_map(self) -> Mapping:\n        if not hasattr(self, \"_discrete_map\"):\n            return self.infer_discrete_map()\n        return self._discrete_map\n
"},{"location":"data/dataset/#multimolecule.data.Dataset(data)","title":"data","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(split)","title":"split","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(tokenizer)","title":"tokenizer","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(pretrained)","title":"pretrained","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(feature_cols)","title":"feature_cols","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(label_cols)","title":"label_cols","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(id_cols)","title":"id_cols","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(preprocess)","title":"preprocess","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(auto_rename_sequence_col)","title":"auto_rename_sequence_col","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(auto_rename_label_cols)","title":"auto_rename_label_cols","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(column_names_map)","title":"column_names_map","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(truncation)","title":"truncation","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(max_seq_length)","title":"max_seq_length","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(tasks)","title":"tasks","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(discrete_map)","title":"discrete_map","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(nan_process)","title":"nan_process","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(fill_value)","title":"fill_value","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(info)","title":"info","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(indices_table)","title":"indices_table","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(fingerprint)","title":"fin
gerprint","text":""},{"location":"data/dataset/#multimolecule.data.Dataset.post","title":"post","text":"Python
post(tokenizer: PreTrainedTokenizerBase | None = None, pretrained: str | None = None, max_seq_length: int | None = None, truncation: bool | None = None, preprocess: bool | None = None, auto_rename_sequence_col: bool | None = None, auto_rename_label_col: bool | None = None, column_names_map: Mapping[str, str] | None = None) -> None\n

Perform pre-processing steps after initialization.

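It resolves the tokenizer (loading it from pretrained when no tokenizer is given), sets max_seq_length, and counts the special tokens that the tokenizer adds to each sequence. If auto_rename_sequence_col is True, it renames the single sequence column to the default sequence column name; if auto_rename_label_col is True, it renames the single label column to the default label column name. Either option raises a ValueError unless exactly one such column exists. It then infers a task for every label column that does not have one yet. If preprocess is True, it tokenizes the sequence columns, converts secondary structure columns, maps discrete values, and, when truncation is enabled, truncates token-level and contact-level labels ahead of time. Finally, it sets transform as the dataset's transform function.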
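
The special-token accounting in post can be illustrated in isolation. The following is a minimal sketch, not the library's code: special_token_offset is a hypothetical helper name, the tokenizers are plain stand-in objects rather than real Hugging Face tokenizers, and the offset is assumed to start at zero.

```python
from types import SimpleNamespace


def special_token_offset(tokenizer) -> int:
    # Mirrors the three checks in post(): one slot for CLS,
    # one for a SEP that differs from EOS, and one for EOS.
    offset = 0
    if tokenizer.cls_token is not None:
        offset += 1
    if tokenizer.sep_token is not None and tokenizer.sep_token != tokenizer.eos_token:
        offset += 1
    if tokenizer.eos_token is not None:
        offset += 1
    return offset


# Hypothetical stand-ins; real tokenizers come from AutoTokenizer.from_pretrained.
cls_eos = SimpleNamespace(cls_token='<cls>', sep_token='<eos>', eos_token='<eos>')
bert_like = SimpleNamespace(cls_token='[CLS]', sep_token='[SEP]', eos_token=None)
bare = SimpleNamespace(cls_token=None, sep_token=None, eos_token=None)

max_seq_length = 512
# Number of positions left for the sequence itself after special tokens.
usable = {
    'cls_eos': max_seq_length - special_token_offset(cls_eos),
    'bert_like': max_seq_length - special_token_offset(bert_like),
    'bare': max_seq_length - special_token_offset(bare),
}
```

This is why token-level and contact-level labels are truncated to max_seq_length minus the offset rather than to max_seq_length: the labels must line up with the nucleotides that remain after the tokenizer reserves room for its special tokens.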

Source code in multimolecule/data/dataset.py Python
def post(\n    self,\n    tokenizer: PreTrainedTokenizerBase | None = None,\n    pretrained: str | None = None,\n    max_seq_length: int | None = None,\n    truncation: bool | None = None,\n    preprocess: bool | None = None,\n    auto_rename_sequence_col: bool | None = None,\n    auto_rename_label_col: bool | None = None,\n    column_names_map: Mapping[str, str] | None = None,\n) -> None:\n    r\"\"\"\n    Perform pre-processing steps after initialization.\n\n    It first identifies the special columns (sequence and structure columns) in the dataset.\n    Then it sets the feature and label columns based on the input arguments.\n    If `auto_rename_sequence_col` is `True`, it will automatically rename the sequence column.\n    If `auto_rename_label_col` is `True`, it will automatically rename the label column.\n    Finally, it sets the [`transform`][datasets.Dataset.set_transform] function based on the `preprocess` flag.\n    \"\"\"\n    if tokenizer is None:\n        if pretrained is None:\n            raise ValueError(\"tokenizer and pretrained can not be both None.\")\n        tokenizer = AutoTokenizer.from_pretrained(pretrained)\n    if max_seq_length is None:\n        max_seq_length = tokenizer.model_max_length\n    else:\n        tokenizer.model_max_length = max_seq_length\n    self.tokenizer = tokenizer\n    self.max_seq_length = max_seq_length\n    if truncation is not None:\n        self.truncation = truncation\n    if self.tokenizer.cls_token is not None:\n        self.seq_length_offset += 1\n    if self.tokenizer.sep_token is not None and self.tokenizer.sep_token != self.tokenizer.eos_token:\n        self.seq_length_offset += 1\n    if self.tokenizer.eos_token is not None:\n        self.seq_length_offset += 1\n    if preprocess is not None:\n        self.preprocess = preprocess\n    if auto_rename_sequence_col is not None:\n        self.auto_rename_sequence_col = auto_rename_sequence_col\n    if auto_rename_label_col is not None:\n        
self.auto_rename_label_col = auto_rename_label_col\n    if column_names_map is None:\n        column_names_map = {}\n    if self.auto_rename_sequence_col:\n        if len(self.sequence_cols) != 1:\n            raise ValueError(\"auto_rename_sequence_col can only be used when there is exactly one sequence column.\")\n        column_names_map[self.sequence_cols[0]] = defaults.SEQUENCE_COL_NAME  # type: ignore[index]\n    if self.auto_rename_label_col:\n        if len(self.label_cols) != 1:\n            raise ValueError(\"auto_rename_label_col can only be used when there is exactly one label column.\")\n        column_names_map[self.label_cols[0]] = defaults.LABEL_COL_NAME  # type: ignore[index]\n    self.column_names_map = column_names_map\n    if self.column_names_map:\n        self.rename_columns(self.column_names_map)\n    self.infer_tasks()\n\n    if self.preprocess:\n        self.update(self.map(self.tokenization))\n        if self.secondary_structure_cols:\n            self.update(self.map(self.convert_secondary_structure))\n        if self.discrete_map:\n            self.update(self.map(self.map_discrete))\n        fn_kwargs = {\n            \"columns\": [name for name, task in self.tasks.items() if task.level in [\"token\", \"contact\"]],\n            \"max_seq_length\": self.max_seq_length - self.seq_length_offset,\n        }\n        if self.truncation and 0 < self.max_seq_length < 2**32:\n            self.update(self.map(self.truncate, fn_kwargs=fn_kwargs))\n    self.set_transform(self.transform)\n
"},{"location":"data/dataset/#multimolecule.data.Dataset.transform","title":"transform","text":"Python
transform(batch: Mapping) -> Mapping\n

Default transform.

See Also

collate

Source code in multimolecule/data/dataset.py Python
def transform(self, batch: Mapping) -> Mapping:\n    r\"\"\"\n    Default [`transform`][datasets.Dataset.set_transform].\n\n    See Also:\n        [`collate`][multimolecule.Dataset.collate]\n    \"\"\"\n    return {k: self.collate(k, v) for k, v in batch.items() if k not in self.ignored_cols}\n
"},{"location":"data/dataset/#multimolecule.data.Dataset.collate","title":"collate","text":"Python
collate(col: str, data: Any) -> Tensor | NestedTensor | None\n

Collate the data for a column.

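If the column is a sequence column, any raw strings are tokenized first and the result is returned as a NestedTensor. For other columns, discrete mapping and truncation are applied here when preprocess is disabled. String data is then returned unchanged; everything else is converted to a tensor, falling back to a NestedTensor when the data cannot be converted to a regular tensor (for example, rows of different lengths).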

Source code in multimolecule/data/dataset.py Python
def collate(self, col: str, data: Any) -> Tensor | NestedTensor | None:\n    r\"\"\"\n    Collate the data for a column.\n\n    If the column is a sequence column, it will tokenize the data if `tokenize` is `True`.\n    Otherwise, it will return a tensor or nested tensor.\n    \"\"\"\n    if col in self.sequence_cols:\n        if isinstance(data[0], str):\n            data = self.tokenize(data)\n        return NestedTensor(data)\n    if not self.preprocess:\n        if col in self.discrete_map:\n            data = map_value(data, self.discrete_map[col])\n        if col in self.tasks:\n            data = truncate_value(data, self.max_seq_length - self.seq_length_offset, self.tasks[col].level)\n    if isinstance(data[0], str):\n        return data\n    try:\n        return torch.tensor(data)\n    except ValueError:\n        return NestedTensor(data)\n
"},{"location":"data/dataset/#multimolecule.data.Dataset.update","title":"update","text":"Python
update(dataset: Dataset)\n

Perform an in-place update of the dataset.

This method is used to update the dataset after changes have been made to the underlying data. It updates the format columns, data, info, and fingerprint of the dataset.
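
The pattern behind update can be shown with a toy example. This is a sketch under stated assumptions, not the datasets library: Frame and InPlaceFrame are hypothetical classes that only imitate an API whose operations return new objects, with a subclass that copies the result's private state back onto itself.

```python
class Frame:
    # Toy container in the style of an immutable API:
    # every operation returns a new object instead of mutating self.
    def __init__(self, data, fingerprint):
        self._data = data
        self._fingerprint = fingerprint

    def rename(self, old, new):
        data = {new if key == old else key: value for key, value in self._data.items()}
        return Frame(data, self._fingerprint + 1)


class InPlaceFrame(Frame):
    def update(self, other):
        # Copy the private state of the freshly built object onto self.
        self._data = other._data
        self._fingerprint = other._fingerprint

    def rename(self, old, new):
        # Delegate to the parent, then absorb the result in place.
        self.update(super().rename(old, new))
        return self


frame = InPlaceFrame({'seq': ['ACGU']}, fingerprint=0)
same = frame.rename('seq', 'sequence')
```

The trade-off is that the subclass depends on private attributes of the parent, which is why the original source disables the corresponding pylint warning.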

Source code in multimolecule/data/dataset.py Python
def update(self, dataset: datasets.Dataset):\n    r\"\"\"\n    Perform an in-place update of the dataset.\n\n    This method is used to update the dataset after changes have been made to the underlying data.\n    It updates the format columns, data, info, and fingerprint of the dataset.\n    \"\"\"\n    # pylint: disable=W0212\n    # Why datasets won't support in-place changes?\n    # It's just impossible to extend.\n    self._format_columns = dataset._format_columns\n    self._data = dataset._data\n    self._info = dataset._info\n    self._fingerprint = dataset._fingerprint\n
"},{"location":"datasets/","title":"datasets","text":"

datasets provides a collection of widely used datasets.

"},{"location":"datasets/#available-datasets","title":"Available Datasets","text":""},{"location":"datasets/#deoxyribonucleic-acid-dna","title":"DeoxyriboNucleic Acid (DNA)","text":""},{"location":"datasets/#ribonucleic-acid-rna","title":"RiboNucleic Acid (RNA)","text":""},{"location":"datasets/#usage","title":"Usage","text":""},{"location":"datasets/#load-with-multimolecule","title":"Load with MultiMolecule","text":"Python
from multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"datasets/archiveii/","title":"ArchiveII","text":"

ArchiveII is a dataset of RNA sequences and their secondary structures, widely used in RNA secondary structure prediction benchmarks.

ArchiveII contains 2975 RNA samples across 10 RNA families, with sequence lengths ranging from 28 to 2968 nucleotides. This dataset is frequently used to evaluate RNA secondary structure prediction methods, including those that handle both pseudoknotted and non-pseudoknotted structures.

It is considered complementary to the RNAStrAlign dataset.

"},{"location":"datasets/archiveii/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of ArchiveII by Mehdi Saman Booy et al.

The team releasing ArchiveII did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/archiveii/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/archiveii/#example-entry","title":"Example Entry","text":"id sequence secondary_structure family 16S_rRNA-A.fulgidus AUUCUGGUUGAUCCUGCCAGAGGCCGCUGCUA\u2026 \u2026(((((\u2026(((.))))).((((((((((.... 16S_rRNA"},{"location":"datasets/archiveii/#column-description","title":"Column Description","text":""},{"location":"datasets/archiveii/#variations","title":"Variations","text":"

This dataset is available in two additional variants:

"},{"location":"datasets/archiveii/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/archiveii/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/archiveii/#citation","title":"Citation","text":"BibTeX
@article{samanbooy2022rna,\n  author    = {Saman Booy, Mehdi and Ilin, Alexander and Orponen, Pekka},\n  journal   = {BMC Bioinformatics},\n  keywords  = {Deep learning; Pseudoknotted structures; RNA structure prediction},\n  month     = feb,\n  number    = 1,\n  pages     = {58},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {{RNA} secondary structure prediction with convolutional neural networks},\n  volume    = 23,\n  year      = 2022\n}\n
"},{"location":"datasets/bprna-new/","title":"bpRNA-1m","text":"

bpRNA-new is a database of single molecule secondary structures annotated using bpRNA.

bpRNA-new is a dataset of RNA families from Rfam 14.2, designed for cross-family validation to assess generalization capability. It focuses on families distinct from those in bpRNA-1m, providing a robust benchmark for evaluating model performance on unseen RNA families.

"},{"location":"datasets/bprna-new/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of bpRNA-new by Kengo Sato et al.

The team releasing bpRNA-new did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/bprna-new/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/bprna-new/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/bprna-new/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/bprna-new/#citation","title":"Citation","text":"BibTeX
@article{sato2021rna,\n  author    = {Sato, Kengo and Akiyama, Manato and Sakakibara, Yasubumi},\n  journal   = {Nature Communications},\n  month     = feb,\n  number    = 1,\n  pages     = {941},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {{RNA} secondary structure prediction using deep learning with thermodynamic integration},\n  volume    = 12,\n  year      = 2021\n}\n
"},{"location":"datasets/bprna-spot/","title":"bpRNA-1m","text":"

bpRNA-spot is a database of single molecule secondary structures annotated using bpRNA.

bpRNA-spot is a subset of bpRNA-1m. It applies CD-HIT (CD-HIT-EST) to bpRNA-1m to remove sequences with more than 80% sequence similarity, then randomly splits the remaining sequences into training, validation, and test sets with a ratio of approximately 8:1:1.
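
A random split with an 8:1:1 ratio can be sketched as follows. This illustrates the idea only and is not the procedure used by the bpRNA-spot authors: split_8_1_1 is a hypothetical helper, the seed is arbitrary, and the similarity filtering step is not shown.

```python
import random


def split_8_1_1(items, seed=0):
    # Shuffle a copy so the caller's data is left untouched.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    # The test set takes whatever remains, so no item is dropped.
    test = shuffled[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_8_1_1(range(1000))
```

Because the test set absorbs the remainder, the ratio is only approximately 8:1:1 when the number of items is not a multiple of ten.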

"},{"location":"datasets/bprna-spot/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of bpRNA-spot by Jaswinder Singh et al.

The team releasing bpRNA-spot did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/bprna-spot/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/bprna-spot/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/bprna-spot/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/bprna-spot/#citation","title":"Citation","text":"BibTeX
@article{singh2019rna,\n  author    = {Singh, Jaswinder and Hanson, Jack and Paliwal, Kuldip and Zhou, Yaoqi},\n  journal   = {Nature Communications},\n  month     = nov,\n  number    = 1,\n  pages     = {5407},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {{RNA} secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning},\n  volume    = 10,\n  year      = 2019\n}\n\n@article{darty2009varna,\n  author    = {Darty, K{\\'e}vin and Denise, Alain and Ponty, Yann},\n  journal   = {Bioinformatics},\n  month     = aug,\n  number    = 15,\n  pages     = {1974--1975},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{VARNA}: Interactive drawing and editing of the {RNA} secondary structure},\n  volume    = 25,\n  year      = 2009\n}\n\n@article{berman2000protein,\n  author    = {Berman, H M and Westbrook, J and Feng, Z and Gilliland, G and Bhat, T N and Weissig, H and Shindyalov, I N and Bourne, P E},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = 1,\n  pages     = {235--242},\n  publisher = {Oxford University Press (OUP)},\n  title     = {The Protein Data Bank},\n  volume    = 28,\n  year      = 2000\n}\n
"},{"location":"datasets/bprna/","title":"bpRNA-1m","text":"

bpRNA-1m is a database of single molecule secondary structures annotated using bpRNA.
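
Secondary structures in this dataset are stored as dot-bracket strings, which can be read as a list of base pairs. The sketch below is a hypothetical helper, not the library's dot_bracket_to_contact_map; it assumes only round brackets and dots, so pseudoknot brackets such as square or curly brackets are ignored.

```python
def dot_bracket_pairs(structure):
    # Each '(' opens a pair that is closed by the matching ')'.
    stack = []
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == '(':
            stack.append(index)
        elif symbol == ')':
            if not stack:
                raise ValueError('unbalanced closing bracket at position %d' % index)
            pairs.append((stack.pop(), index))
    if stack:
        raise ValueError('unbalanced opening bracket at position %d' % stack[-1])
    return pairs


# Structure string taken from the bpRNA_RFAM_1016 example entry.
structure = '......(((.((((....)))).)))......'
pairs = dot_bracket_pairs(structure)
```

Each pair (i, j) marks two zero-based positions that are bonded; a contact map is the symmetric matrix with ones at those positions.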

"},{"location":"datasets/bprna/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of bpRNA-1m by the Center for Quantitative Life Sciences at Oregon State University.

The team releasing bpRNA-1m did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/bprna/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/bprna/#example-entry","title":"Example Entry","text":"id sequence secondary_structure structural_annotation functional_annotation bpRNA_RFAM_1016 AUUGCUUCUCGGCCUUUUGGCUAACAUCAAGU\u2026 ......(((.((((....)))).)))...... EEEEEESSSISSSSHHHHSSSSISSSXXXXXX\u2026 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\u2026"},{"location":"datasets/bprna/#column-description","title":"Column Description","text":"

The converted dataset consists of the following columns, each providing specific information about the RNA secondary structures, consistent with the bpRNA standard:

"},{"location":"datasets/bprna/#variations","title":"Variations","text":"

This dataset is available in two variants:

"},{"location":"datasets/bprna/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/bprna/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/bprna/#citation","title":"Citation","text":"BibTeX
@article{danaee2018bprna,\n  author  = {Danaee, Padideh and Rouches, Mason and Wiley, Michelle and Deng, Dezhong and Huang, Liang and Hendrix, David},\n  journal = {Nucleic Acids Research},\n  month   = jun,\n  number  = 11,\n  pages   = {5381--5394},\n  title   = {{bpRNA}: large-scale automated annotation and analysis of {RNA} secondary structure},\n  volume  = 46,\n  year    = 2018\n}\n\n@article{cannone2002comparative,\n  author    = {Cannone, Jamie J and Subramanian, Sankar and Schnare, Murray N and Collett, James R and D'Souza, Lisa M and Du, Yushi and Feng, Brian and Lin, Nan and Madabusi, Lakshmi V and M{\\\"u}ller, Kirsten M and Pande, Nupur and Shang, Zhidi and Yu, Nan and Gutell, Robin R},\n  copyright = {https://www.springernature.com/gp/researchers/text-and-data-mining},\n  journal   = {BMC Bioinformatics},\n  month     = jan,\n  number    = 1,\n  pages     = {2},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {The comparative {RNA} web ({CRW}) site: an online database of comparative sequence and structure information for ribosomal, intron, and other {RNAs}},\n  volume    = 3,\n  year      = 2002\n}\n\n@article{zwieb2003tmrdb,\n  author    = {Zwieb, Christian and Gorodkin, Jan and Knudsen, Bjarne and Burks, Jody and Wower, Jacek},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = 1,\n  pages     = {446--447},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{tmRDB} ({tmRNA} database)},\n  volume    = 31,\n  year      = 2003\n}\n\n@article{rosenblad2003srpdb,\n  author    = {Rosenblad, Magnus Alm and Gorodkin, Jan and Knudsen, Bjarne and Zwieb, Christian and Samuelsson, Tore},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = 1,\n  pages     = {363--364},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{SRPDB}: Signal Recognition Particle Database},\n  volume    = 31,\n  year      = 2003\n}\n\n@article{sprinzl2005compilation,\n  author    = 
{Sprinzl, Mathias and Vassilenko, Konstantin S},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {Database issue},\n  pages     = {D139--40},\n  publisher = {Oxford University Press (OUP)},\n  title     = {Compilation of {tRNA} sequences and sequences of {tRNA} genes},\n  volume    = 33,\n  year      = 2005\n}\n\n@article{brown1994ribonuclease,\n  author    = {Brown, J W and Haas, E S and Gilbert, D G and Pace, N R},\n  journal   = {Nucleic Acids Research},\n  month     = sep,\n  number    = 17,\n  pages     = {3660--3662},\n  publisher = {Oxford University Press (OUP)},\n  title     = {The Ribonuclease {P} database},\n  volume    = 22,\n  year      = 1994\n}\n\n@article{griffiths2003rfam,\n  author    = {Griffiths-Jones, Sam and Bateman, Alex and Marshall, Mhairi and Khanna, Ajay and Eddy, Sean R},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = 1,\n  pages     = {439--441},\n  publisher = {Oxford University Press (OUP)},\n  title     = {Rfam: an {RNA} family database},\n  volume    = 31,\n  year      = 2003\n}\n\n@article{berman2000protein,\n  author    = {Berman, H M and Westbrook, J and Feng, Z and Gilliland, G and Bhat, T N and Weissig, H and Shindyalov, I N and Bourne, P E},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = 1,\n  pages     = {235--242},\n  publisher = {Oxford University Press (OUP)},\n  title     = {The Protein Data Bank},\n  volume    = 28,\n  year      = 2000\n}\n
"},{"location":"datasets/eternabench-cm/","title":"EternaBench-CM","text":"

EternaBench-CM is a synthetic RNA dataset comprising 12,711 RNA constructs that have been chemically mapped using SHAPE and MAP-seq methods. These RNA sequences are probed to obtain experimental data on their nucleotide reactivity, which indicates whether specific regions of the RNA are flexible or structured. The dataset provides high-resolution, large-scale data that can be used for studying RNA folding and stability.

"},{"location":"datasets/eternabench-cm/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of EternaBench-CM by Hannah K. Wayment-Steele et al.

The team that released EternaBench-CM did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/eternabench-cm/#dataset-description","title":"Dataset Description","text":"

The dataset includes a large set of synthetic RNA sequences with experimental chemical mapping data, which provides a quantitative readout of RNA nucleotide reactivity. These data are ensemble-averaged and serve as a critical benchmark for evaluating secondary structure prediction algorithms in their ability to model RNA folding dynamics.

"},{"location":"datasets/eternabench-cm/#example-entry","title":"Example Entry","text":"index design sequence secondary_structure reactivity errors signal_to_noise 769337-1 d+m plots weaker again GGAAAAAAAAAAA\u2026 ................ [0.642,1.4853,0.1629, \u2026] [0.3181,0.4221,0.1823, \u2026] 3.227"},{"location":"datasets/eternabench-cm/#column-description","title":"Column Description","text":""},{"location":"datasets/eternabench-cm/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/eternabench-cm/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/eternabench-cm/#citation","title":"Citation","text":"BibTeX
@article{waymentsteele2022rna,\n  author    = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},\n  journal   = {Nature Methods},\n  month     = oct,\n  number    = 10,\n  pages     = {1234--1242},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},\n  volume    = 19,\n  year      = 2022\n}\n
"},{"location":"datasets/eternabench-external/","title":"EternaBench-External","text":"

EternaBench-External consists of 31 independent RNA datasets from various biological sources, including viral genomes, mRNAs, and synthetic RNAs. These sequences were probed using techniques such as SHAPE-CE, SHAPE-MaP, and DMS-MaP-seq to understand RNA secondary structures under different experimental and biological conditions. This dataset serves as a benchmark for evaluating RNA structure prediction models, with a particular focus on generalization to natural RNA molecules.

"},{"location":"datasets/eternabench-external/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of EternaBench-External by Hannah K. Wayment-Steele et al.

The team that released EternaBench-External did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/eternabench-external/#dataset-description","title":"Dataset Description","text":"

This dataset includes RNA sequences from various biological origins, including viral genomes and mRNAs, and covers a wide range of probing methods like SHAPE-CE and icSHAPE. Each dataset entry provides sequence information, reactivity profiles, and RNA secondary structure data. This dataset can be used to examine how RNA structures vary under different conditions and to validate structural predictions for diverse RNA types.

"},{"location":"datasets/eternabench-external/#example-entry","title":"Example Entry","text":"name sequence reactivity seqpos class dataset Dadonaite,2019 Influenza genome SHAPE(1M7) SSII-Mn(2+) Mut. TTTACCCACAGCTGTGAATT\u2026 [0.639309,0.813297,0.622869,\u2026] [7425,7426,7427,\u2026] viral_gRNA Dadonaite,2019"},{"location":"datasets/eternabench-external/#column-description","title":"Column Description","text":""},{"location":"datasets/eternabench-external/#variations","title":"Variations","text":"

This dataset is available in four variants:

"},{"location":"datasets/eternabench-external/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/eternabench-external/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/eternabench-external/#citation","title":"Citation","text":"BibTeX
@article{waymentsteele2022rna,\n  author    = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},\n  journal   = {Nature Methods},\n  month     = oct,\n  number    = 10,\n  pages     = {1234--1242},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},\n  volume    = 19,\n  year      = 2022\n}\n
"},{"location":"datasets/eternabench-switch/","title":"EternaBench-Switch","text":"

EternaBench-Switch is a synthetic RNA dataset consisting of 7,228 riboswitch constructs, designed to explore the structural behavior of RNA molecules that change conformation upon binding to ligands such as FMN, theophylline, or tryptophan. These riboswitches exhibit different structural states in the presence or absence of their ligands, and the dataset includes detailed measurements of binding affinities (dissociation constants), activation ratios, and RNA folding properties.

"},{"location":"datasets/eternabench-switch/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of EternaBench-Switch by Hannah K. Wayment-Steele et al.

The team that released EternaBench-Switch did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/eternabench-switch/#dataset-description","title":"Dataset Description","text":"

The dataset includes synthetic RNA sequences designed to act as riboswitches. These molecules can adopt different structural states in response to ligand binding, and the dataset provides detailed information on the binding affinities for various ligands, along with metrics on the RNA\u2019s ability to switch between conformations. With over 7,000 entries, this dataset supports studies of RNA folding, ligand interaction, and RNA structural dynamics.

"},{"location":"datasets/eternabench-switch/#example-entry","title":"Example Entry","text":"id design sequence activation_ratio ligand switch kd_off kd_on kd_fmn kd_no_fmn min_kd_val ms2_aptamer lig_aptamer ms2_lig_aptamer log_kd_nolig log_kd_lig log_kd_nolig_scaled log_kd_lig_scaled log_AR folding_subscore num_clusters 286 null AGGAAACAUGAGGAU\u2026 0.8824621522 FMN OFF 13.3115 15.084 null null 3.0082 .....(((((x((xxxx)))))))..... .................. .....(((((x((xx\u2026 2.7137 2.5886 1.6123 1.4873 -0.125 null null"},{"location":"datasets/eternabench-switch/#column-description","title":"Column Description","text":""},{"location":"datasets/eternabench-switch/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/eternabench-switch/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/eternabench-switch/#citation","title":"Citation","text":"BibTeX
@article{waymentsteele2022rna,\n  author    = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},\n  journal   = {Nature Methods},\n  month     = oct,\n  number    = 10,\n  pages     = {1234--1242},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},\n  volume    = 19,\n  year      = 2022\n}\n
"},{"location":"datasets/gencode/","title":"GENCODE","text":"

GENCODE is a comprehensive annotation project that aims to provide high-quality annotations of the human and mouse genomes. The project is part of the ENCODE (ENCyclopedia Of DNA Elements) scale-up project, which seeks to identify all functional elements in the human genome.

"},{"location":"datasets/gencode/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of GENCODE by Paul Flicek, Roderic Guigo, Manolis Kellis, Mark Gerstein, Benedict Paten, Michael Tress, Jyoti Choudhary, et al.

The team that released GENCODE did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/gencode/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/gencode/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/gencode/#datasets","title":"Datasets","text":"

The GENCODE dataset is available in Human and Mouse:

"},{"location":"datasets/gencode/#citation","title":"Citation","text":"BibTeX
@article{frankish2023gencode,\n  author    = {Frankish, Adam and Carbonell-Sala, S{\\'\\i}lvia and Diekhans, Mark and Jungreis, Irwin and Loveland, Jane E and Mudge, Jonathan M and Sisu, Cristina and Wright, James C and Arnan, Carme and Barnes, If and Banerjee, Abhimanyu and Bennett, Ruth and Berry, Andrew and Bignell, Alexandra and Boix, Carles and Calvet, Ferriol and Cerd{\\'a}n-V{\\'e}lez, Daniel and Cunningham, Fiona and Davidson, Claire and Donaldson, Sarah and Dursun, Cagatay and Fatima, Reham and Giorgetti, Stefano and Giron, Carlos Garc{\\i}a and Gonzalez, Jose Manuel and Hardy, Matthew and Harrison, Peter W and Hourlier, Thibaut and Hollis, Zoe and Hunt, Toby and James, Benjamin and Jiang, Yunzhe and Johnson, Rory and Kay, Mike and Lagarde, Julien and Martin, Fergal J and G{\\'o}mez, Laura Mart{\\'\\i}nez and Nair, Surag and Ni, Pengyu and Pozo, Fernando and Ramalingam, Vivek and Ruffier, Magali and Schmitt, Bianca M and Schreiber, Jacob M and Steed, Emily and Suner, Marie-Marthe and Sumathipala, Dulika and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and Wass, Elizabeth and Yang, Yucheng T and Yates, Andrew and Zafrulla, Zahoor and Choudhary, Jyoti S and Gerstein, Mark and Guigo, Roderic and Hubbard, Tim J P and Kellis, Manolis and Kundaje, Anshul and Paten, Benedict and Tress, Michael L and Flicek, Paul},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {D1},\n  pages     = {D942--D949},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{GENCODE}: reference annotation for the human and mouse genomes in 2023},\n  volume    = 51,\n  year      = 2023\n}\n\n@article{frankish2021gencode,\n  author    = {Frankish, Adam and Diekhans, Mark and Jungreis, Irwin and Lagarde, Julien and Loveland, Jane E and Mudge, Jonathan M and Sisu, Cristina and Wright, James C and Armstrong, Joel and Barnes, If and Berry, Andrew and Bignell, Alexandra and Boix, Carles and Carbonell Sala, Silvia and Cunningham, Fiona and Di Domenico, 
Tom{\\'a}s and Donaldson, Sarah and Fiddes, Ian T and Garc{\\'\\i}a Gir{\\'o}n, Carlos and Gonzalez, Jose Manuel and Grego, Tiago and Hardy, Matthew and Hourlier, Thibaut and Howe, Kevin L and Hunt, Toby and Izuogu, Osagie G and Johnson, Rory and Martin, Fergal J and Mart{\\'\\i}nez, Laura and Mohanan, Shamika and Muir, Paul and Navarro, Fabio C P and Parker, Anne and Pei, Baikang and Pozo, Fernando and Riera, Ferriol Calvet and Ruffier, Magali and Schmitt, Bianca M and Stapleton, Eloise and Suner, Marie-Marthe and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and Wolf, Maxim Y and Xu, Jinuri and Yang, Yucheng T and Yates, Andrew and Zerbino, Daniel and Zhang, Yan and Choudhary, Jyoti S and Gerstein, Mark and Guig{\\'o}, Roderic and Hubbard, Tim J P and Kellis, Manolis and Paten, Benedict and Tress, Michael L and Flicek, Paul},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {D1},\n  pages     = {D916--D923},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{GENCODE} 2021},\n  volume    = 49,\n  year      = 2021\n}\n\n@article{frankish2019gencode,\n  author    = {Frankish, Adam and Diekhans, Mark and Ferreira, Anne-Maud and Johnson, Rory and Jungreis, Irwin and Loveland, Jane and Mudge, Jonathan M and Sisu, Cristina and Wright, James and Armstrong, Joel and Barnes, If and Berry, Andrew and Bignell, Alexandra and Carbonell Sala, Silvia and Chrast, Jacqueline and Cunningham, Fiona and Di Domenico, Tom{\\'a}s and Donaldson, Sarah and Fiddes, Ian T and Garc{\\'\\i}a Gir{\\'o}n, Carlos and Gonzalez, Jose Manuel and Grego, Tiago and Hardy, Matthew and Hourlier, Thibaut and Hunt, Toby and Izuogu, Osagie G and Lagarde, Julien and Martin, Fergal J and Mart{\\'\\i}nez, Laura and Mohanan, Shamika and Muir, Paul and Navarro, Fabio C P and Parker, Anne and Pei, Baikang and Pozo, Fernando and Ruffier, Magali and Schmitt, Bianca M and Stapleton, Eloise and Suner, Marie-Marthe and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and 
Xu, Jinuri and Yates, Andrew and Zerbino, Daniel and Zhang, Yan and Aken, Bronwen and Choudhary, Jyoti S and Gerstein, Mark and Guig{\\'o}, Roderic and Hubbard, Tim J P and Kellis, Manolis and Paten, Benedict and Reymond, Alexandre and Tress, Michael L and Flicek, Paul},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {D1},\n  pages     = {D766--D773},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{GENCODE} reference annotation for the human and mouse genomes},\n  volume    = 47,\n  year      = 2019\n}\n\n@article{mudge2015creating,\n  author    = {Mudge, Jonathan M and Harrow, Jennifer},\n  copyright = {https://creativecommons.org/licenses/by/4.0},\n  journal   = {Mamm. Genome},\n  language  = {en},\n  month     = oct,\n  number    = {9-10},\n  pages     = {366--378},\n  publisher = {Springer Science and Business Media LLC},\n  title     = {Creating reference gene annotation for the mouse {C57BL6/J} genome assembly},\n  volume    = 26,\n  year      = 2015\n}\n\n@article{harrow2012gencode,\n  author   = {Harrow, Jennifer and Frankish, Adam and Gonzalez, Jose M and Tapanari, Electra and Diekhans, Mark and Kokocinski, Felix and Aken, Bronwen L and Barrell, Daniel and Zadissa, Amonida and Searle, Stephen and Barnes, If and Bignell, Alexandra and Boychenko, Veronika and Hunt, Toby and Kay, Mike and Mukherjee, Gaurab and Rajan, Jeena and Despacio-Reyes, Gloria and Saunders, Gary and Steward, Charles and Harte, Rachel and Lin, Michael and Howald, C{\\'e}dric and Tanzer, Andrea and Derrien, Thomas and Chrast, Jacqueline and Walters, Nathalie and Balasubramanian, Suganthi and Pei, Baikang and Tress, Michael and Rodriguez, Jose Manuel and Ezkurdia, Iakes and van Baren, Jeltje and Brent, Michael and Haussler, David and Kellis, Manolis and Valencia, Alfonso and Reymond, Alexandre and Gerstein, Mark and Guig{\\'o}, Roderic and Hubbard, Tim J},\n  journal  = {Genome Research},\n  month    = sep,\n  number   = 9,\n  pages    = 
{1760--1774},\n  title    = {{GENCODE}: the reference human genome annotation for The {ENCODE} Project},\n  volume   = 22,\n  year     = 2012\n}\n\n@article{harrow2006gencode,\n  author    = {Harrow, Jennifer and Denoeud, France and Frankish, Adam and  Reymond, Alexandre and Chen, Chao-Kung and Chrast, Jacqueline  and Lagarde, Julien and Gilbert, James G R and Storey, Roy and  Swarbreck, David and Rossier, Colette and Ucla, Catherine and  Hubbard, Tim and Antonarakis, Stylianos E and Guigo, Roderic},\n  journal   = {Genome Biology},\n  month     = aug,\n  number    = {Suppl 1},\n  pages     = {S4.1--9},\n  publisher = {Springer Nature},\n  title     = {{GENCODE}: producing a reference annotation for {ENCODE}},\n  volume    = {7 Suppl 1},\n  year      = 2006\n}\n
"},{"location":"datasets/rfam/","title":"Rfam","text":"

Rfam is a database of structure-annotated multiple sequence alignments, covariance models and family annotation for a number of non-coding RNA, cis-regulatory and self-splicing intron families.

The seed alignments are hand curated and aligned using available sequence and structure data, and covariance models are built from these alignments using the INFERNAL v1.1.4 software suite.

The full regions list is created by searching the RFAMSEQ database using the covariance model, and then listing all hits above a family specific threshold to the model.

Rfam is maintained by a consortium of researchers at the European Bioinformatics Institute, Sean Eddy\u2019s laboratory and Eric Nawrocki.

"},{"location":"datasets/rfam/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of Rfam by Ioanna Kalvari, Eric P. Nawrocki, Sarah W. Burge, Paul P. Gardner, Sam Griffiths-Jones, et al.

The team that released Rfam did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/rfam/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/rfam/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n

Tip

The original Rfam dataset is licensed under the CC0 1.0 Universal license and is available at Rfam.

"},{"location":"datasets/rfam/#citation","title":"Citation","text":"BibTeX
@article{kalvari2021rfam,\n  author    = {Kalvari, Ioanna and Nawrocki, Eric P and Ontiveros-Palacios, Nancy and Argasinska, Joanna and Lamkiewicz, Kevin and Marz, Manja and Griffiths-Jones, Sam and Toffano-Nioche, Claire and Gautheret, Daniel and Weinberg, Zasha and Rivas, Elena and Eddy, Sean R and Finn, Robert D and Bateman, Alex and Petrov, Anton I},\n  copyright = {http://creativecommons.org/licenses/by/4.0/},\n  journal   = {Nucleic Acids Research},\n  language  = {en},\n  month     = jan,\n  number    = {D1},\n  pages     = {D192--D200},\n  publisher = {Oxford University Press (OUP)},\n  title     = {Rfam 14: expanded coverage of metagenomic, viral and {microRNA} families},\n  volume    = 49,\n  year      = 2021\n}\n\n@article{hufsky2021computational,\n  author    = {Hufsky, Franziska and Lamkiewicz, Kevin and Almeida, Alexandre and Aouacheria, Abdel and Arighi, Cecilia and Bateman, Alex and Baumbach, Jan and Beerenwinkel, Niko and Brandt, Christian and Cacciabue, Marco and Chuguransky, Sara and Drechsel, Oliver and Finn, Robert D and Fritz, Adrian and Fuchs, Stephan and Hattab, Georges and Hauschild, Anne-Christin and Heider, Dominik and Hoffmann, Marie and H{\\\"o}lzer, Martin and Hoops, Stefan and Kaderali, Lars and Kalvari, Ioanna and von Kleist, Max and Kmiecinski, Ren{\\'o} and K{\\\"u}hnert, Denise and Lasso, Gorka and Libin, Pieter and List, Markus and L{\\\"o}chel, Hannah F and Martin, Maria J and Martin, Roman and Matschinske, Julian and McHardy, Alice C and Mendes, Pedro and Mistry, Jaina and Navratil, Vincent and Nawrocki, Eric P and O'Toole, {\\'A}ine Niamh and Ontiveros-Palacios, Nancy and Petrov, Anton I and Rangel-Pineros, Guillermo and Redaschi, Nicole and Reimering, Susanne and Reinert, Knut and Reyes, Alejandro and Richardson, Lorna and Robertson, David L and Sadegh, Sepideh and Singer, Joshua B and Theys, Kristof and Upton, Chris and Welzel, Marius and Williams, Lowri and Marz, Manja},\n  copyright = 
{http://creativecommons.org/licenses/by/4.0/},\n  journal   = {Briefings in Bioinformatics},\n  month     = mar,\n  number    = 2,\n  pages     = {642--663},\n  publisher = {Oxford University Press (OUP)},\n  title     = {Computational strategies to combat {COVID-19}: useful tools to accelerate {SARS-CoV-2} and coronavirus research},\n  volume    = 22,\n  year      = 2021\n}\n\n@article{kalvari2018noncoding,\n  author  = {Kalvari, Ioanna and Nawrocki, Eric P and Argasinska, Joanna and Quinones-Olvera, Natalia and Finn, Robert D and Bateman, Alex and Petrov, Anton I},\n  journal = {Current Protocols in Bioinformatics},\n  month   = jun,\n  number  = 1,\n  pages   = {e51},\n  title   = {Non-coding {RNA} analysis using the rfam database},\n  volume  = 62,\n  year    = 2018\n}\n\n@article{kalvari2018rfam,\n  author  = {Kalvari, Ioanna and Argasinska, Joanna and Quinones-Olvera,\n             Natalia and Nawrocki, Eric P and Rivas, Elena and Eddy, Sean R\n             and Bateman, Alex and Finn, Robert D and Petrov, Anton I},\n  journal = {Nucleic Acids Research},\n  month   = jan,\n  number  = {D1},\n  pages   = {D335--D342},\n  title   = {Rfam 13.0: shifting to a genome-centric resource for non-coding {RNA} families},\n  volume  = 46,\n  year    = 2018\n}\n\n@article{nawrocki2015rfam,\n  author    = {Nawrocki, Eric P and Burge, Sarah W and Bateman, Alex and Daub, Jennifer and Eberhardt, Ruth Y and Eddy, Sean R and Floden, Evan W and Gardner, Paul P and Jones, Thomas A and Tate, John and Finn, Robert D},\n  copyright = {http://creativecommons.org/licenses/by/4.0/},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {Database issue},\n  pages     = {D130--7},\n  publisher = {Oxford University Press (OUP)},\n  title     = {Rfam 12.0: updates to the {RNA} families database},\n  volume    = 43,\n  year      = 2015\n}\n\n@article{burge2013rfam,\n  author    = {Burge, Sarah W and Daub, Jennifer and Eberhardt, Ruth and Tate, John and Barquist, Lars 
and Nawrocki, Eric P and Eddy, Sean R and Gardner, Paul P and Bateman, Alex},\n  copyright = {http://creativecommons.org/licenses/by-nc/3.0/},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {Database issue},\n  pages     = {D226--32},\n  publisher = {Oxford University Press (OUP)},\n  title     = {Rfam 11.0: 10 years of {RNA} families},\n  volume    = 41,\n  year      = 2013\n}\n\n@article{gardner2011rfam,\n  author  = {Gardner, Paul P and Daub, Jennifer and Tate, John and Moore, Benjamin L and Osuch, Isabelle H and Griffiths-Jones, Sam and Finn, Robert D and Nawrocki, Eric P and Kolbe, Diana L and Eddy, Sean R and Bateman, Alex},\n  journal = {Nucleic Acids Research},\n  month   = jan,\n  number  = {Database issue},\n  pages   = {D141--5},\n  title   = {Rfam: Wikipedia, clans and the ``decimal'' release},\n  volume  = 39,\n  year    = 2011\n}\n\n@article{gardner2009rfam,\n  author   = {Gardner, Paul P and Daub, Jennifer and Tate, John G and Nawrocki, Eric P and Kolbe, Diana L and Lindgreen, Stinus and Wilkinson, Adam C and Finn, Robert D and Griffiths-Jones, Sam and Eddy, Sean R and Bateman, Alex},\n  journal  = {Nucleic Acids Research},\n  month    = jan,\n  number   = {Database issue},\n  pages    = {D136--40},\n  title    = {Rfam: updates to the {RNA} families database},\n  volume   = 37,\n  year     = 2009\n}\n\n@article{daub2008rna,\n  author   = {Daub, Jennifer and Gardner, Paul P and Tate, John and Ramsk{\\\"o}ld, Daniel and Manske, Magnus and Scott, William G and Weinberg, Zasha and Griffiths-Jones, Sam and Bateman, Alex},\n  journal  = {RNA},\n  month    = dec,\n  number   = 12,\n  pages    = {2462--2464},\n  title    = {The {RNA} {WikiProject}: community annotation of {RNA} families},\n  volume   = 14,\n  year     = 2008\n}\n\n@article{griffiths2005rfam,\n  author   = {Griffiths-Jones, Sam and Moxon, Simon and Marshall, Mhairi and Khanna, Ajay and Eddy, Sean R. 
and Bateman, Alex},\n  doi      = {10.1093/nar/gki081},\n  eprint   = {https://academic.oup.com/nar/article-pdf/33/suppl\\_1/D121/7622063/gki081.pdf},\n  issn     = {0305-1048},\n  journal  = {Nucleic Acids Research},\n  month    = jan,\n  number   = {suppl_1},\n  pages    = {D121-D124},\n  title    = {{Rfam: annotating non-coding RNAs in complete genomes}},\n  url      = {https://doi.org/10.1093/nar/gki081},\n  volume   = {33},\n  year     = {2005}\n}\n\n@article{griffiths2003rfam,\n  author   = {Griffiths-Jones, Sam and Bateman, Alex and Marshall, Mhairi and Khanna, Ajay and Eddy, Sean R.},\n  doi      = {10.1093/nar/gkg006},\n  eprint   = {https://academic.oup.com/nar/article-pdf/31/1/439/7125749/gkg006.pdf},\n  issn     = {0305-1048},\n  journal  = {Nucleic Acids Research},\n  month    = jan,\n  number   = {1},\n  pages    = {439-441},\n  title    = {{Rfam: an RNA family database}},\n  url      = {https://doi.org/10.1093/nar/gkg006},\n  volume   = {31},\n  year     = {2003}\n}\n
"},{"location":"datasets/rivas/","title":"RIVAS","text":"

The RIVAS dataset is a curated collection of RNA sequences and their secondary structures, designed for training and evaluating RNA secondary structure prediction methods. The dataset combines sequences from published studies and databases like Rfam, covering diverse RNA families such as tRNA, SRP RNA, and ribozymes. The secondary structure data is obtained from experimentally verified structures and consensus structures from Rfam alignments, ensuring high-quality annotations for model training and evaluation.

"},{"location":"datasets/rivas/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the RIVAS dataset by Elena Rivas et al.

The team that released RIVAS did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/rivas/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/rivas/#example-entry","title":"Example Entry","text":"id sequence secondary_structure AACY020454584.1_604-676 ACUGGUUGCGGCCAGUAUAAAUAGUCUUUAAG\u2026 ((((........)))).........((........"},{"location":"datasets/rivas/#column-description","title":"Column Description","text":"

The converted dataset consists of the following columns, each providing specific information about the RNA secondary structures, consistent with the bpRNA standard:

"},{"location":"datasets/rivas/#variations","title":"Variations","text":"

This dataset is available in three variants:

"},{"location":"datasets/rivas/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/rivas/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/rivas/#citation","title":"Citation","text":"BibTeX
@article{rivas2012a,\n  author    = {Rivas, Elena and Lang, Raymond and Eddy, Sean R},\n  journal   = {RNA},\n  month     = feb,\n  number    = 2,\n  pages     = {193--212},\n  publisher = {Cold Spring Harbor Laboratory},\n  title     = {A range of complex probabilistic models for {RNA} secondary structure prediction that includes the nearest-neighbor model and more},\n  volume    = 18,\n  year      = 2012\n}\n
"},{"location":"datasets/rnacentral/","title":"RNAcentral","text":"

RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.

The development of RNAcentral is coordinated by the European Bioinformatics Institute and is supported by Wellcome. Initial funding was provided by BBSRC.

"},{"location":"datasets/rnacentral/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of RNAcentral by the RNAcentral Consortium.

The team that released RNAcentral did not write a dataset card for it, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/rnacentral/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/rnacentral/#variations","title":"Variations","text":"

This dataset is available in five additional variants:

"},{"location":"datasets/rnacentral/#derived-datasets","title":"Derived Datasets","text":"

In addition to the main RNAcentral dataset, we also provide the following derived datasets:

"},{"location":"datasets/rnacentral/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n

Tip

The original RNAcentral dataset is licensed under the CC0 1.0 Universal license and is available at RNAcentral.

"},{"location":"datasets/rnacentral/#citation","title":"Citation","text":"BibTeX
@article{rnacentral2021,\n  author    = {{RNAcentral Consortium}},\n  doi       = {https://doi.org/10.1093/nar/gkaa921},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {D1},\n  pages     = {D212--D220},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{RNAcentral} 2021: secondary structure integration, improved sequence search and new member databases},\n  url       = {https://academic.oup.com/nar/article/49/D1/D212/5940500},\n  volume    = 49,\n  year      = 2021\n}\n\n@article{sweeney2020exploring,\n  author   = {Sweeney, Blake A. and Tagmazian, Arina A. and Ribas, Carlos E. and Finn, Robert D. and Bateman, Alex and Petrov, Anton I.},\n  doi      = {https://doi.org/10.1002/cpbi.104},\n  eprint   = {https://currentprotocols.onlinelibrary.wiley.com/doi/pdf/10.1002/cpbi.104},\n  journal  = {Current Protocols in Bioinformatics},\n  keywords = {Galaxy, ncRNA, non-coding RNA, RNAcentral, RNA-seq},\n  number   = {1},\n  pages    = {e104},\n  title    = {Exploring Non-Coding RNAs in RNAcentral},\n  url      = {https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.104},\n  volume   = 71,\n  year     = 2020\n}\n\n@article{rnacentral2019,\n  author    = {{The RNAcentral Consortium}},\n  doi       = {https://doi.org/10.1093/nar/gky1034},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {D1},\n  pages     = {D221--D229},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{RNAcentral}: a hub of information for non-coding {RNA} sequences},\n  url       = {https://academic.oup.com/nar/article/47/D1/D221/5160993},\n  volume    = 47,\n  year      = 2019\n}\n\n@article{rnacentral2017,\n  author    = {{The RNAcentral Consortium} and Petrov, Anton I and Kay, Simon J E and Kalvari, Ioanna and Howe, Kevin L and Gray, Kristian A and Bruford, Elspeth A and Kersey, Paul J and Cochrane, Guy and Finn, Robert D and Bateman, Alex and Kozomara, Ana and Griffiths-Jones, Sam and Frankish, Adam and 
Zwieb, Christian W and Lau, Britney Y and Williams, Kelly P and Chan, Patricia P and Lowe, Todd M and Cannone, Jamie J and Gutell, Robin and Machnicka, Magdalena A and Bujnicki, Janusz M and Yoshihama, Maki and Kenmochi, Naoya and Chai, Benli and Cole, James R and Szymanski, Maciej and Karlowski, Wojciech M and Wood, Valerie and Huala, Eva and Berardini, Tanya Z and Zhao, Yi and Chen, Runsheng and Zhu, Weimin and Paraskevopoulou, Maria D and Vlachos, Ioannis S and Hatzigeorgiou, Artemis G and Ma, Lina and Zhang, Zhang and Puetz, Joern and Stadler, Peter F and McDonald, Daniel and Basu, Siddhartha and Fey, Petra and Engel, Stacia R and Cherry, J Michael and Volders, Pieter-Jan and Mestdagh, Pieter and Wower, Jacek and Clark, Michael B and Quek, Xiu Cheng and Dinger, Marcel E},\n  doi       = {https://doi.org/10.1093/nar/gkw1008},\n  journal   = {Nucleic Acids Research},\n  month     = jan,\n  number    = {D1},\n  pages     = {D128--D134},\n  publisher = {Oxford University Press (OUP)},\n  title     = {{RNAcentral}: a comprehensive database of non-coding {RNA} sequences},\n  url       = {https://academic.oup.com/nar/article/45/D1/D128/2333921},\n  volume    = 45,\n  year      = 2017\n}\n\n@article{rnacentral2015,\n  author  = {{RNAcentral Consortium} and Petrov, Anton I and Kay, Simon J E and Gibson, Richard and Kulesha, Eugene and Staines, Dan and Bruford, Elspeth A and Wright, Mathew W and Burge, Sarah and Finn, Robert D and Kersey, Paul J and Cochrane, Guy and Bateman, Alex and Griffiths-Jones, Sam and Harrow, Jennifer and Chan, Patricia P and Lowe, Todd M and Zwieb, Christian W and Wower, Jacek and Williams, Kelly P and Hudson, Corey M and Gutell, Robin and Clark, Michael B and Dinger, Marcel and Quek, Xiu Cheng and Bujnicki, Janusz M and Chua, Nam-Hai and Liu, Jun and Wang, Huan and Skogerb{\\o}, Geir and Zhao, Yi and Chen, Runsheng and Zhu, Weimin and Cole, James R and Chai, Benli and Huang, Hsien-Da and Huang, His-Yuan and Cherry, J Michael and Hatzigeorgiou, 
Artemis and Pruitt, Kim D},\n  doi     = {https://doi.org/10.1093/nar/gku991},\n  journal = {Nucleic Acids Research},\n  month   = jan,\n  number  = {Database issue},\n  pages   = {D123--D129},\n  title   = {{RNAcentral}: an international database of {ncRNA} sequences},\n  url     = {https://academic.oup.com/nar/article/43/D1/D123/2439941},\n  volume  = 43,\n  year    = 2015\n}\n\n@article{bateman2011rnacentral,\n  author    = {Bateman, Alex and Agrawal, Shipra and Birney, Ewan and Bruford, Elspeth A and Bujnicki, Janusz M and Cochrane, Guy and Cole, James R and Dinger, Marcel E and Enright, Anton J and Gardner, Paul P and Gautheret, Daniel and Griffiths-Jones, Sam and Harrow, Jen and Herrero, Javier and Holmes, Ian H and Huang, Hsien-Da and Kelly, Krystyna A and Kersey, Paul and Kozomara, Ana and Lowe, Todd M and Marz, Manja and Moxon, Simon and Pruitt, Kim D and Samuelsson, Tore and Stadler, Peter F and Vilella, Albert J and Vogel, Jan-Hinnerk and Williams, Kelly P and Wright, Mathew W and Zwieb, Christian},\n  doi       = {https://doi.org/10.1261/rna.2750811},\n  journal   = {RNA},\n  month     = nov,\n  number    = 11,\n  pages     = {1941--1946},\n  publisher = {Cold Spring Harbor Laboratory},\n  title     = {{RNAcentral}: A vision for an international database of {RNA} sequences},\n  url       = {https://rnajournal.cshlp.org/content/17/11/1941.long},\n  volume    = 17,\n  year      = 2011\n}\n
"},{"location":"datasets/rnastralign/","title":"RNAStrAlign","text":"

RNAStrAlign is a comprehensive dataset of RNA sequences and their secondary structures.

RNAStrAlign aggregates data from multiple established RNA structure repositories, covering diverse RNA families such as 5S ribosomal RNA, tRNA, and group I introns.

It is considered complementary to the ArchiveII dataset.

"},{"location":"datasets/rnastralign/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the RNAStrAlign by Zhen Tan, et al.

The team releasing RNAStrAlign did not write a dataset card for this dataset, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/rnastralign/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/rnastralign/#example-entry","title":"Example Entry","text":"id sequence secondary_structure family subfamily 16S_rRNA-Actinobacteria-AB002635 ACACAUGCAAGCGAACGUGAUCUCCAGCUUGC\u2026 .(((.(((..((..((((.(((((.((....)\u2026 16S_rRNA Actinobacteria"},{"location":"datasets/rnastralign/#column-description","title":"Column Description","text":""},{"location":"datasets/rnastralign/#variations","title":"Variations","text":"

This dataset is available in two additional variants:

"},{"location":"datasets/rnastralign/#related-datasets","title":"Related Datasets","text":""},{"location":"datasets/rnastralign/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/rnastralign/#citation","title":"Citation","text":"BibTeX
@article{ran2017turbofold,\n  author   = {Tan, Zhen and Fu, Yinghan and Sharma, Gaurav and Mathews, David H},\n  journal  = {Nucleic Acids Research},\n  month    = nov,\n  number   = 20,\n  pages    = {11570--11581},\n  title    = {{TurboFold} {II}: {RNA} structural alignment and secondary structure prediction informed by multiple homologs},\n  volume   = 45,\n  year     = 2017\n}\n
"},{"location":"datasets/ryos/","title":"RYOS","text":"

RYOS is a database of RNA backbone stability in aqueous solution.

RYOS focuses on exploring the stability of mRNA molecules for vaccine applications. This dataset is part of a broader effort to address one of the key challenges of mRNA vaccines: degradation during shipping and storage.

"},{"location":"datasets/ryos/#statement","title":"Statement","text":"

Deep learning models for predicting RNA degradation via dual crowdsourcing is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.

Machine learning has been at the forefront of the movement for free and open access to research.

We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.

The MultiMolecule team is committed to the principles of open access and open science.

We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.

Please consider signing the Statement on Nature Machine Intelligence.

"},{"location":"datasets/ryos/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL release of the RYOS by Hannah K. Wayment-Steele, et al.

The team releasing RYOS did not write a dataset card for this dataset, so this dataset card has been written by the MultiMolecule team.

"},{"location":"datasets/ryos/#dataset-description","title":"Dataset Description","text":""},{"location":"datasets/ryos/#example-entry","title":"Example Entry","text":"id design sequence secondary_structure reactivity errors_reactivity signal_to_noise_reactivity deg_pH10 errors_deg_pH10 signal_to_noise_deg_pH10 deg_50C errors_deg_50C signal_to_noise_deg_50C deg_Mg_pH10 errors_deg_Mg_pH10 signal_to_noise_deg_Mg_pH10 deg_Mg_50C errors_deg_Mg_50C signal_to_noise_deg_Mg_50C SN_filter 9830366 testing GGAAAUUUGC\u2026 .......(((\u2026 [0.4167, 1.5941, 1.2359, \u2026] [0.1689, 0.2323, 0.193, \u2026] 5.326 [1.5966, 2.6482, 1.3761, \u2026] [0.3058, 0.3294, 0.233, \u2026] 4.198 [0.7885, 1.93, 2.0423, \u2026] 3.746 [0.2773, 0.328, 0.3048, \u2026] [1.5966, 2.6482, 1.3761, \u2026] [0.3058, 0.3294, 0.233, \u2026] 4.198 [0.7885, 1.93, 2.0423, \u2026] [0.2773, 0.328, 0.3048, \u2026] 3.746 True"},{"location":"datasets/ryos/#column-description","title":"Column Description","text":"

Note that due to technical limitations, the ground truth measurements are not available for the final bases of each RNA sequence, resulting in a shorter length for the provided labels compared to the full sequence.
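
One way to handle this length mismatch, sketched here in plain Python, is to pad each label list to the sequence length and keep a mask that marks which positions carry a measurement. The helper name pad_labels and the toy values are hypothetical; they are not part of the dataset or of the multimolecule API.

```python
import math


def pad_labels(sequence, labels, fill=math.nan):
    # Hypothetical helper: extend per-base labels to the full sequence length.
    # Positions without a measurement get `fill`, and the mask marks them False.
    if len(labels) > len(sequence):
        raise ValueError('labels cannot be longer than the sequence')
    missing = len(sequence) - len(labels)
    padded = list(labels) + [fill] * missing
    mask = [True] * len(labels) + [False] * missing
    return padded, mask


sequence = 'GGAAAUUUGC'          # toy sequence, 10 bases
reactivity = [0.42, 1.59, 1.24]  # toy labels covering only the first 3 bases
padded, mask = pad_labels(sequence, reactivity)
```

A loss function can then be restricted to the positions where the mask is True, so the unmeasured final bases do not contribute to training.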

"},{"location":"datasets/ryos/#variations","title":"Variations","text":"

This dataset is available in two subsets:

"},{"location":"datasets/ryos/#license","title":"License","text":"

This dataset is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/ryos/#citation","title":"Citation","text":"BibTeX
@article{waymentsteele2021deep,\n  author  = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Watkins, Andrew M and Kim, Do Soon and Tunguz, Bojan and Reade, Walter and Demkin, Maggie and Romano, Jonathan and Wellington-Oguri, Roger and Nicol, John J and Gao, Jiayang and Onodera, Kazuki and Fujikawa, Kazuki and Mao, Hanfei and Vandewiele, Gilles and Tinti, Michele and Steenwinckel, Bram and Ito, Takuya and Noumi, Taiga and He, Shujun and Ishi, Keiichiro and Lee, Youhan and {\\\"O}zt{\\\"u}rk, Fatih and Chiu, Anthony and {\\\"O}zt{\\\"u}rk, Emin and Amer, Karim and Fares, Mohamed and Participants, Eterna and Das, Rhiju},\n  journal = {ArXiv},\n  month   = oct,\n  title   = {Deep learning models for predicting {RNA} degradation via dual crowdsourcing},\n  year    = 2021\n}\n
"},{"location":"models/","title":"models","text":"

models provides a collection of pre-trained models.

"},{"location":"models/#model-class","title":"Model Class","text":"

In the transformers library, the names of model classes can sometimes be misleading. While these classes support both regression and classification tasks, their names often include xxxForSequenceClassification, which may imply they are only for classification.

To avoid this ambiguity, MultiMolecule provides a set of model classes with clear, intuitive names that reflect their intended use:

Each of these models supports both regression and classification tasks, offering flexibility and precision for a wide range of applications.

"},{"location":"models/#contact-prediction","title":"Contact Prediction","text":"

Contact prediction assigns a label to each pair of tokens in a sequence. One of the most common contact prediction tasks is protein distance map prediction, which attempts to find the distance between all possible amino acid residue pairs of a three-dimensional protein structure.
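
For RNA, pairwise labels of this kind can be derived from a dot-bracket secondary structure. The following is a minimal sketch in plain Python; the function name contact_map is hypothetical, and it assumes a structure written only with the symbols `(`, `)` and `.` (no pseudoknot brackets).

```python
def contact_map(structure):
    # Hypothetical helper: build a symmetric L x L matrix of 0/1 labels
    # where entry [i][j] is 1 if bases i and j are paired.
    size = len(structure)
    contacts = [[0] * size for _ in range(size)]
    stack = []
    for j, symbol in enumerate(structure):
        if symbol == '(':
            stack.append(j)
        elif symbol == ')':
            if not stack:
                raise ValueError('unbalanced structure: unmatched closing bracket')
            i = stack.pop()
            contacts[i][j] = contacts[j][i] = 1
    if stack:
        raise ValueError('unbalanced structure: unmatched opening bracket')
    return contacts


labels = contact_map('((..))')
```

Each pair of positions receives one label, so a sequence of length L yields an L x L target.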

"},{"location":"models/#nucleotide-prediction","title":"Nucleotide Prediction","text":"

Similar to Token Classification, but removes the <bos> token and the <eos> token if they are defined in the model config.

<bos> and <eos> tokens

In tokenizers provided by MultiMolecule, the <bos> token points to the <cls> token, and the <sep> token points to the <eos> token.
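
The effect of removing these tokens can be illustrated with plain Python lists. This is only a sketch of the idea, not the library implementation: strip_special_tokens and the toy scores are hypothetical, while the ids 1 and 2 for <cls> and <eos> follow the RnaTokenizer examples in this documentation.

```python
def strip_special_tokens(values, has_bos=True, has_eos=True):
    # Hypothetical helper: drop the per-token entries that belong to the
    # leading <bos>/<cls> token and the trailing <eos>/<sep> token.
    start = 1 if has_bos else 0
    end = len(values) - 1 if has_eos else len(values)
    return values[start:end]


input_ids = [1, 6, 7, 8, 9, 2]                 # <cls> A C G U <eos>
token_scores = [0.0, 0.1, 0.2, 0.3, 0.4, 0.0]  # toy per-token outputs
nucleotide_scores = strip_special_tokens(token_scores)
```

After stripping, there is exactly one output per nucleotide, so labels can be given at the length of the raw sequence.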

"},{"location":"models/#usage","title":"Usage","text":""},{"location":"models/#build-with-multimoleculeautomodels","title":"Build with multimolecule.AutoModels","text":"Python
from transformers import AutoTokenizer\n\nfrom multimolecule import AutoModelForSequencePrediction\n\nmodel = AutoModelForSequencePrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"models/#direct-access","title":"Direct Access","text":"

All models can be directly loaded with the from_pretrained method.

Python
from multimolecule.models import RnaFmForTokenPrediction, RnaTokenizer\n\nmodel = RnaFmForTokenPrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"models/#build-with-transformersautomodels","title":"Build with transformers.AutoModels","text":"

While we use a different naming convention for model classes, the models are still registered to corresponding transformers.AutoModels.

Python
from transformers import AutoModelForSequenceClassification, AutoTokenizer\n\nimport multimolecule  # noqa: F401\n\nmodel = AutoModelForSequenceClassification.from_pretrained(\"multimolecule/mrnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/mrnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n

import multimolecule before use

Note that you must import multimolecule before building the model using transformers.AutoModel. The registration of models is done in the multimolecule package, and the models are not available in the transformers package.

The following error will be raised if you do not import multimolecule before using transformers.AutoModel:

Python
ValueError: The checkpoint you are trying to load has model type `rnafm` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.\n
"},{"location":"models/#initialize-a-vanilla-model","title":"Initialize a vanilla model","text":"

You can also initialize a vanilla model using the model class.

Python
from multimolecule.models import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n\nconfig = RnaFmConfig()\nmodel = RnaFmForTokenPrediction(config)\ntokenizer = RnaTokenizer()\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"models/#available-models","title":"Available Models","text":""},{"location":"models/#deoxyribonucleic-acid-dna","title":"DeoxyriboNucleic Acid (DNA)","text":""},{"location":"models/#ribonucleic-acid-rna","title":"RiboNucleic Acid (RNA)","text":""},{"location":"models/calm/","title":"CaLM","text":"

Pre-trained model on protein-coding DNA (cDNA) using a masked language modeling (MLM) objective.

"},{"location":"models/calm/#statement","title":"Statement","text":"

Codon language embeddings provide strong signals for use in protein engineering is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.

Machine learning has been at the forefront of the movement for free and open access to research.

We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.

The MultiMolecule team is committed to the principles of open access and open science.

We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.

Please consider signing the Statement on Nature Machine Intelligence.

"},{"location":"models/calm/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of the Codon language embeddings provide strong signals for use in protein engineering by Carlos Outeiral and Charlotte M. Deane.

The OFFICIAL repository of CaLM is at oxpig/CaLM.

Warning

The MultiMolecule team is unable to confirm that the provided model and checkpoints are producing the same intermediate representations as the original implementation. This is because

The proposed method is published in a Closed Access / Author-Fee journal.

The team releasing CaLM did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/calm/#model-details","title":"Model Details","text":"

CaLM is a BERT-style model pre-trained on a large corpus of protein-coding DNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of DNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/calm/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 12 768 12 3072 85.75 22.36 11.17 1024"},{"location":"models/calm/#links","title":"Links","text":""},{"location":"models/calm/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/calm/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/calm\")\n>>> unmasker(\"agc<mask>cattatggcgaaccttggctgctg\")\n\n[{'score': 0.011160684749484062,\n  'token': 100,\n  'token_str': 'UUN',\n  'sequence': 'AGC UUN CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.01067513320595026,\n  'token': 117,\n  'token_str': 'NGC',\n  'sequence': 'AGC NGC CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.010549729689955711,\n  'token': 127,\n  'token_str': 'NNC',\n  'sequence': 'AGC NNC CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.0103579331189394,\n  'token': 51,\n  'token_str': 'CNA',\n  'sequence': 'AGC CNA CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.010322545655071735,\n  'token': 77,\n  'token_str': 'GNC',\n  'sequence': 'AGC GNC CAU UAU GGC GAA CCU UGG CUG CUG'}]\n
"},{"location":"models/calm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/calm/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, CaLmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmModel.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/calm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, CaLmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmForSequencePrediction.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/calm/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, CaLmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmForTokenPrediction.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/calm/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, CaLmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmForContactPrediction.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/calm/#training-details","title":"Training Details","text":"

CaLM used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 25% of the tokens in the input are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
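
The masking step can be pictured with a short, self-contained sketch. It is illustrative only and is not the CaLM training code: the function mask_tokens, the fixed seed, and the choice to mask a fixed count of positions (rather than sampling each position independently) are assumptions. BERT-style procedures usually also leave some selected tokens unchanged or replace them with random tokens; that detail is omitted here.

```python
import random


def mask_tokens(tokens, mask_rate=0.25, mask_token='<mask>', seed=0):
    # Illustrative only: pick about mask_rate * len(tokens) positions at
    # random, replace them with mask_token, and keep the originals as labels.
    rng = random.Random(seed)
    num_masked = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), num_masked))
    masked = list(tokens)
    labels = {}
    for position in positions:
        labels[position] = tokens[position]
        masked[position] = mask_token
    return masked, labels


codons = ['AGC', 'UUU', 'CAU', 'UAU', 'GGC', 'GAA', 'CCU', 'UGG']
masked, labels = mask_tokens(codons)
```

The model only sees the masked list; the labels dictionary holds the original tokens it is asked to recover.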

"},{"location":"models/calm/#training-data","title":"Training Data","text":"

The CaLM model was pre-trained on coding sequences of all organisms available on the European Nucleotide Archive (ENA). The European Nucleotide Archive provides a comprehensive record of the world\u2019s nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.

CaLM collected coding sequences of all organisms from ENA in April 2022, including 114,214,475 sequences. Only high-level assembly information (dataclass CON) was used. Sequences matching the following criteria were filtered out:

To reduce redundancy, CaLM grouped the entries by organism and applied CD-HIT (CD-HIT-EST) with a cut-off at 40% sequence identity to the translated protein sequences.

The final dataset contains 9,858,385 cDNA sequences.

Note that the alphabet in the original implementation is RNA instead of DNA, so we use RnaTokenizer to tokenize the sequences. RnaTokenizer of multimolecule will convert \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False.
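
The conversion controlled by replace_T_with_U can be pictured with the following sketch, which maps T to U before splitting the sequence into codons. It is a simplified stand-alone illustration under stated assumptions, not the library code: the function name tokenize_codons is hypothetical and no vocabulary lookup or special tokens are involved.

```python
def tokenize_codons(text, replace_T_with_U=True):
    # Sketch of the preprocessing: upper-case the input, optionally map
    # T to U, then split into non-overlapping triplets (codons).
    text = text.upper()
    if replace_T_with_U:
        text = text.replace('T', 'U')
    if len(text) % 3 != 0:
        raise ValueError('sequence length must be a multiple of 3')
    return [text[i:i + 3] for i in range(0, len(text), 3)]


tokens = tokenize_codons('agcttatat')
dna_tokens = tokenize_codons('agcttatat', replace_T_with_U=False)
```

With the default setting, a DNA input is turned into RNA codons; with replace_T_with_U=False the T characters are left as they are.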

"},{"location":"models/calm/#training-procedure","title":"Training Procedure","text":""},{"location":"models/calm/#preprocessing","title":"Preprocessing","text":"

CaLM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:

"},{"location":"models/calm/#pretraining","title":"PreTraining","text":"

The model was trained on 4 NVIDIA Quadro RTX4000 GPUs with 8 GiB of memory.

"},{"location":"models/calm/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article {outeiral2022coodn,\n    author = {Outeiral, Carlos and Deane, Charlotte M.},\n    title = {Codon language embeddings provide strong signals for protein engineering},\n    elocation-id = {2022.12.15.519894},\n    year = {2022},\n    doi = {10.1101/2022.12.15.519894},\n    publisher = {Cold Spring Harbor Laboratory},\n    abstract = {Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with recent models{\\textquoteright} capacities surpassing the size of the very datasets they were trained on. Here, we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks. In some tasks, like species recognition, prediction of protein and transcript abundance, or melting point estimation, we show that a language model trained on codons outperforms every other published protein language model, including some that contain over 50 times more parameters. These results suggest that, in addition to commonly studied scale and model complexity, the information content of biological data provides an orthogonal direction to improve the power of machine learning in biology.Competing Interest StatementThe authors have declared no competing interest.},\n    URL = {https://www.biorxiv.org/content/early/2022/12/19/2022.12.15.519894},\n    eprint = {https://www.biorxiv.org/content/early/2022/12/19/2022.12.15.519894.full.pdf},\n    journal = {bioRxiv}\n}\n
"},{"location":"models/calm/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the CaLM paper for questions or comments on the paper/model.

"},{"location":"models/calm/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/calm/#multimolecule.models.calm","title":"multimolecule.models.calm","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

Name Type Description Default Alphabet | str | List[str] | None

alphabet to use for tokenization.

None int

Size of kmer to tokenize.

1 bool

Whether to tokenize into codons.

False bool

Whether to replace T with U.

True bool

Whether to convert input to uppercase.

True

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig","title":"CaLmConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a CaLmModel. It is used to instantiate a CaLM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the CaLM oxpig/CaLM architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int, default: 131)

Vocabulary size of the CaLM model. Defines the number of distinct tokens that can be represented by the input_ids passed when calling CaLmModel.

hidden_size (int, default: 768)

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, default: 12)

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, default: 12)

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, default: 3072)

Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.

hidden_dropout (float, default: 0.1)

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, default: 0.1)

The dropout ratio for the attention probabilities.

max_position_embeddings (int, default: 1026)

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, default: 0.02)

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, default: 1e-12)

The epsilon used by the layer normalization layers.

position_embedding_type (str, default: 'rotary')

Type of position embedding. Choose one of \"absolute\", \"relative_key\", \"relative_key_query\", \"rotary\". For positional embeddings use \"absolute\". For more information on \"relative_key\", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, default: False)

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, default: True)

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

emb_layer_norm_before (bool, default: False)

Whether to apply layer normalization after embeddings but before the main stem of the network.

token_dropout (bool, default: False)

When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
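
As a rough illustration of how these defaults relate to one another, the sketch below mirrors a few of them in a plain dataclass. ToyCaLmConfig and its head_dim method are hypothetical stand-ins and are not part of multimolecule; the divisibility check assumes the usual multi-head attention convention rather than anything stated in this reference.

```python
from dataclasses import dataclass


@dataclass
class ToyCaLmConfig:
    # Hypothetical stand-in that copies a few documented defaults.
    vocab_size: int = 131
    hidden_size: int = 768
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    intermediate_size: int = 3072
    max_position_embeddings: int = 1026

    def head_dim(self) -> int:
        # Assumption: the hidden size is split evenly across attention heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError('hidden_size must be divisible by num_attention_heads')
        return self.hidden_size // self.num_attention_heads


config = ToyCaLmConfig()
per_head = config.head_dim()
ffn_ratio = config.intermediate_size // config.hidden_size
print(per_head, ffn_ratio)
```

With the documented defaults, each head would cover 64 dimensions and the feed-forward layer would be four times as wide as the hidden size.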

Examples:

Python Console Session
>>> from multimolecule import CaLmModel, CaLmConfig\n>>> # Initializing a CaLM multimolecule/calm style configuration\n>>> configuration = CaLmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/calm style configuration\n>>> model = CaLmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/calm/configuration_calm.py Python
class CaLmConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`CaLmModel`][multimolecule.models.CaLmModel]. It\n    is used to instantiate a CaLM model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the CaLM\n    [oxpig/CaLM](https://github.com/oxpig/CaLM) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the CaLM model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`CaLmModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        position_embedding_type:\n            Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n            For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n            For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n        is_decoder:\n            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n        use_cache:\n            Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n            relevant if `config.is_decoder=True`.\n        emb_layer_norm_before:\n            Whether to apply layer normalization after embeddings but before the main stem of the network.\n        token_dropout:\n            When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n    Examples:\n        >>> from multimolecule import CaLmModel, CaLmConfig\n        >>> # Initializing a CaLM multimolecule/calm style configuration\n        >>> configuration = CaLmConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/calm style configuration\n        >>> model = CaLmModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"calm\"\n\n    def __init__(\n        self,\n        vocab_size: int = 131,\n        codon: bool = True,\n        hidden_size: int = 768,\n        num_hidden_layers: int = 12,\n        num_attention_heads: int = 12,\n        intermediate_size: int = 3072,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1026,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"rotary\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        emb_layer_norm_before: bool = False,\n        token_dropout: bool = False,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.codon = codon\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = 
hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.emb_layer_norm_before = emb_layer_norm_before\n        self.token_dropout = token_dropout\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
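
The last two assignments in the constructor turn optional mappings into nested configuration objects. A minimal sketch of that pattern follows; ToyHeadConfig, its fields, and build_head are hypothetical names chosen for illustration, and the real HeadConfig may accept different keys.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyHeadConfig:
    # Hypothetical fields; they are not taken from the real HeadConfig.
    num_labels: int = 1
    dropout: float = 0.0


def build_head(head: Optional[dict]) -> Optional[ToyHeadConfig]:
    # Same shape as the constructor logic: unpack the mapping when it is given,
    # otherwise keep None so that later code can fall back to its own defaults.
    return ToyHeadConfig(**head) if head is not None else None


print(build_head(None))
print(build_head({'num_labels': 3}))
```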
"},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(position_embedding_type)","title":"position_embedding_type","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(is_decoder)","title":"is_decoder","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(use_cache)","title":"use_cache","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(emb_layer_norm_before)","title":"emb_layer_norm_before","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(token_dropout)","title":"token_dropout","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmForContactPrediction","title":"CaLmForContactPrediction","text":"

Bases: CaLmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import CaLmConfig, CaLmForContactPrediction, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py Python
class CaLmForContactPrediction(CaLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import CaLmConfig, CaLmForContactPrediction, RnaTokenizer\n        >>> config = CaLmConfig()\n        >>> model = CaLmForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: CaLmConfig):\n        super().__init__(config)\n        self.calm = CaLmModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.calm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
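
The tail of forward packs its results either as a tuple or as an output object, depending on return_dict. The sketch below reproduces only that packing step with plain Python values; pack_outputs is a hypothetical helper, and a dict stands in for ContactPredictorOutput.

```python
def pack_outputs(logits, loss, backbone_outputs, return_dict):
    # backbone_outputs is assumed to be laid out as
    # (sequence_output, pooled_output, *extras), as in the tuple branch of CaLmModel.forward.
    if not return_dict:
        # Skip the first two backbone entries and keep the extras.
        output = (logits,) + tuple(backbone_outputs[2:])
        # The loss is prepended only when labels were given.
        return ((loss,) + output) if loss is not None else output
    # A dict is used here in place of the real output class.
    return {'loss': loss, 'logits': logits}


backbone = ('sequence_output', 'pooled_output', 'hidden_states')
print(pack_outputs('logits', None, backbone, return_dict=False))
print(pack_outputs('logits', 'loss', backbone, return_dict=False))
```

Callers that index the tuple form therefore need to account for the position shift when a loss is present.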
"},{"location":"models/calm/#multimolecule.models.calm.CaLmForMaskedLM","title":"CaLmForMaskedLM","text":"

Bases: CaLmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import CaLmConfig, CaLmForMaskedLM, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 131])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
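
The sequence dimension of 7 for a 5-nucleotide input is consistent with the tokenizer adding two special tokens. That is an inference from the shapes shown here, not a documented guarantee. The toy tokenizer below is hypothetical, including its token names, and only illustrates the shape arithmetic.

```python
def toy_tokenize(sequence, add_special_tokens=True):
    # Hypothetical tokenizer: one token per nucleotide, optionally wrapped
    # in one leading and one trailing special token (names assumed).
    tokens = list(sequence)
    if add_special_tokens:
        tokens = ['<cls>'] + tokens + ['<eos>']
    return tokens


tokens = toy_tokenize('ACGUN')
vocab_size = 131  # default vocab_size of CaLmConfig
logits_shape = (1, len(tokens), vocab_size)
print(logits_shape)
```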
Source code in multimolecule/models/calm/modeling_calm.py Python
class CaLmForMaskedLM(CaLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import CaLmConfig, CaLmForMaskedLM, RnaTokenizer\n        >>> config = CaLmConfig()\n        >>> model = CaLmForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 131])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: CaLmConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `CaLmForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.calm = CaLmModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config, self.calm.embeddings.word_embeddings.weight)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.calm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmForSequencePrediction","title":"CaLmForSequencePrediction","text":"

Bases: CaLmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import CaLmConfig, CaLmForSequencePrediction, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py Python
class CaLmForSequencePrediction(CaLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import CaLmConfig, CaLmForSequencePrediction, RnaTokenizer\n        >>> config = CaLmConfig()\n        >>> model = CaLmForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: CaLmConfig):\n        super().__init__(config)\n        self.calm = CaLmModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.calm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmForTokenPrediction","title":"CaLmForTokenPrediction","text":"

Bases: CaLmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import CaLmConfig, CaLmForTokenPrediction, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py Python
class CaLmForTokenPrediction(CaLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import CaLmConfig, CaLmForTokenPrediction, RnaTokenizer\n        >>> config = CaLmConfig()\n        >>> model = CaLmForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: CaLmConfig):\n        super().__init__(config)\n        self.calm = CaLmModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.calm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmModel","title":"CaLmModel","text":"

Bases: CaLmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import CaLmConfig, CaLmModel, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/calm/modeling_calm.py Python
class CaLmModel(CaLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import CaLmConfig, CaLmModel, RnaTokenizer\n        >>> config = CaLmConfig()\n        >>> model = CaLmModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 768])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 768])\n    \"\"\"\n\n    def __init__(self, config: CaLmConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = CaLmEmbeddings(config)\n        self.encoder = CaLmEncoder(config)\n        self.pooler = CaLmPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        
else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape 
[num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
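
Two small decisions in forward are easy to miss: how use_cache is resolved, and how a missing attention_mask is filled in. The sketch below restates both with plain Python lists instead of tensors. The function names are hypothetical, and the mask uses integers where the real code produces a boolean tensor from input_ids.ne(pad_token_id).

```python
def resolve_use_cache(use_cache, is_decoder, config_use_cache):
    # Caching is only honoured for decoders; encoders always disable it.
    if is_decoder:
        return use_cache if use_cache is not None else config_use_cache
    return False


def default_attention_mask(input_ids, pad_token_id, past_key_values_length=0):
    # With a known pad token, every non-pad position is attended to.
    if pad_token_id is not None:
        return [[int(token != pad_token_id) for token in row] for row in input_ids]
    # Otherwise attend to everything, including cached positions.
    return [[1] * (len(row) + past_key_values_length) for row in input_ids]


print(resolve_use_cache(True, is_decoder=False, config_use_cache=True))
print(default_attention_mask([[5, 6, 0]], pad_token_id=0))
```

Note that, as in the source, the pad-token branch does not extend the mask by the cached length.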
"},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

encoder_hidden_states (Tensor | None, default: None)

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask (Tensor | None, default: None)

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values are selected in [0, 1]:

1 for tokens that are not masked,

0 for tokens that are masked.

past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default: None)

Tuple of length config.num_hidden_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head).

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

use_cache (bool | None, default: None)

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/calm/modeling_calm.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        attention_mask=attention_mask,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
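The fallback that forward applies when attention_mask is omitted can be illustrated without PyTorch. The sketch below is only an approximation: it uses nested lists in place of tensors, the helper name default_attention_mask is hypothetical, and it ignores the past_key_values_length offset that the real code adds when no pad token is configured.

```python
def default_attention_mask(input_ids, pad_token_id=0):
    # Plain-list stand-in for input_ids.ne(pad_token_id):
    # 1 marks a real token, 0 marks padding.
    if pad_token_id is None:
        # Without a pad token, every position is attended to.
        return [[1] * len(row) for row in input_ids]
    return [[int(token != pad_token_id) for token in row] for row in input_ids]


# Two tokenized sequences; the second is padded with pad_token_id = 0.
batch = [[1, 6, 7, 8, 9, 2], [1, 6, 7, 2, 0, 0]]
mask = default_attention_mask(batch)
```

Passing an explicit attention_mask skips this fallback entirely, which is the safer choice whenever the pad token id could also occur as a legitimate token.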
"},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmPreTrainedModel","title":"CaLmPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/calm/modeling_calm.py Python
class CaLmPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = CaLmConfig\n    base_model_prefix = \"calm\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"CaLmLayer\", \"CaLmEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/configuration_utils/","title":"configuration_utils","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils","title":"multimolecule.models.configuration_utils","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig","title":"HeadConfig","text":"

Bases: BaseHeadConfig

Configuration class for a prediction head.

Parameters:

num_labels (Optional[int], default: None)

Number of labels to use in the last layer added to the model, typically for a classification task.

If None, the head should look for Config.num_labels.

problem_type (Optional[str], default: None)

Problem type for XxxForYyyPrediction models. Can be one of \"binary\", \"regression\", \"multiclass\" or \"multilabel\".

If None, the head should look for Config.problem_type.

hidden_size (Optional[int], default: None)

Dimensionality of the encoder layers and the pooler layer.

If None, the head should look for Config.hidden_size.

dropout (float, default: 0.0)

The dropout ratio for the hidden states.

transform (Optional[str], default: None)

The transform operation applied to hidden states.

transform_act (Optional[str], default: \"gelu\")

The activation function of the transform applied to hidden states.

bias (bool, default: True)

Whether to apply bias to the final prediction layer.

act (Optional[str], default: None)

The activation function of the final prediction output.

layer_norm_eps (float, default: 1e-12)

The epsilon used by the layer normalization layers.

output_name (Optional[str], default: None)

The name of the tensor required in model outputs.

If None, the default output name of the corresponding head is used.

type (Optional[str], default: None)

The type of the head in the model.

This is used by MultiMoleculeModel to construct heads.

Source code in multimolecule/module/heads/config.py Python
class HeadConfig(BaseHeadConfig):\n    r\"\"\"\n    Configuration class for a prediction head.\n\n    Args:\n        num_labels:\n            Number of labels to use in the last layer added to the model, typically for a classification task.\n\n            Head should look for [`Config.num_labels`][multimolecule.PreTrainedConfig] if is `None`.\n        problem_type:\n            Problem type for `XxxForYyyPrediction` models. Can be one of `\"binary\"`, `\"regression\"`,\n            `\"multiclass\"` or `\"multilabel\"`.\n\n            Head should look for [`Config.problem_type`][multimolecule.PreTrainedConfig] if is `None`.\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n\n            Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n        dropout:\n            The dropout ratio for the hidden states.\n        transform:\n            The transform operation applied to hidden states.\n        transform_act:\n            The activation function of transform applied to hidden states.\n        bias:\n            Whether to apply bias to the final prediction layer.\n        act:\n            The activation function of the final prediction output.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        output_name:\n            The name of the tensor required in model outputs.\n\n            If is `None`, will use the default output name of the corresponding head.\n        type:\n            The type of the head in the model.\n\n            This is used by [`MultiMoleculeModel`][multimolecule.MultiMoleculeModel] to construct heads.\n    \"\"\"\n\n    num_labels: Optional[int] = None\n    problem_type: Optional[str] = None\n    hidden_size: Optional[int] = None\n    dropout: float = 0.0\n    transform: Optional[str] = None\n    transform_act: Optional[str] = \"gelu\"\n    bias: bool = True\n    act: Optional[str] = None\n    layer_norm_eps: 
float = 1e-12\n    output_name: Optional[str] = None\n    type: Optional[str] = None\n
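The fallback rule described for num_labels, problem_type and hidden_size can be sketched as follows. This is a minimal illustration, not the library's code: SketchHeadConfig, SketchModelConfig and resolve_head_config are hypothetical names, and the actual heads may resolve these values elsewhere.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class SketchHeadConfig:  # hypothetical stand-in for HeadConfig
    num_labels: Optional[int] = None
    problem_type: Optional[str] = None
    hidden_size: Optional[int] = None


@dataclass
class SketchModelConfig:  # hypothetical stand-in for the model-level Config
    num_labels: int = 1
    problem_type: Optional[str] = None
    hidden_size: int = 768


def resolve_head_config(head, config):
    # Fill every field left as None on the head from the model-level config.
    updates = {}
    for name in ('num_labels', 'problem_type', 'hidden_size'):
        if getattr(head, name) is None:
            updates[name] = getattr(config, name)
    return replace(head, **updates)


resolved = resolve_head_config(
    SketchHeadConfig(num_labels=3),
    SketchModelConfig(hidden_size=640),
)
```

Values set explicitly on the head win, so a single backbone config can serve several heads with different label counts.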
"},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(num_labels)","title":"num_labels","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(problem_type)","title":"problem_type","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(dropout)","title":"dropout","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(transform)","title":"transform","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(transform_act)","title":"transform_act","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(bias)","title":"bias","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(act)","title":"act","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(output_name)","title":"output_name","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(type)","title":"type","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig","title":"MaskedLMHeadConfig","text":"

Bases: BaseHeadConfig

Configuration class for a Masked Language Modeling head.

Parameters:

hidden_size (Optional[int], default: None)

Dimensionality of the encoder layers and the pooler layer.

If None, the head should look for Config.hidden_size.

dropout (float, default: 0.0)

The dropout ratio for the hidden states.

transform (Optional[str], default: \"nonlinear\")

The transform operation applied to hidden states.

transform_act (Optional[str], default: \"gelu\")

The activation function of the transform applied to hidden states.

bias (bool, default: True)

Whether to apply bias to the final prediction layer.

act (Optional[str], default: None)

The activation function of the final prediction output.

layer_norm_eps (float, default: 1e-12)

The epsilon used by the layer normalization layers.

output_name (Optional[str], default: None)

The name of the tensor required in model outputs.

If None, the default output name of the corresponding head is used.

Source code in multimolecule/module/heads/config.py Python
class MaskedLMHeadConfig(BaseHeadConfig):\n    r\"\"\"\n    Configuration class for a Masked Language Modeling head.\n\n    Args:\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n\n            Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n        dropout:\n            The dropout ratio for the hidden states.\n        transform:\n            The transform operation applied to hidden states.\n        transform_act:\n            The activation function of transform applied to hidden states.\n        bias:\n            Whether to apply bias to the final prediction layer.\n        act:\n            The activation function of the final prediction output.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        output_name:\n            The name of the tensor required in model outputs.\n\n            If is `None`, will use the default output name of the corresponding head.\n    \"\"\"\n\n    hidden_size: Optional[int] = None\n    dropout: float = 0.0\n    transform: Optional[str] = \"nonlinear\"\n    transform_act: Optional[str] = \"gelu\"\n    bias: bool = True\n    act: Optional[str] = None\n    layer_norm_eps: float = 1e-12\n    output_name: Optional[str] = None\n
"},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(dropout)","title":"dropout","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(transform)","title":"transform","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(transform_act)","title":"transform_act","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(bias)","title":"bias","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(act)","title":"act","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(output_name)","title":"output_name","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.PreTrainedConfig","title":"PreTrainedConfig","text":"

Bases: PretrainedConfig

Base class for all model configuration classes.

Source code in multimolecule/models/configuration_utils.py Python
class PreTrainedConfig(PretrainedConfig):\n    r\"\"\"\n    Base class for all model configuration classes.\n    \"\"\"\n\n    head: HeadConfig | None\n    num_labels: int = 1\n\n    hidden_size: int\n\n    pad_token_id: int = 0\n    bos_token_id: int = 1\n    eos_token_id: int = 2\n    unk_token_id: int = 3\n    mask_token_id: int = 4\n    null_token_id: int = 5\n\n    def __init__(\n        self,\n        pad_token_id: int = 0,\n        bos_token_id: int = 1,\n        eos_token_id: int = 2,\n        unk_token_id: int = 3,\n        mask_token_id: int = 4,\n        null_token_id: int = 5,\n        num_labels: int = 1,\n        **kwargs,\n    ):\n        super().__init__(\n            pad_token_id=pad_token_id,\n            bos_token_id=bos_token_id,\n            eos_token_id=eos_token_id,\n            unk_token_id=unk_token_id,\n            mask_token_id=mask_token_id,\n            null_token_id=null_token_id,\n            num_labels=num_labels,\n            **kwargs,\n        )\n\n    def to_dict(self):\n        output = super().to_dict()\n        for k, v in output.items():\n            if hasattr(v, \"to_dict\"):\n                output[k] = v.to_dict()\n            if is_dataclass(v):\n                output[k] = asdict(v)\n        return output\n
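The to_dict override exists so that nested configuration objects, such as a head config stored as a dataclass, serialize to plain dictionaries. The sketch below follows the same idea on an ordinary dict; SketchHead and flatten_config are hypothetical names, and the real method operates on the output of PretrainedConfig.to_dict rather than on an arbitrary mapping.

```python
from dataclasses import asdict, dataclass, is_dataclass
from typing import Optional


@dataclass
class SketchHead:  # hypothetical nested head configuration
    num_labels: Optional[int] = None
    dropout: float = 0.0


def flatten_config(values):
    # Expand nested config objects into plain dictionaries so the
    # result can be written out as JSON.
    output = dict(values)
    for key, value in output.items():
        if is_dataclass(value) and not isinstance(value, type):
            # Dataclass instances are converted field by field.
            output[key] = asdict(value)
        elif hasattr(value, 'to_dict'):
            # Objects exposing their own to_dict handle the conversion.
            output[key] = value.to_dict()
    return output


flat = flatten_config({'hidden_size': 768, 'head': SketchHead(num_labels=2)})
```

Only existing keys are reassigned inside the loop, so iterating over the dictionary while updating it is safe here.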
"},{"location":"models/ernierna/","title":"ERNIE-RNA","text":"

Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.

"},{"location":"models/ernierna/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations by Weijie Yin, Zhaoyu Zhang, Liang He, et al.

The OFFICIAL repository of ERNIE-RNA is at Bruce-ywj/ERNIE-RNA.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

The team releasing ERNIE-RNA did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/ernierna/#model-details","title":"Model Details","text":"

ERNIE-RNA is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/ernierna/#variations","title":"Variations","text":""},{"location":"models/ernierna/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 12 768 12 3072 85.67 22.36 11.17 1024"},{"location":"models/ernierna/#links","title":"Links","text":""},{"location":"models/ernierna/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/ernierna/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/ernierna\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.32839149236679077,\n  'token': 6,\n  'token_str': 'A',\n  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.3044775426387787,\n  'token': 9,\n  'token_str': 'U',\n  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.09914574027061462,\n  'token': 7,\n  'token_str': 'C',\n  'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.09502048045396805,\n  'token': 24,\n  'token_str': '-',\n  'sequence': 'G G U C - C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.06993662565946579,\n  'token': 21,\n  'token_str': '.',\n  'sequence': 'G G U C. C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/ernierna/#downstream-use","title":"Downstream Use","text":""},{"location":"models/ernierna/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, ErnieRnaModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaModel.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/ernierna/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, ErnieRnaForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaForSequencePrediction.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/ernierna/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, ErnieRnaForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaForTokenPrediction.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/ernierna/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, ErnieRnaForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaForContactPrediction.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/ernierna/#training-details","title":"Training Details","text":"

ERNIE-RNA used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.

"},{"location":"models/ernierna/#training-data","title":"Training Data","text":"

The ERNIE-RNA model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.

ERNIE-RNA applied CD-HIT (CD-HIT-EST) with a cut-off at 100% sequence identity to remove redundancy from RNAcentral, resulting in 25 million unique sequences. Sequences longer than 1024 nucleotides were subsequently excluded. The final dataset contains 20.4 million non-redundant RNA sequences. ERNIE-RNA preprocessed all tokens by replacing \u201cT\u201ds with \u201cU\u201ds.
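The two filtering steps can be approximated in a few lines. This is a simplified sketch and not the pipeline the ERNIE-RNA authors ran: exact-string deduplication stands in for CD-HIT-EST clustering at 100% identity (which can also absorb sequences fully contained in a longer representative), and filter_sequences is a hypothetical helper name.

```python
def filter_sequences(sequences, max_length=1024):
    # Keep the first copy of each exact sequence, then drop anything
    # longer than max_length nucleotides.
    seen = set()
    kept = []
    for sequence in sequences:
        if sequence in seen:
            continue
        seen.add(sequence)
        if len(sequence) <= max_length:
            kept.append(sequence)
    return kept


kept = filter_sequences(['ACGU', 'ACGU', 'GGCC', 'A' * 1025])
```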

Note that RnaTokenizer will convert \u201cT\u201ds to \u201cU\u201ds for you. You can disable this behaviour by passing replace_T_with_U=False.
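The effect of this normalization on the raw string can be pictured as below. The sketch is an assumption about the observable behaviour only, not RnaTokenizer's implementation; normalize_rna is a hypothetical helper whose keyword names simply mirror the tokenizer options replace_T_with_U and do_upper_case.

```python
def normalize_rna(sequence, replace_T_with_U=True, do_upper_case=True):
    # Optionally uppercase the input, then map thymine to uracil.
    if do_upper_case:
        sequence = sequence.upper()
    if replace_T_with_U:
        # Handle both cases so the mapping also works without uppercasing.
        sequence = sequence.replace('T', 'U').replace('t', 'u')
    return sequence


converted = normalize_rna('acgt')
untouched = normalize_rna('acgt', replace_T_with_U=False)
```

With the replacement disabled, T is left in the sequence and, under the standard RNA alphabet, is tokenized as an unknown token, as the RnaTokenizer examples show.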

"},{"location":"models/ernierna/#training-procedure","title":"Training Procedure","text":""},{"location":"models/ernierna/#preprocessing","title":"Preprocessing","text":"

ERNIE-RNA used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT, with 15% of the input tokens selected for masking.
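A BERT-style masking step can be sketched as follows. The 80/10/10 split between the mask token, a random replacement and the unchanged token is BERT's published scheme and is assumed here; this section does not state ERNIE-RNA's exact proportions. The function mask_tokens, its parameters and the per-token sampling are illustrative choices, not the original training code.

```python
import random


def mask_tokens(tokens, vocabulary, mask_token='<mask>', mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None means the position is not scored
    for index, token in enumerate(tokens):
        # Select each position independently with probability mask_rate.
        if rng.random() >= mask_rate:
            continue
        labels[index] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            # Assumed 80% of selected positions: replace with the mask token.
            masked[index] = mask_token
        elif roll < 0.9:
            # Assumed 10%: replace with a different random token.
            masked[index] = rng.choice([t for t in vocabulary if t != token])
        # Remaining 10%: leave the token unchanged.
    return masked, labels


original = list('UAGCUUAUCAGACUGAUGUUGA')
masked, labels = mask_tokens(original, vocabulary='ACGU')
```

Sampling each position independently gives 15% masked tokens only in expectation; a batch-level implementation might instead pick an exact count per sequence.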

"},{"location":"models/ernierna/#pretraining","title":"PreTraining","text":"

The model was trained on 24 NVIDIA V100 GPUs, each with 32 GiB of memory.

"},{"location":"models/ernierna/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article {Yin2024.03.17.585376,\n    author = {Yin, Weijie and Zhang, Zhaoyu and He, Liang and Jiang, Rui and Zhang, Shuo and Liu, Gan and Zhang, Xuegong and Qin, Tao and Xie, Zhen},\n    title = {ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations},\n    elocation-id = {2024.03.17.585376},\n    year = {2024},\n    doi = {10.1101/2024.03.17.585376},\n    publisher = {Cold Spring Harbor Laboratory},\n    abstract = {With large amounts of unlabeled RNA sequences data produced by high-throughput sequencing technologies, pre-trained RNA language models have been developed to estimate semantic space of RNA molecules, which facilities the understanding of grammar of RNA language. However, existing RNA language models overlook the impact of structure when modeling the RNA semantic space, resulting in incomplete feature extraction and suboptimal performance across various downstream tasks. In this study, we developed a RNA pre-trained language model named ERNIE-RNA (Enhanced Representations with base-pairing restriction for RNA modeling) based on a modified BERT (Bidirectional Encoder Representations from Transformers) by incorporating base-pairing restriction with no MSA (Multiple Sequence Alignment) information. We found that the attention maps from ERNIE-RNA with no fine-tuning are able to capture RNA structure in the zero-shot experiment more precisely than conventional methods such as fine-tuned RNAfold and RNAstructure, suggesting that the ERNIE-RNA can provide comprehensive RNA structural representations. Furthermore, ERNIE-RNA achieved SOTA (state-of-the-art) performance after fine-tuning for various downstream tasks, including RNA structural and functional predictions. In summary, our ERNIE-RNA model provides general features which can be widely and effectively applied in various subsequent research tasks. 
Our results indicate that introducing key knowledge-based prior information in the BERT framework may be a useful strategy to enhance the performance of other language models.Competing Interest StatementOne patent based on the study was submitted by Z.X. and W.Y., which is entitled as \"A Pre-training Approach for RNA Sequences and Its Applications\"(application number, no 202410262527.5). The remaining authors declare no competing interests.},\n    URL = {https://www.biorxiv.org/content/early/2024/03/17/2024.03.17.585376},\n    eprint = {https://www.biorxiv.org/content/early/2024/03/17/2024.03.17.585376.full.pdf},\n    journal = {bioRxiv}\n}\n
"},{"location":"models/ernierna/#contact","title":"Contact","text":"

Please use the MultiMolecule GitHub issues for any questions or comments on the model card.

Please contact the authors of the ERNIE-RNA paper for questions or comments on the paper/model.

"},{"location":"models/ernierna/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna","title":"multimolecule.models.ernierna","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

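alphabet (Alphabet | str | List[str] | None, default: None)

Alphabet to use for tokenization. If None, the standard RNA alphabet is used. If a string, it should be the name of a predefined alphabet (standard, extended, streamline, or nucleobase). If an alphabet or a list of characters, that specific alphabet is used.

nmers (int, default: 1)

Size of kmer to tokenize.

codon (bool, default: False)

Whether to tokenize into codons.

replace_T_with_U (bool, default: True)

Whether to replace T with U.

do_upper_case (bool, default: True)

Whether to convert input to uppercase.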

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
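The difference between nmers=3 and codon=True in the session above comes down to how the sequence is cut before ids are looked up: overlapping windows with stride 1 versus non-overlapping triplets. The sketch below reproduces only that splitting step; split_kmers is a hypothetical helper, and the mapping from k-mers to ids is left to the tokenizer.

```python
def split_kmers(sequence, nmers=1, codon=False):
    if codon:
        # Codons are non-overlapping triplets, so the length must divide by 3.
        if len(sequence) % 3 != 0:
            raise ValueError(
                'length of input sequence must be a multiple of 3 '
                f'for codon tokenization, but got {len(sequence)}'
            )
        return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    # Otherwise slide a window of size nmers one nucleotide at a time.
    return [sequence[i:i + nmers] for i in range(len(sequence) - nmers + 1)]


kmers = split_kmers('UAGCUUAUC', nmers=3)
codons = split_kmers('UAGCUUAUC', codon=True)
```

This matches the token counts in the examples: nine nucleotides give seven overlapping 3-mers but only three codons, and the codons UAG, CUU and AUC are the first, fourth and seventh of the 3-mers, which is why ids 83, 49 and 22 appear in both outputs.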
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
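
The three splitting modes in `_tokenize` above (single characters, overlapping k-mers, and non-overlapping codons) can be sketched in plain Python. This is a minimal illustration that mirrors the listed logic without the vocabulary lookup; `split_sequence` is a hypothetical helper name, not part of MultiMolecule.

```python
# Hypothetical stand-alone helper mirroring the splitting logic shown above.
# It returns string tokens only; mapping tokens to ids is not reproduced here.
def split_sequence(text, nmers=1, codon=False, replace_T_with_U=True, do_upper_case=True):
    if do_upper_case:
        text = text.upper()
    if replace_T_with_U:
        text = text.replace('T', 'U')
    if codon:
        # Codons are non-overlapping windows of 3, so the length must divide evenly.
        if len(text) % 3 != 0:
            raise ValueError(
                f'length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}'
            )
        return [text[i:i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # k-mers are overlapping windows that advance one character at a time.
        return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


print(split_sequence('uagcuuauc', nmers=3))
print(split_sequence('uagcuuauc', codon=True))
```

A 9-nucleotide input yields 7 overlapping 3-mers but only 3 codons, which is consistent with the 7 and 3 non-special ids in the console session above.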
"},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig","title":"ErnieRnaConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of an ErnieRnaModel. It is used to instantiate an ErnieRna model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the Bruce-ywj/ERNIE-RNA architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int): Vocabulary size of the ErnieRna model. Defines the number of different tokens that can be represented by the input_ids passed when calling ErnieRnaModel. Default: 26

hidden_size (int): Dimensionality of the encoder layers and the pooler layer. Default: 768

num_hidden_layers (int): Number of hidden layers in the Transformer encoder. Default: 12

num_attention_heads (int): Number of attention heads for each attention layer in the Transformer encoder. Default: 12

intermediate_size (int): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder. Default: 3072

hidden_dropout (float): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. Default: 0.1

attention_dropout (float): The dropout ratio for the attention probabilities. Default: 0.1

max_position_embeddings (int): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Default: 1026

initializer_range (float): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. Default: 0.02

layer_norm_eps (float): The epsilon used by the layer normalization layers. Default: 1e-12

Examples:

Python Console Session
>>> from multimolecule import ErnieRnaModel, ErnieRnaConfig\n>>> # Initializing a ERNIE-RNA multimolecule/ernierna style configuration\n>>> configuration = ErnieRnaConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/ernierna style configuration\n>>> model = ErnieRnaModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/ernierna/configuration_ernierna.py Python
class ErnieRnaConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a\n    [`ErnieRnaModel`][multimolecule.models.ErnieRnaModel]. It is used to instantiate a ErnieRna model according to the\n    specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a\n    similar configuration to that of the ErnieRna [Bruce-ywj/ERNIE-RNA](https://github.com/Bruce-ywj/ERNIE-RNA)\n    architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the ErnieRna model. Defines the number of different tokens that can be represented by\n            the `inputs_ids` passed when calling [`ErnieRnaModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n\n    Examples:\n        >>> from multimolecule import ErnieRnaModel, ErnieRnaConfig\n        >>> # Initializing a ERNIE-RNA multimolecule/ernierna style configuration\n        >>> configuration = ErnieRnaConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/ernierna style configuration\n        >>> model = ErnieRnaModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"ernierna\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        hidden_size: int = 768,\n        num_hidden_layers: int = 12,\n        num_attention_heads: int = 12,\n        intermediate_size: int = 3072,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1026,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"sinusoidal\",\n        pairwise_alpha: float = 0.8,\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.type_vocab_size = 2\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = hidden_dropout\n        
self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.pairwise_alpha = pairwise_alpha\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForContactClassification","title":"ErnieRnaForContactClassification","text":"

Bases: ErnieRnaForPreTraining

Examples:

Python Console Session
>>> from multimolecule.models import ErnieRnaConfig, ErnieRnaForContactClassification, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForContactClassification(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaForContactClassification(ErnieRnaForPreTraining):\n    \"\"\"\n    Examples:\n        >>> from multimolecule.models import ErnieRnaConfig, ErnieRnaForContactClassification, RnaTokenizer\n        >>> config = ErnieRnaConfig()\n        >>> model = ErnieRnaForContactClassification(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n    \"\"\"\n\n    def __init__(self, config: ErnieRnaConfig):\n        super().__init__(config)\n        self.ss_head = ErnieRnaContactClassificationHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(  # type: ignore[override]  # pylint: disable=W0221\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels_lm: Tensor | None = None,\n        labels_ss: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_attention_biases: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ErnieRnaForContactClassificationOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        outputs = self.ernierna(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_attention_biases=output_attention_biases,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output_lm = self.lm_head(outputs, labels_lm)\n        output_ss = self.ss_head(outputs[-1][-1], attention_mask, input_ids, labels_ss)\n        logits_lm, loss_lm = output_lm.logits, output_lm.loss\n        logits_ss, loss_ss = output_ss.logits, output_ss.loss\n\n        loss = None\n        if loss_lm is not None and loss_ss is not None:\n            loss = loss_lm + loss_ss\n        elif loss_lm is not None:\n            loss = loss_lm\n        elif loss_ss is not None:\n            loss = loss_ss\n\n        if not return_dict:\n            output = outputs[2:]\n            output = ((logits_ss, loss_ss) + output) if loss_ss is not None else ((logits_ss,) + output)\n            output = ((logits_lm, loss_lm) + output) if loss_lm is not None else ((logits_lm,) + output)\n            return ((loss,) + output) if loss is not None else output\n\n        return ErnieRnaForContactClassificationOutput(\n            loss=loss,\n            logits_lm=logits_lm,\n            loss_lm=loss_lm,\n            logits_ss=logits_ss,\n            loss_ss=loss_ss,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n            attention_biases=outputs.attention_biases,\n        )\n
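
The `forward` method above sums the masked-language-model loss and the secondary-structure loss when both are present, and falls back to whichever one exists. A minimal sketch of that rule, using plain floats in place of tensors; `combine_losses` is a hypothetical name introduced only for illustration.

```python
# Hypothetical helper: combine two optional losses the way the listing above does.
# Plain floats stand in for the loss tensors.
def combine_losses(loss_lm=None, loss_ss=None):
    present = [loss for loss in (loss_lm, loss_ss) if loss is not None]
    if not present:
        # No labels were supplied for either head, so there is no loss to report.
        return None
    return sum(present)


print(combine_losses(0.5, 0.25))
print(combine_losses(None, 0.25))
print(combine_losses())
```

Passing only `labels_lm` or only `labels_ss` therefore trains a single head, while passing both trains them jointly with equal weight.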
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForContactPrediction","title":"ErnieRnaForContactPrediction","text":"

Bases: ErnieRnaPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import ErnieRnaConfig, ErnieRnaForContactPrediction, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaForContactPrediction(ErnieRnaPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import ErnieRnaConfig, ErnieRnaForContactPrediction, RnaTokenizer\n        >>> config = ErnieRnaConfig()\n        >>> model = ErnieRnaForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: ErnieRnaConfig):\n        super().__init__(config)\n        self.ernierna = ErnieRnaModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ErnieRnaContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.ernierna(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ErnieRnaContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
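
When `return_dict` is false, the `forward` method above returns a plain tuple instead of an output object: the loss comes first when labels were given, then the logits, then everything after the first two entries of the backbone output. The sketch below reproduces only that assembly step with placeholder strings; `assemble_tuple_output` is a hypothetical name, and treating the two dropped entries as the last hidden state and the pooled output is an assumption.

```python
# Hypothetical helper reproducing the tuple layout built when return_dict is false.
def assemble_tuple_output(logits, backbone_outputs, loss=None):
    # Skip the first two backbone entries (assumed: last hidden state, pooled output).
    output = (logits,) + tuple(backbone_outputs[2:])
    # The loss is prepended only when labels were provided.
    return ((loss,) + output) if loss is not None else output


backbone = ('last_hidden_state', 'pooler_output', 'hidden_states', 'attentions')
print(assemble_tuple_output('logits', backbone))
print(assemble_tuple_output('logits', backbone, loss=0.1))
```

Because the position of the logits shifts by one depending on whether a loss is present, indexing by name on the dict-style output is usually less error-prone than indexing the tuple.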
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForMaskedLM","title":"ErnieRnaForMaskedLM","text":"

Bases: ErnieRnaPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import ErnieRnaConfig, ErnieRnaForMaskedLM, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaForMaskedLM(ErnieRnaPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import ErnieRnaConfig, ErnieRnaForMaskedLM, RnaTokenizer\n        >>> config = ErnieRnaConfig()\n        >>> model = ErnieRnaForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: ErnieRnaConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `BertForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.ernierna = ErnieRnaModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.lm_head.decoder = new_embeddings\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_attention_biases: bool | None = None,\n        output_hidden_states: bool | None = None,\n   
     return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | ErnieRnaForMaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.ernierna(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_attention_biases=output_attention_biases,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ErnieRnaForMaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForSequencePrediction","title":"ErnieRnaForSequencePrediction","text":"

Bases: ErnieRnaPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import ErnieRnaConfig, ErnieRnaForSequencePrediction, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaForSequencePrediction(ErnieRnaPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import ErnieRnaConfig, ErnieRnaForSequencePrediction, RnaTokenizer\n        >>> config = ErnieRnaConfig()\n        >>> model = ErnieRnaForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n    \"\"\"\n\n    def __init__(self, config: ErnieRnaConfig):\n        super().__init__(config)\n        self.ernierna = ErnieRnaModel(config)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_attention_biases: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ErnieRnaSequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.ernierna(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_attention_biases=output_attention_biases,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ErnieRnaSequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForTokenPrediction","title":"ErnieRnaForTokenPrediction","text":"

Bases: ErnieRnaPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import ErnieRnaConfig, ErnieRnaForTokenPrediction, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaForTokenPrediction(ErnieRnaPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import ErnieRnaConfig, ErnieRnaForTokenPrediction, RnaTokenizer\n        >>> config = ErnieRnaConfig()\n        >>> model = ErnieRnaForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: ErnieRnaConfig):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.ernierna = ErnieRnaModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_attention_biases: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ErnieRnaTokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.ernierna(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_attention_biases=output_attention_biases,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ErnieRnaTokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel","title":"ErnieRnaModel","text":"

Bases: ErnieRnaPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import ErnieRnaConfig, ErnieRnaModel, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
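
`ErnieRnaModel.get_pairwise_bias`, shown in the source listing for this class, builds a `(batch, seq_len, seq_len)` tensor by looking up `pairwise_bias_map` at every pair of token ids. The same lookup can be written with nested lists, as in the sketch below. This is a pure-Python illustration rather than the tensor implementation; `pairwise_bias` is a hypothetical name, and the toy 3-token map and its values are invented for the example.

```python
# Hypothetical pure-Python equivalent of the indexed lookup in get_pairwise_bias:
# result[b][i][j] = bias_map[input_ids[b][i]][input_ids[b][j]]
def pairwise_bias(input_ids, bias_map):
    return [[[bias_map[x][y] for y in seq] for x in seq] for seq in input_ids]


# Toy symmetric map over 3 token ids; only the pair (1, 2) carries a bias (made-up value).
toy_map = [
    [0, 0, 0],
    [0, 0, 2],
    [0, 2, 0],
]
batch = [[1, 2, 1]]
for row in pairwise_bias(batch, toy_map)[0]:
    print(row)
```

Each row `i` of the result holds the bias between token `i` and every other token in the same sequence, so a symmetric map yields a symmetric matrix per sequence.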
Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaModel(ErnieRnaPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import ErnieRnaConfig, ErnieRnaModel, RnaTokenizer\n        >>> config = ErnieRnaConfig()\n        >>> model = ErnieRnaModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 768])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 768])\n    \"\"\"\n\n    pairwise_bias_map: Tensor\n\n    def __init__(\n        self, config: ErnieRnaConfig, add_pooling_layer: bool = True, tokenizer: PreTrainedTokenizer | None = None\n    ):\n        super().__init__(config)\n        if tokenizer is None:\n            tokenizer = AutoTokenizer.from_pretrained(\"multimolecule/rna\")\n        self.tokenizer = tokenizer\n        self.pad_token_id = tokenizer.pad_token_id\n        self.vocab_size = len(self.tokenizer)\n        if self.vocab_size != config.vocab_size:\n            raise ValueError(\n                f\"Vocab size in tokenizer ({self.vocab_size}) does not match the one in config ({config.vocab_size})\"\n            )\n        token_to_ids = self.tokenizer._token_to_id\n        tokens = sorted(token_to_ids, key=token_to_ids.get)\n        pairwise_bias_dict = get_pairwise_bias_dict(config.pairwise_alpha)\n        self.register_buffer(\n            \"pairwise_bias_map\",\n            torch.tensor([[pairwise_bias_dict.get(f\"{i}{j}\", 0) for i in tokens] for j in tokens]),\n            persistent=False,\n        )\n        self.pairwise_bias_proj = nn.Sequential(\n            nn.Linear(1, config.num_attention_heads // 2),\n            nn.GELU(),\n            nn.Linear(config.num_attention_heads // 2, config.num_attention_heads),\n        )\n        self.embeddings = ErnieRnaEmbeddings(config)\n        self.encoder = ErnieRnaEncoder(config)\n       
 self.pooler = ErnieRnaPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def get_pairwise_bias(\n        self, input_ids: Tensor | NestedTensor, attention_mask: Tensor | NestedTensor | None = None\n    ) -> Tensor | NestedTensor:\n        batch_size, seq_len = input_ids.shape\n\n        # Broadcasting data indices to compute indices\n        data_index_x = input_ids.unsqueeze(2).expand(batch_size, seq_len, seq_len)\n        data_index_y = input_ids.unsqueeze(1).expand(batch_size, seq_len, seq_len)\n\n        # Get bias from pairwise_bias_map\n        return self.pairwise_bias_map[data_index_x, data_index_y]\n\n        # Zhiyuan: Is it really necessary to mask the bias?\n        # The mask position should have been nan, and the implementation is incorrect anyway\n        # if attention_mask is not None:\n        #     attention_mask = attention_mask.unsqueeze(1).expand(batch_size, seq_len, seq_len)\n        #     bias = bias * attention_mask\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: 
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_attention_biases: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | ErnieRnaModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        pairwise_bias = self.get_pairwise_bias(input_ids, attention_mask)\n        attention_bias = self.pairwise_bias_proj(pairwise_bias.unsqueeze(-1)).transpose(1, 3)\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            
self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we 
keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            attention_bias=attention_bias,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_attention_biases=output_attention_biases,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return ErnieRnaModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attention_biases=encoder_outputs.attention_biases,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
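The lookup in get_pairwise_bias relies on broadcasting two index tensors against each other. A minimal sketch of that idea follows, written in NumPy rather than PyTorch so it runs without the model. The four-token vocabulary and the pair scores are illustrative assumptions, not the table returned by get_pairwise_bias_dict.

```python
import numpy as np

# Toy vocabulary and pair scores: illustrative values only (assumption),
# not the table produced by the library.
tokens = ['A', 'C', 'G', 'U']
pair_scores = {'AU': 2.0, 'UA': 2.0, 'CG': 3.0, 'GC': 3.0, 'GU': 0.8, 'UG': 0.8}

# Same nesting order as the source: the outer loop variable j selects the row,
# the inner loop variable i selects the column, and the key is i followed by j.
bias_map = np.array([[pair_scores.get(i + j, 0.0) for i in tokens] for j in tokens])


def pairwise_bias(input_ids):
    # Shapes (batch, seq_len, 1) and (batch, 1, seq_len) broadcast to
    # (batch, seq_len, seq_len), giving one table lookup per position pair.
    index_x = input_ids[:, :, None]
    index_y = input_ids[:, None, :]
    return bias_map[index_x, index_y]


input_ids = np.array([[0, 3, 2, 1]])  # the sequence A U G C under the toy vocabulary
bias = pairwise_bias(input_ids)
```

In this toy setup, bias has shape (1, 4, 4) and is symmetric because the score table is symmetric. The model then projects each scalar to one value per attention head.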
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_attention_biases: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | ErnieRnaModelOutputWithPoolingAndCrossAttentions\n

Parameters:

encoder_hidden_states (Tensor | None, default: None)

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask (Tensor | None, default: None)

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.

past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default: None)

Tuple of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head).

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that do not have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

use_cache (bool | None, default: None)

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_attention_biases: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | ErnieRnaModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    pairwise_bias = self.get_pairwise_bias(input_ids, attention_mask)\n    attention_bias = self.pairwise_bias_proj(pairwise_bias.unsqueeze(-1)).transpose(1, 3)\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape 
= inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = 
self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        position_ids=position_ids,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        attention_bias=attention_bias,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_attention_biases=output_attention_biases,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return ErnieRnaModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attention_biases=encoder_outputs.attention_biases,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
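When no attention_mask is passed, forward derives one by comparing input_ids with the padding id, then expands it so it can be added to the attention scores. The sketch below mirrors those two steps in NumPy. The padding id of 0 and the additive-mask convention (zero for kept positions, the most negative float for padded ones) are assumptions about the usual behaviour of get_extended_attention_mask for a 2D mask, not a copy of the library code.

```python
import numpy as np


def default_attention_mask(input_ids, pad_token_id):
    # 1.0 where a real token is present, 0.0 where the position is padding
    return (input_ids != pad_token_id).astype(np.float32)


def extend_attention_mask(attention_mask):
    # Add head and query axes so the mask broadcasts over
    # (batch, num_heads, seq_len, seq_len) attention scores.
    extended = attention_mask[:, None, None, :]
    # Kept positions contribute 0, padded positions a very large negative value
    return (1.0 - extended) * np.finfo(np.float32).min


# Hypothetical token ids with two trailing padding positions (pad id assumed to be 0)
input_ids = np.array([[1, 6, 7, 2, 0, 0]])
mask = default_attention_mask(input_ids, pad_token_id=0)
extended = extend_attention_mask(mask)
```

Adding the extended mask to the raw attention scores before the softmax drives the weight on padded positions to effectively zero.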
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaPreTrainedModel","title":"ErnieRnaPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/ernierna/modeling_ernierna.py Python
class ErnieRnaPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = ErnieRnaConfig\n    base_model_prefix = \"ernierna\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"ErnieRnaLayer\", \"ErnieRnaEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/modeling_outputs/","title":"modeling_outputs","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs","title":"multimolecule.models.modeling_outputs","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput","title":"SequencePredictorOutput dataclass","text":"

Bases: ModelOutput

Base class for outputs of sentence classification & regression models.

Parameters:

loss (FloatTensor | None, default: None)

torch.FloatTensor of shape (1,).

Optional, returned when labels is provided.

logits (FloatTensor, default: None)

torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels).

Prediction outputs.

hidden_states (Tuple[FloatTensor, ...] | None, default: None)

Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True.

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.

attentions (Tuple[FloatTensor, ...] | None, default: None)

Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Optional, returned when output_attentions=True is passed or when config.output_attentions=True.

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Source code in multimolecule/models/modeling_outputs.py Python
@dataclass\nclass SequencePredictorOutput(ModelOutput):\n    \"\"\"\n    Base class for outputs of sentence classification & regression models.\n\n    Args:\n        loss:\n            `torch.FloatTensor` of shape `(1,)`.\n\n            Optional, returned when `labels` is provided\n        logits:\n            `torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`\n\n            Prediction outputs.\n        hidden_states:\n            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +\n            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.\n\n            Optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`\n\n            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.\n        attentions:\n            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,\n            sequence_length)`.\n\n            Optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`\n\n            Attention weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n    \"\"\"\n\n    loss: torch.FloatTensor | None = None\n    logits: torch.FloatTensor = None\n    hidden_states: Tuple[torch.FloatTensor, ...] | None = None\n    attentions: Tuple[torch.FloatTensor, ...] | None = None\n
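Because every field except logits is optional, calling code generally has to check which fields were populated. The sketch below illustrates that pattern with a plain standard-library dataclass. PredictorOutputSketch and summarize are hypothetical names, and the class is only a stand-in: the real output classes derive from ModelOutput and hold tensors rather than lists.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PredictorOutputSketch:
    # Hypothetical stand-in for the output classes documented above;
    # plain Python values replace tensors so the sketch has no dependencies.
    loss: Optional[float] = None
    logits: Optional[list] = None
    hidden_states: Optional[Tuple[list, ...]] = None
    attentions: Optional[Tuple[list, ...]] = None


def summarize(output):
    # Return the names of the fields that were actually populated, in field order
    return [name for name, value in vars(output).items() if value is not None]


# Without labels there is no loss; with labels the loss is filled in
inference = PredictorOutputSketch(logits=[[0.1, 0.9]])
training = PredictorOutputSketch(loss=0.42, logits=[[0.1, 0.9]])
```

Here summarize(inference) lists only logits, while summarize(training) lists loss and logits.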
"},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(loss)","title":"loss","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(logits)","title":"logits","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(hidden_states)","title":"hidden_states","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(attentions)","title":"attentions","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput","title":"TokenPredictorOutput dataclass","text":"

Bases: ModelOutput

Base class for outputs of token classification & regression models.

Parameters:

loss (FloatTensor | None, default: None)

torch.FloatTensor of shape (1,).

Optional, returned when labels is provided.

logits (FloatTensor, default: None)

torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels).

Prediction outputs.

hidden_states (Tuple[FloatTensor, ...] | None, default: None)

Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True.

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.

attentions (Tuple[FloatTensor, ...] | None, default: None)

Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Optional, returned when output_attentions=True is passed or when config.output_attentions=True.

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Source code in multimolecule/models/modeling_outputs.py Python
@dataclass\nclass TokenPredictorOutput(ModelOutput):\n    \"\"\"\n    Base class for outputs of token classification & regression models.\n\n    Args:\n        loss:\n            `torch.FloatTensor` of shape `(1,)`.\n\n            Optional, returned when `labels` is provided\n        logits:\n            `torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`\n\n            Prediction outputs.\n        hidden_states:\n            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +\n            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.\n\n            Optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`\n\n            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.\n        attentions:\n            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,\n            sequence_length)`.\n\n            Optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`\n\n            Attention weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n    \"\"\"\n\n    loss: torch.FloatTensor | None = None\n    logits: torch.FloatTensor = None\n    hidden_states: Tuple[torch.FloatTensor, ...] | None = None\n    attentions: Tuple[torch.FloatTensor, ...] | None = None\n
"},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(loss)","title":"loss","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(logits)","title":"logits","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(hidden_states)","title":"hidden_states","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(attentions)","title":"attentions","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput","title":"ContactPredictorOutput dataclass","text":"

Bases: ModelOutput

Base class for outputs of contact classification & regression models.

Parameters:

loss (FloatTensor | None, default: None)

torch.FloatTensor of shape (1,).

Optional, returned when labels is provided.

logits (FloatTensor, default: None)

torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels).

Prediction outputs.

hidden_states (Tuple[FloatTensor, ...] | None, default: None)

Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True.

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.

attentions (Tuple[FloatTensor, ...] | None, default: None)

Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Optional, returned when output_attentions=True is passed or when config.output_attentions=True.

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Source code in multimolecule/models/modeling_outputs.py Python
@dataclass\nclass ContactPredictorOutput(ModelOutput):\n    \"\"\"\n    Base class for outputs of contact classification & regression models.\n\n    Args:\n        loss:\n            `torch.FloatTensor` of shape `(1,)`.\n\n            Optional, returned when `labels` is provided\n        logits:\n            `torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`\n\n            Prediction outputs.\n        hidden_states:\n            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +\n            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.\n\n            Optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`\n\n            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.\n        attentions:\n            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,\n            sequence_length)`.\n\n            Optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`\n\n            Attention weights after the attention softmax, used to compute the weighted average in the self-attention\n            heads.\n    \"\"\"\n\n    loss: torch.FloatTensor | None = None\n    logits: torch.FloatTensor = None\n    hidden_states: Tuple[torch.FloatTensor, ...] | None = None\n    attentions: Tuple[torch.FloatTensor, ...] | None = None\n
"},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(loss)","title":"loss","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(logits)","title":"logits","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(hidden_states)","title":"hidden_states","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(attentions)","title":"attentions","text":""},{"location":"models/rinalmo/","title":"RiNALMo","text":"

Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.

"},{"location":"models/rinalmo/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks by Rafael Josip Peni\u0107, et al.

The OFFICIAL repository of RiNALMo is at lbcb-sci/RiNALMo.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

The team releasing RiNALMo did not write a model card for this model, so this model card was written by the MultiMolecule team.

"},{"location":"models/rinalmo/#model-details","title":"Model Details","text":"

RiNALMo is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/rinalmo/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 33 1280 20 5120 650.88 168.92 84.43 1022"},{"location":"models/rinalmo/#links","title":"Links","text":""},{"location":"models/rinalmo/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/rinalmo/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rinalmo\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.3932918310165405,\n  'token': 6,\n  'token_str': 'A',\n  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.2897723913192749,\n  'token': 9,\n  'token_str': 'U',\n  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.15423105657100677,\n  'token': 22,\n  'token_str': 'X',\n  'sequence': 'G G U C X C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.12160095572471619,\n  'token': 7,\n  'token_str': 'C',\n  'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.0408296100795269,\n  'token': 8,\n  'token_str': 'G',\n  'sequence': 'G G U C G C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rinalmo/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rinalmo/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, RiNALMoModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoModel.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rinalmo/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RiNALMoForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoForSequencePrediction.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rinalmo/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RiNALMoForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoForTokenPrediction.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rinalmo/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RiNALMoForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoForContactPrediction.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
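The random `label` in the snippet above only exercises the loss. Real contact labels are typically a symmetric L x L matrix with a 1 wherever two nucleotides pair. The following is a minimal sketch, assuming the secondary structure is available in dot-bracket notation; `dot_bracket_to_contacts` is a hypothetical helper written for illustration and is not part of multimolecule.

```python
def dot_bracket_to_contacts(structure):
    """Return a symmetric L x L contact map (1 = paired) as nested lists.

    Hypothetical helper: assumes plain dot-bracket input using only
    '(', ')' and '.', i.e. no pseudoknot brackets.
    """
    length = len(structure)
    contacts = [[0] * length for _ in range(length)]
    stack = []  # indices of opening brackets that are still unmatched
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {j}")
            i = stack.pop()
            # A base pair is undirected, so fill both (i, j) and (j, i).
            contacts[i][j] = 1
            contacts[j][i] = 1
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {j}")
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return contacts


contact_map = dot_bracket_to_contacts("((..))")
for row in contact_map:
    print(row)
```

The nested list could then be wrapped with `torch.tensor(contact_map)` and passed as `labels`, provided its side length matches the sequence length.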
"},{"location":"models/rinalmo/#training-details","title":"Training Details","text":"

RiNALMo used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
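To make the objective concrete, here is a minimal pure-Python sketch of BERT-style masking. The 15% selection rate comes from the text above. The 80/10/10 split (mask token, random token, unchanged) is the standard BERT recipe and is only assumed here to apply to RiNALMo. The token ids follow the RnaTokenizer examples on this page (`<pad>`=0, `<cls>`=1, `<eos>`=2, `<mask>`=4, A/C/G/U=6..9), and `mask_tokens` is a hypothetical helper, not a multimolecule function.

```python
import random


def mask_tokens(token_ids, mask_id=4, vocab_ids=(6, 7, 8, 9),
                special_ids=(0, 1, 2), mask_prob=0.15, seed=None):
    """Return (masked_inputs, labels) for one tokenized sequence.

    Labels are -100 at positions that do not contribute to the loss,
    following the common PyTorch ignore_index convention.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, token in enumerate(token_ids):
        # Never select special tokens; select others with mask_prob.
        if token in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # assumed 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab_ids)  # assumed 10%: random token
        # assumed remaining 10%: keep the original token unchanged
    return inputs, labels


# <cls> A C G U <eos>, using the ids from the tokenizer examples
masked, labels = mask_tokens([1, 6, 7, 8, 9, 2], seed=0)
print(masked, labels)
```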

"},{"location":"models/rinalmo/#training-data","title":"Training Data","text":"

The RiNALMo model was pre-trained on a cocktail of databases including RNAcentral, Rfam, Ensembl Genome Browser, and Nucleotide. The training data contains 36 million unique ncRNA sequences.

To ensure sequence diversity in each training batch, RiNALMo clustered the sequences with MMSeqs2 into 17 million clusters and then sampled each sequence in the batch from a different cluster.
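The batch construction described above can be sketched as follows. This is an illustration under stated assumptions, not RiNALMo's actual data loader: the cluster assignment is taken as given (for example from a prior MMSeqs2 run), and `sample_batch` and the toy `clusters` mapping are hypothetical.

```python
import random


def sample_batch(clusters, batch_size, seed=None):
    """Draw batch_size sequences so that no two share a cluster.

    clusters maps a cluster id to the list of sequences in that cluster.
    """
    if batch_size > len(clusters):
        raise ValueError("batch_size cannot exceed the number of clusters")
    rng = random.Random(seed)
    # Pick distinct clusters first, then one member from each of them.
    chosen = rng.sample(sorted(clusters), batch_size)
    return [(cluster_id, rng.choice(clusters[cluster_id])) for cluster_id in chosen]


clusters = {
    "c1": ["ACGU", "ACGG"],
    "c2": ["UUUU"],
    "c3": ["GGCC", "GGCA", "GGCU"],
}
batch = sample_batch(clusters, batch_size=2, seed=0)
print(batch)
```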

RiNALMo preprocessed all tokens by replacing \u201cU\u201ds with \u201cT\u201ds.

Note that during model conversion, \u201cT\u201d is replaced with \u201cU\u201d. RnaTokenizer will convert \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False.
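The normalisation step can be sketched in plain Python. The function below mirrors the first lines of `RnaTokenizer._tokenize` from the source listing on this page; `normalize_rna` itself is a hypothetical standalone name used only for illustration.

```python
def normalize_rna(text, replace_T_with_U=True, do_upper_case=True):
    """Upper-case the sequence, then map T to U, as the tokenizer does."""
    if do_upper_case:
        text = text.upper()
    if replace_T_with_U:
        # Only upper-case "T" is replaced, so this relies on the step above.
        text = text.replace("T", "U")
    return text


print(normalize_rna("acgt"))                          # ACGU
print(normalize_rna("acgt", replace_T_with_U=False))  # ACGT
```

This matches the tokenizer examples on this page, where `'acgt'` and `'acgu'` yield the same ids by default, while `replace_T_with_U=False` maps the `T` to the unknown token.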

"},{"location":"models/rinalmo/#training-procedure","title":"Training Procedure","text":""},{"location":"models/rinalmo/#preprocessing","title":"Preprocessing","text":"

RiNALMo used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:

"},{"location":"models/rinalmo/#pretraining","title":"PreTraining","text":"

The model was trained on 7 NVIDIA A100 GPUs, each with 80 GiB of memory.

"},{"location":"models/rinalmo/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article{penic2024rinalmo,\n  title={RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks},\n  author={Peni\u0107, Rafael Josip and Vla\u0161i\u0107, Tin and Huber, Roland G. and Wan, Yue and \u0160iki\u0107, Mile},\n  journal={arXiv preprint arXiv:2403.00043},\n  year={2024}\n}\n
"},{"location":"models/rinalmo/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the RiNALMo paper for questions or comments on the paper/model.

"},{"location":"models/rinalmo/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo","title":"multimolecule.models.rinalmo","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

Name (Type, Default): Description

alphabet (Alphabet | str | List[str] | None, default None): Alphabet to use for tokenization.

nmers (int, default 1): Size of kmer to tokenize.

codon (bool, default False): Whether to tokenize into codons.

replace_T_with_U (bool, default True): Whether to replace T with U.

do_upper_case (bool, default True): Whether to convert input to uppercase.

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
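The difference between `nmers=3` and `codon=True` in the session above is the stride: k-mers use an overlapping sliding window, while codons are non-overlapping triplets. A minimal sketch that mirrors the splitting logic in `_tokenize` (the standalone function names `kmers` and `codons` are assumptions for illustration):

```python
def kmers(text, k):
    """Overlapping windows of length k with stride 1."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def codons(text):
    """Non-overlapping triplets; the length must be a multiple of 3."""
    if len(text) % 3 != 0:
        raise ValueError(
            f"length of input sequence must be a multiple of 3, but got {len(text)}"
        )
    return [text[i:i + 3] for i in range(0, len(text), 3)]


print(kmers("UAGCUUAUC", 3))  # ['UAG', 'AGC', 'GCU', 'CUU', 'UUA', 'UAU', 'AUC']
print(codons("UAGCUUAUC"))    # ['UAG', 'CUU', 'AUC']
```

The seven 3-mers and three codons line up with the seven and three inner token ids in the console session, and the first, fourth, and seventh 3-mers are exactly the three codons (ids 83, 49, 22).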
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig","title":"RiNALMoConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a RiNALMoModel. It is used to instantiate a RiNALMo model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RiNALMo lbcb-sci/RiNALMo architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

Name (Type, Default): Description

vocab_size (int, default 26): Vocabulary size of the RiNALMo model. Defines the number of different tokens that can be represented by the input_ids passed when calling RiNALMoModel.

hidden_size (int, default 1280): Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, default 33): Number of hidden layers in the Transformer encoder.

num_attention_heads (int, default 20): Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, default 5120): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.

hidden_dropout (float, default 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, default 0.1): The dropout ratio for the attention probabilities.

max_position_embeddings (int, default 1024): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, default 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, default 1e-12): The epsilon used by the layer normalization layers.

position_embedding_type (str, default 'rotary'): Type of position embedding. Choose one of \"absolute\", \"relative_key\", \"relative_key_query\", \"rotary\". For positional embeddings use \"absolute\". For more information on \"relative_key\", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, default False): Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, default True): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

emb_layer_norm_before (bool, default True): Whether to apply layer normalization after embeddings but before the main stem of the network.

token_dropout (bool, default True): When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.

Examples:

Python Console Session
>>> from multimolecule import RiNALMoModel, RiNALMoConfig\n>>> # Initializing a RiNALMo multimolecule/rinalmo style configuration\n>>> configuration = RiNALMoConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rinalmo style configuration\n>>> model = RiNALMoModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
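A few derived quantities follow directly from the default values. The sketch below uses a plain dictionary populated from the documented defaults rather than the real `RiNALMoConfig`, so it runs without the library; the variable names are illustrative only.

```python
# Documented defaults, copied from the parameter list above.
defaults = {
    "hidden_size": 1280,
    "num_hidden_layers": 33,
    "num_attention_heads": 20,
    "intermediate_size": 5120,
    "max_position_embeddings": 1024,
}

# Multi-head attention splits the hidden dimension evenly across heads.
if defaults["hidden_size"] % defaults["num_attention_heads"] != 0:
    raise ValueError("hidden_size must be divisible by num_attention_heads")
head_dim = defaults["hidden_size"] // defaults["num_attention_heads"]

# The feed-forward layer is a fixed multiple of the hidden size.
ffn_ratio = defaults["intermediate_size"] / defaults["hidden_size"]

print(head_dim)   # 64
print(ffn_ratio)  # 4.0
```

The model card lists a maximum of 1022 tokens against `max_position_embeddings=1024`. That gap would be consistent with two positions being reserved for the `<cls>` and `<eos>` tokens, though the source does not state this explicitly.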
Source code in multimolecule/models/rinalmo/configuration_rinalmo.py Python
class RiNALMoConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`RiNALMoModel`][multimolecule.models.RiNALMoModel].\n    It is used to instantiate a RiNALMo model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the RiNALMo\n    [lbcb-sci/RiNALMo](https://github.com/lbcb-sci/RiNALMo) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the RiNALMo model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`RiNALMoModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        position_embedding_type:\n            Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n            For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n            For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n        is_decoder:\n            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n        use_cache:\n            Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n            relevant if `config.is_decoder=True`.\n        emb_layer_norm_before:\n            Whether to apply layer normalization after embeddings but before the main stem of the network.\n        token_dropout:\n            When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n    Examples:\n        >>> from multimolecule import RiNALMoModel, RiNALMoConfig\n        >>> # Initializing a RiNALMo multimolecule/rinalmo style configuration\n        >>> configuration = RiNALMoConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/rinalmo style configuration\n        >>> model = RiNALMoModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"rinalmo\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        hidden_size: int = 1280,\n        num_hidden_layers: int = 33,\n        num_attention_heads: int = 20,\n        intermediate_size: int = 5120,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1024,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"rotary\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        emb_layer_norm_before: bool = True,\n        learnable_beta: bool = True,\n        token_dropout: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = 
hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.learnable_beta = learnable_beta\n        self.token_dropout = token_dropout\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n        self.emb_layer_norm_before = emb_layer_norm_before\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(position_embedding_type)","title":"position_embedding_type","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(is_decoder)","title":"is_decoder","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(use_cache)","title":"use_cache","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(emb_layer_norm_before)","title":"emb_layer_norm_before","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(token_dropout)","title":"token_dropout","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForCont
actPrediction","title":"RiNALMoForContactPrediction","text":"

Bases: RiNALMoPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RiNALMoConfig, RiNALMoForContactPrediction, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
class RiNALMoForContactPrediction(RiNALMoPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RiNALMoConfig, RiNALMoForContactPrediction, RnaTokenizer\n        >>> config = RiNALMoConfig()\n        >>> model = RiNALMoForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RiNALMoConfig):\n        super().__init__(config)\n        self.rinalmo = RiNALMoModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rinalmo(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
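The `return_dict` branch at the end of `forward` changes where `logits` sits in the result: with `return_dict=False` the loss, when present, is prepended to the tuple. A simplified sketch of that packing logic follows. It is not the library code: `pack_outputs` is a hypothetical function, and a plain dict and tuple stand in for `ContactPredictorOutput` and the backbone outputs.

```python
def pack_outputs(loss, logits, extras, return_dict):
    """Mimic the tuple-vs-structured return convention of forward().

    extras stands in for the trailing backbone outputs (outputs[2:]).
    """
    if return_dict:
        # Structured output: fields are accessed by name.
        return {"loss": loss, "logits": logits, "extras": extras}
    output = (logits,) + tuple(extras)
    # The loss is only prepended when labels were provided.
    return ((loss,) + output) if loss is not None else output


print(pack_outputs(None, "logits", ("attn",), return_dict=False))  # ('logits', 'attn')
print(pack_outputs(0.5, "logits", ("attn",), return_dict=False))   # (0.5, 'logits', 'attn')
```

Because the index of `logits` shifts from 0 to 1 once labels are passed, accessing results by name (`output["logits"]`, as in the console sessions) is the less error-prone choice.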
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForMaskedLM","title":"RiNALMoForMaskedLM","text":"

Bases: RiNALMoPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RiNALMoConfig, RiNALMoForMaskedLM, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
class RiNALMoForMaskedLM(RiNALMoPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RiNALMoConfig, RiNALMoForMaskedLM, RnaTokenizer\n        >>> config = RiNALMoConfig()\n        >>> model = RiNALMoForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: RiNALMoConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `RiNALMoForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.rinalmo = RiNALMoModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rinalmo(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForSequencePrediction","title":"RiNALMoForSequencePrediction","text":"

Bases: RiNALMoPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RiNALMoConfig, RiNALMoForSequencePrediction, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
class RiNALMoForSequencePrediction(RiNALMoPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RiNALMoConfig, RiNALMoForSequencePrediction, RnaTokenizer\n        >>> config = RiNALMoConfig()\n        >>> model = RiNALMoForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RiNALMoConfig):\n        super().__init__(config)\n        self.rinalmo = RiNALMoModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rinalmo(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForTokenPrediction","title":"RiNALMoForTokenPrediction","text":"

Bases: RiNALMoPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RiNALMoConfig, RiNALMoForTokenPrediction, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
class RiNALMoForTokenPrediction(RiNALMoPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RiNALMoConfig, RiNALMoForTokenPrediction, RnaTokenizer\n        >>> config = RiNALMoConfig()\n        >>> model = RiNALMoForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RiNALMoConfig):\n        super().__init__(config)\n        self.rinalmo = RiNALMoModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rinalmo(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel","title":"RiNALMoModel","text":"

Bases: RiNALMoPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RiNALMoConfig, RiNALMoModel, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 1280])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 1280])\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
class RiNALMoModel(RiNALMoPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RiNALMoConfig, RiNALMoModel, RnaTokenizer\n        >>> config = RiNALMoConfig()\n        >>> model = RiNALMoModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 1280])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 1280])\n    \"\"\"\n\n    def __init__(self, config: RiNALMoConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = RiNALMoEmbeddings(config)\n        self.encoder = RiNALMoEncoder(config)\n        self.pooler = RiNALMoPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        
else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape 
[num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

encoder_hidden_states (Type: Tensor | None, Default: None)

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask (Type: Tensor | None, Default: None)

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.

past_key_values (Type: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, Default: None)

Tuple of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

use_cache (Type: bool | None, Default: None)

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        attention_mask=attention_mask,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoPreTrainedModel","title":"RiNALMoPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/rinalmo/modeling_rinalmo.py Python
class RiNALMoPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = RiNALMoConfig\n    base_model_prefix = \"rinalmo\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"RiNALMoLayer\", \"RiNALMoEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/rnabert/","title":"RNABERT","text":"

Pre-trained model on non-coding RNA (ncRNA) using masked language modeling (MLM) and structural alignment learning (SAL) objectives.

"},{"location":"models/rnabert/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of the Informative RNA-base embedding for functional RNA clustering and structural alignment by Manato Akiyama and Yasubumi Sakakibara.

The OFFICIAL repository of RNABERT is at mana438/RNABERT.

Caution

The MultiMolecule team is aware of a potential risk in reproducing the results of RNABERT.

The original implementation of RNABERT does not prepend <cls> and append <eos> tokens to the input sequence. This should not affect the performance of the model in most cases, but it can lead to unexpected behavior in some cases.

Please set cls_token=None and eos_token=None explicitly in the tokenizer if you want the exact behavior of the original implementation.
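The effect of these two special tokens on the input length can be sketched in plain Python. This is only an illustration, not the RnaTokenizer implementation: the helper add_special_tokens is hypothetical, and the token strings are taken from the caution above.

```python
def add_special_tokens(nucleotides, cls_token='<cls>', eos_token='<eos>'):
    """Hypothetical helper: wrap a nucleotide string with optional special tokens."""
    tokens = list(nucleotides)
    if cls_token is not None:
        tokens.insert(0, cls_token)  # prepend <cls>
    if eos_token is not None:
        tokens.append(eos_token)  # append <eos>
    return tokens


# Default behaviour: two extra positions around the five nucleotides.
with_specials = add_special_tokens('ACGUN')

# Behaviour matching the original RNABERT: no <cls> and no <eos>.
without_specials = add_special_tokens('ACGUN', cls_token=None, eos_token=None)

print(len(with_specials), len(without_specials))
```

With both tokens enabled the sequence grows by two positions, so any per-position output has to be aligned with the nucleotides accordingly.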

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team releasing RNABERT did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/rnabert/#model-details","title":"Model Details","text":"

RNABERT is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/rnabert/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 6 120 12 40 0.48 0.15 0.08 440"},{"location":"models/rnabert/#links","title":"Links","text":""},{"location":"models/rnabert/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/rnabert/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnabert\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.03852083534002304,\n  'token': 24,\n  'token_str': '-',\n  'sequence': 'G G U C - C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.03851056098937988,\n  'token': 10,\n  'token_str': 'N',\n  'sequence': 'G G U C N C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.03849703073501587,\n  'token': 25,\n  'token_str': 'I',\n  'sequence': 'G G U C I C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.03848597779870033,\n  'token': 3,\n  'token_str': '<unk>',\n  'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.038484156131744385,\n  'token': 5,\n  'token_str': '<null>',\n  'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnabert/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnabert/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, RnaBertModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertModel.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnabert/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertForSequencePrediction.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnabert/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaBertForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertForTokenPrediction.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnabert/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaBertForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertForContactPrediction.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnabert/#training-details","title":"Training Details","text":"

RNABERT has two pre-training objectives: masked language modeling (MLM) and structural alignment learning (SAL).

"},{"location":"models/rnabert/#training-data","title":"Training Data","text":"

The RNABERT model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.

RNABERT used a subset of 76,237 human ncRNA sequences from RNAcentral for pre-training. RNABERT preprocessed all tokens by replacing \u201cU\u201ds with \u201cT\u201ds.

Note that during model conversion, \u201cT\u201d is replaced with \u201cU\u201d. RnaTokenizer converts \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False.
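
The conversion described above can be illustrated with a minimal pure-Python sketch. It mirrors the two normalisation steps visible in RnaTokenizer._tokenize further down this page (upper-casing, then replacing T with U). The function name normalize_rna is hypothetical and is not part of MultiMolecule.

```python
def normalize_rna(text, replace_T_with_U=True, do_upper_case=True):
    # Hypothetical helper, not a MultiMolecule API: it only reproduces the
    # string normalisation that precedes token splitting.
    if do_upper_case:
        text = text.upper()
    if replace_T_with_U:
        text = text.replace('T', 'U')
    return text


dna_like = 'acgt'
converted = normalize_rna(dna_like)                           # T becomes U
preserved = normalize_rna(dna_like, replace_T_with_U=False)   # T is kept
print(converted, preserved)
```

With replacement disabled, the T survives normalisation, which is consistent with the tokenizer doctest below where 'acgt' then maps T to the unknown-token id.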

"},{"location":"models/rnabert/#training-procedure","title":"Training Procedure","text":""},{"location":"models/rnabert/#preprocessing","title":"Preprocessing","text":"

RNABERT preprocesses the dataset by applying 10 different mask patterns to the 72,237 human ncRNA sequences. The final dataset contains 722,370 sequences. The masking procedure is similar to the one used in BERT.
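
As a rough illustration of the preprocessing arithmetic, the sketch below checks the dataset size and generates several independent mask patterns for one sequence. It is a simplified sketch under stated assumptions: the helper mask_patterns is hypothetical, the 15% mask rate is BERT's published default and is not confirmed for RNABERT on this page, and every selected position is replaced by the mask token rather than following BERT's full replacement scheme.

```python
import random

NUM_SEQUENCES = 72_237   # figure quoted in the preprocessing paragraph
NUM_PATTERNS = 10        # mask patterns applied to each sequence
MASK_PROB = 0.15         # assumption: BERT's default rate, not stated on this page

total_examples = NUM_SEQUENCES * NUM_PATTERNS


def mask_patterns(sequence, num_patterns=NUM_PATTERNS, mask_prob=MASK_PROB, seed=0):
    # Hypothetical helper: returns num_patterns token lists, each masked
    # independently of the others.
    rng = random.Random(seed)
    patterns = []
    for _ in range(num_patterns):
        tokens = ['<mask>' if rng.random() < mask_prob else base for base in sequence]
        patterns.append(tokens)
    return patterns


patterns = mask_patterns('UAGCUUAUCAGACUGAUGUUGA')
print(total_examples, len(patterns))
```

Each input sequence contributes one training example per pattern, which is why the final dataset is ten times the number of source sequences.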

"},{"location":"models/rnabert/#pretraining","title":"PreTraining","text":"

The model was trained on 1 NVIDIA V100 GPU.

"},{"location":"models/rnabert/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article{akiyama2022informative,\n    author = {Akiyama, Manato and Sakakibara, Yasubumi},\n    title = \"{Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning}\",\n    journal = {NAR Genomics and Bioinformatics},\n    volume = {4},\n    number = {1},\n    pages = {lqac012},\n    year = {2022},\n    month = {02},\n    abstract = \"{Effective embedding is actively conducted by applying deep learning to biomolecular information. Obtaining better embeddings enhances the quality of downstream analyses, such as DNA sequence motif detection and protein function prediction. In this study, we adopt a pre-training algorithm for the effective embedding of RNA bases to acquire semantically rich representations and apply this algorithm to two fundamental RNA sequence problems: structural alignment and clustering. By using the pre-training algorithm to embed the four bases of RNA in a position-dependent manner using a large number of RNA sequences from various RNA families, a context-sensitive embedding representation is obtained. As a result, not only base information but also secondary structure and context information of RNA sequences are embedded for each base. We call this \u2018informative base embedding\u2019 and use it to achieve accuracies superior to those of existing state-of-the-art methods on RNA structural alignment and RNA family clustering tasks. 
Furthermore, upon performing RNA sequence alignment by combining this informative base embedding with a simple Needleman\u2013Wunsch alignment algorithm, we succeed in calculating structural alignments with a time complexity of O(n2) instead of the O(n6) time complexity of the naive implementation of Sankoff-style algorithm for input RNA sequence of length n.}\",\n    issn = {2631-9268},\n    doi = {10.1093/nargab/lqac012},\n    url = {https://doi.org/10.1093/nargab/lqac012},\n    eprint = {https://academic.oup.com/nargab/article-pdf/4/1/lqac012/42577168/lqac012.pdf},\n}\n
"},{"location":"models/rnabert/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the RNABERT paper for questions or comments on the paper/model.

"},{"location":"models/rnabert/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert","title":"multimolecule.models.rnabert","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None): Alphabet to use for tokenization. Default: None

nmers (int): Size of kmer to tokenize. Default: 1

codon (bool): Whether to tokenize into codons. Default: False

replace_T_with_U (bool): Whether to replace T with U. Default: True

do_upper_case (bool): Whether to convert input to uppercase. Default: True

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
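
The difference between nmers=3 and codon=True in the session above comes down to overlapping versus non-overlapping windows. The following stand-alone sketch follows the branching shown in the RnaTokenizer._tokenize source listed below, but it only splits strings and does not map tokens to ids. The name split_tokens is hypothetical.

```python
def split_tokens(text, nmers=1, codon=False):
    # Hypothetical stand-alone helper; it follows the branching of
    # RnaTokenizer._tokenize but is not part of MultiMolecule.
    text = text.upper().replace('T', 'U')
    if codon:
        if len(text) % 3 != 0:
            raise ValueError(
                f'length of input sequence must be a multiple of 3, but got {len(text)}'
            )
        # Non-overlapping windows: step by 3.
        return [text[i:i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # Overlapping windows: step by 1.
        return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


kmers = split_tokens('uagcuuauc', nmers=3)
codons = split_tokens('uagcuuauc', codon=True)
print(kmers)
print(codons)
```

A 9-nucleotide input yields 7 overlapping 3-mers but only 3 codons, matching the 7 and 3 non-special ids in the session above.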
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig","title":"RnaBertConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of an RnaBertModel. It is used to instantiate an RnaBert model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaBert mana438/RNABERT architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int): Vocabulary size of the RnaBert model. Defines the number of different tokens that can be represented by the input_ids passed when calling RnaBertModel. Default: 26

hidden_size (int | None): Dimensionality of the encoder layers and the pooler layer. Default: None (resolved to num_attention_heads * multiple when multiple is given, otherwise 120)

num_hidden_layers (int): Number of hidden layers in the Transformer encoder. Default: 6

num_attention_heads (int): Number of attention heads for each attention layer in the Transformer encoder. Default: 12

intermediate_size (int): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder. Default: 40

hidden_dropout (float): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. Default: 0.0

attention_dropout (float): The dropout ratio for the attention probabilities. Default: 0.0

max_position_embeddings (int): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Default: 440

initializer_range (float): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. Default: 0.02

layer_norm_eps (float): The epsilon used by the layer normalization layers. Default: 1e-12

Examples:

Python Console Session
>>> from multimolecule import RnaBertModel, RnaBertConfig\n>>> # Initializing a RNABERT multimolecule/rnabert style configuration\n>>> configuration = RnaBertConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnabert style configuration\n>>> model = RnaBertModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
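
Because hidden_size defaults to None, its effective value is derived inside the constructor. The sketch below isolates that rule as it appears in the RnaBertConfig source listed below. The function resolve_hidden_size is a hypothetical stand-alone helper, not a MultiMolecule API.

```python
def resolve_hidden_size(hidden_size=None, multiple=None, num_attention_heads=12):
    # Hypothetical helper reproducing the fallback used in RnaBertConfig.__init__:
    # an explicit hidden_size wins, then heads * multiple, then 120.
    if hidden_size is None:
        hidden_size = num_attention_heads * multiple if multiple is not None else 120
    return hidden_size


default_size = resolve_hidden_size()
scaled_size = resolve_hidden_size(multiple=20)
explicit_size = resolve_hidden_size(hidden_size=64, multiple=20)
per_head = default_size // 12   # width of each attention head at the defaults
print(default_size, scaled_size, explicit_size, per_head)
```

The default of 120 agrees with the last_hidden_state shape of [1, 7, 120] reported in the RnaBertModel example on this page.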
Source code in multimolecule/models/rnabert/configuration_rnabert.py Python
class RnaBertConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`RnaBertModel`][multimolecule.models.RnaBertModel].\n    It is used to instantiate a RnaBert model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaBert\n    [mana438/RNABERT](https://github.com/mana438/RNABERT) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the RnaBert model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`RnaBertModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n\n    Examples:\n        >>> from multimolecule import RnaBertModel, RnaBertConfig\n        >>> # Initializing a RNABERT multimolecule/rnabert style configuration\n        >>> configuration = RnaBertConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/rnabert style configuration\n        >>> model = RnaBertModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"rnabert\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        ss_vocab_size: int = 8,\n        hidden_size: int | None = None,\n        multiple: int | None = None,\n        num_hidden_layers: int = 6,\n        num_attention_heads: int = 12,\n        intermediate_size: int = 40,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.0,\n        attention_dropout: float = 0.0,\n        max_position_embeddings: int = 440,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"absolute\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        if hidden_size is None:\n            hidden_size = num_attention_heads * multiple if multiple is not None else 120\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.ss_vocab_size = ss_vocab_size\n        self.type_vocab_size = 2\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        
self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForContactPrediction","title":"RnaBertForContactPrediction","text":"

Bases: RnaBertPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaBertConfig, RnaBertForContactPrediction, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertForContactPrediction(RnaBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaBertConfig, RnaBertForContactPrediction, RnaTokenizer\n        >>> config = RnaBertConfig()\n        >>> model = RnaBertForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaBertConfig):\n        super().__init__(config)\n        self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnabert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForMaskedLM","title":"RnaBertForMaskedLM","text":"

Bases: RnaBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaBertConfig, RnaBertForMaskedLM, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertForMaskedLM(RnaBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaBertConfig, RnaBertForMaskedLM, RnaTokenizer\n        >>> config = RnaBertConfig()\n        >>> model = RnaBertForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: RnaBertConfig):\n        super().__init__(config)\n        self.rnabert = RnaBertModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool = False,\n        output_hidden_states: bool = False,\n        return_dict: bool = True,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnabert(\n            input_ids,\n            attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForPreTraining","title":"RnaBertForPreTraining","text":"

Bases: RnaBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaBertConfig, RnaBertForPreTraining, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits_mlm\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"logits_ss\"].shape\ntorch.Size([1, 7, 8])\n>>> output[\"logits_sal\"].shape\ntorch.Size([1, 2])\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertForPreTraining(RnaBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaBertConfig, RnaBertForPreTraining, RnaTokenizer\n        >>> config = RnaBertConfig()\n        >>> model = RnaBertForPreTraining(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<AddBackward0>)\n        >>> output[\"logits_mlm\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"logits_ss\"].shape\n        torch.Size([1, 7, 8])\n        >>> output[\"logits_sal\"].shape\n        torch.Size([1, 2])\n    \"\"\"\n\n    _tied_weights_keys = [\n        \"lm_head.decoder.weight\",\n        \"lm_head.decoder.bias\",\n        \"pretrain.predictions.decoder.weight\",\n        \"pretrain.predictions.decoder.bias\",\n        \"pretrain.predictions_ss.decoder.weight\",\n        \"pretrain.predictions_ss.decoder.bias\",\n    ]\n\n    def __init__(self, config: RnaBertConfig):\n        super().__init__(config)\n        self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n        self.pretrain = RnaBertPreTrainingHeads(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        labels_mlm: Tensor | None = None,\n        labels_ss: Tensor | None = None,\n        labels_sal: Tensor | None = None,\n        output_attentions: bool = False,\n        output_hidden_states: bool = False,\n        return_dict: bool = True,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaBertForPreTrainingOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnabert(\n            input_ids,\n            attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        total_loss, logits_mlm, logits_ss, logits_sal = self.pretrain(\n            outputs, labels_mlm=labels_mlm, labels_ss=labels_ss, labels_sal=labels_sal\n        )\n\n        if not return_dict:\n            output = (logits_mlm, logits_ss, logits_sal) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return RnaBertForPreTrainingOutput(\n            loss=total_loss,\n            logits_mlm=logits_mlm,\n            logits_ss=logits_ss,\n            logits_sal=logits_sal,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForSequencePrediction","title":"RnaBertForSequencePrediction","text":"

Bases: RnaBertPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaBertConfig, RnaBertForSequencePrediction, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertForSequencePrediction(RnaBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaBertConfig, RnaBertForSequencePrediction, RnaTokenizer\n        >>> config = RnaBertConfig()\n        >>> model = RnaBertForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaBertConfig):\n        super().__init__(config)\n        self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnabert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForTokenPrediction","title":"RnaBertForTokenPrediction","text":"

Bases: RnaBertPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaBertConfig, RnaBertForTokenPrediction, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
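
The shapes reported by the doctests on this page suggest a simple bookkeeping rule: the tokenizer adds one cls and one eos token, so the encoder sees the sequence length plus 2, while the token and contact heads report one entry per nucleotide. The sketch below encodes that reading. It is inferred from the example outputs only, and expected_shapes is a hypothetical helper rather than a MultiMolecule function.

```python
def expected_shapes(sequence, hidden_size=120, num_labels=1, batch_size=1):
    # Hypothetical helper. Assumes one <cls> and one <eos> token are added to
    # the sequence and that the prediction heads drop them again, as the
    # doctest outputs on this page suggest.
    length = len(sequence)
    return {
        'input_ids': (batch_size, length + 2),
        'last_hidden_state': (batch_size, length + 2, hidden_size),
        'token_logits': (batch_size, length, num_labels),
        'contact_logits': (batch_size, length, length, num_labels),
    }


shapes = expected_shapes('ACGUN')
for name, shape in shapes.items():
    print(name, shape)
```

When preparing labels for fine-tuning, this means nucleotide-level labels should have one entry per base and contact labels one entry per pair of bases, without positions for the special tokens.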
Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertForTokenPrediction(RnaBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaBertConfig, RnaBertForTokenPrediction, RnaTokenizer\n        >>> config = RnaBertConfig()\n        >>> model = RnaBertForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaBertConfig):\n        super().__init__(config)\n        self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnabert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel","title":"RnaBertModel","text":"

Bases: RnaBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaBertConfig, RnaBertModel, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 120])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 120])\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertModel(RnaBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaBertConfig, RnaBertModel, RnaTokenizer\n        >>> config = RnaBertConfig()\n        >>> model = RnaBertModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 120])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 120])\n    \"\"\"\n\n    def __init__(self, config: RnaBertConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = RnaBertEmbeddings(config)\n        self.encoder = RnaBertEncoder(config)\n        self.pooler = RnaBertPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        
else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape 
[num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

encoder_hidden_states (Tensor | None, default: None)

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask (Tensor | None, default: None)

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.

past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default: None)

Tuple of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head).

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

use_cache (bool | None, default: None)

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertPreTrainedModel","title":"RnaBertPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/rnabert/modeling_rnabert.py Python
class RnaBertPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = RnaBertConfig\n    base_model_prefix = \"rnabert\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"RnaBertLayer\", \"RnaBertEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/rnaernie/","title":"RNAErnie","text":"

Pre-trained model on non-coding RNA (ncRNA) using a multi-stage masked language modeling (MLM) objective.

"},{"location":"models/rnaernie/#statement","title":"Statement","text":"

Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.

Machine learning has been at the forefront of the movement for free and open access to research.

We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.

The MultiMolecule team is committed to the principles of open access and open science.

We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.

Please consider signing the Statement on Nature Machine Intelligence.

"},{"location":"models/rnaernie/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of the RNAErnie: An RNA Language Model with Structure-enhanced Representations by Ning Wang, Jiang Bian, Haoyi Xiong, et al.

The OFFICIAL repository of RNAErnie is at CatIIIIIIII/RNAErnie.

Warning

The MultiMolecule team is unable to confirm that the provided model and checkpoints are producing the same intermediate representations as the original implementation. This is because the proposed method is published in a Closed Access / Author-Fee journal.

The team releasing RNAErnie did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/rnaernie/#model-details","title":"Model Details","text":"

RNAErnie is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

Note that during the conversion process, additional tokens such as [IND] and ncRNA class symbols are removed.
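
The parameter count listed in the Model Specification for this card (86.06 M for 12 layers, hidden size 768, intermediate size 3072, 512 positions) can be roughly reproduced by hand. The sketch below is a back-of-the-envelope count under stated assumptions, not a description of the actual module layout: a standard BERT-style encoder with biases, a pooler, no token-type embeddings, and a vocabulary of 26 tokens. The helper bert_encoder_params is hypothetical.

```python
def bert_encoder_params(num_layers=12, hidden=768, intermediate=3072,
                        vocab_size=26, max_positions=512):
    '''Rough parameter count for a BERT-style encoder with a pooler.

    Assumptions (not confirmed by the model card): biases everywhere,
    no token-type embeddings, vocabulary of 26 tokens.
    '''
    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix + bias

    layer_norm = 2 * hidden                 # scale + shift
    attention = 4 * linear(hidden, hidden)  # query, key, value, output
    feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    embeddings = vocab_size * hidden + max_positions * hidden + layer_norm
    pooler = linear(hidden, hidden)
    return num_layers * per_layer + embeddings + pooler


total = bert_encoder_params()
print(round(total / 1e6, 2))  # 86.06 (millions of parameters)
```

Almost all of the parameters sit in the 12 encoder layers; under these assumptions the embeddings contribute well under 1% because the nucleotide vocabulary is tiny.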

"},{"location":"models/rnaernie/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 12 768 12 3072 86.06 22.36 11.17 512"},{"location":"models/rnaernie/#links","title":"Links","text":""},{"location":"models/rnaernie/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/rnaernie/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnaernie\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.09252794831991196,\n  'token': 8,\n  'token_str': 'G',\n  'sequence': 'G G U C G C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.09062391519546509,\n  'token': 11,\n  'token_str': 'R',\n  'sequence': 'G G U C R C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.08875908702611923,\n  'token': 6,\n  'token_str': 'A',\n  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07809742540121078,\n  'token': 20,\n  'token_str': 'V',\n  'sequence': 'G G U C V C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07325706630945206,\n  'token': 13,\n  'token_str': 'S',\n  'sequence': 'G G U C S C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnaernie/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnaernie/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, RnaErnieModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieModel.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnaernie/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaErnieForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieForSequencePrediction.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnaernie/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaErnieForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieForTokenPrediction.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnaernie/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaErnieForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieForContactPrediction.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnaernie/#training-details","title":"Training Details","text":"

RNAErnie used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.

"},{"location":"models/rnaernie/#training-data","title":"Training Data","text":"

The RNAErnie model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.

RNAErnie used a subset of RNAcentral for pre-training. The subset contains 23 million sequences. RNAErnie preprocessed all tokens by replacing \u201cT\u201ds with \u201cU\u201ds.

Note that RnaTokenizer will convert \u201cT\u201ds to \u201cU\u201ds for you. You may disable this behaviour by passing replace_T_with_U=False.

"},{"location":"models/rnaernie/#training-procedure","title":"Training Procedure","text":""},{"location":"models/rnaernie/#preprocessing","title":"Preprocessing","text":"

RNAErnie used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
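
As a rough illustration of the BERT-style recipe referenced above, the sketch below selects about 15% of positions and, for each selected position, substitutes the mask token 80% of the time, a random nucleotide 10% of the time, and leaves the token unchanged otherwise. The 80/10/10 split is the one commonly described for BERT and is an assumption here; this card does not state the exact ratios RNAErnie used. The names bert_style_mask, MASK and ALPHABET are hypothetical and not part of multimolecule.

```python
import random

ALPHABET = 'ACGU'  # hypothetical nucleotide vocabulary for this sketch
MASK = '<mask>'    # mask token string, as in the fill-mask example


def bert_style_mask(tokens, mask_prob=0.15, seed=None):
    '''Return (masked_tokens, labels); labels are None where no loss is computed.'''
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token and skip it in the loss.
            masked.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK)                  # 80%: replace with the mask token
        elif roll < 0.9:
            masked.append(rng.choice(ALPHABET))  # 10%: replace with a random token
        else:
            masked.append(token)                 # 10%: leave unchanged
    return masked, labels


original = list('UAGCUUAUCAGACUGAUGUUGA')
masked, labels = bert_style_mask(original, seed=0)
```

Positions whose label is None are excluded from the loss, so the model is only scored on the selected positions.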

"},{"location":"models/rnaernie/#pretraining","title":"PreTraining","text":"

RNAErnie uses a special 3-stage training pipeline to pre-train the model, with a different masking strategy at each stage:

Base-level Masking: The masking applies to each nucleotide in the sequence.

Subsequence-level Masking: The masking applies to subsequences of 4-8 bp in the sequence.

Motif-level Masking: The model is trained on motif datasets.
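
Subsequence-level masking can be pictured as masking a contiguous span rather than isolated nucleotides. This is a minimal sketch, assuming a single span of 4 to 8 nucleotides placed uniformly at random; how RNAErnie actually samples span positions, and how many spans it masks per sequence, is not stated in this card. The helper mask_subsequence is hypothetical.

```python
import random


def mask_subsequence(sequence, min_len=4, max_len=8, mask='<mask>', seed=None):
    '''Mask one contiguous span of min_len..max_len nucleotides.'''
    rng = random.Random(seed)
    tokens = list(sequence)
    # random.randint is inclusive on both ends; clamp to the sequence length.
    span = min(rng.randint(min_len, max_len), len(tokens))
    start = rng.randint(0, len(tokens) - span)
    # Labels keep the original nucleotides only inside the masked span.
    labels = [None] * len(tokens)
    for i in range(start, start + span):
        labels[i] = tokens[i]
        tokens[i] = mask
    return tokens, labels


tokens, labels = mask_subsequence('UAGCUUAUCAGACUGAUGUUGA', seed=0)
```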

The model was trained on 4 NVIDIA V100 GPUs, each with 32 GiB of memory.

"},{"location":"models/rnaernie/#citation","title":"Citation","text":"

Citation information is not available for papers published in Closed Access / Author-Fee journals.

"},{"location":"models/rnaernie/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the RNAErnie paper for questions or comments on the paper/model.

"},{"location":"models/rnaernie/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie","title":"multimolecule.models.rnaernie","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default: None)

Alphabet to use for tokenization.

nmers (int, default: 1)

Size of kmer to tokenize.

codon (bool, default: False)

Whether to tokenize into codons.

replace_T_with_U (bool, default: True)

Whether to replace T with U.

do_upper_case (bool, default: True)

Whether to convert input to uppercase.
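
The difference between nmers and codon is visible in the examples for this class: with nmers=3 a 9-nucleotide input yields 7 overlapping tokens, while codon=True yields 3 non-overlapping ones and rejects inputs whose length is not a multiple of 3. The sketch below reproduces only that splitting logic in plain Python. The helpers split_kmers and split_codons are hypothetical, not the library implementation, and the mapping from pieces to token ids is omitted.

```python
def split_kmers(sequence, k=3):
    '''Overlapping k-mers with stride 1.'''
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def split_codons(sequence):
    '''Non-overlapping triplets; the length must be a multiple of 3.'''
    if len(sequence) % 3 != 0:
        raise ValueError(
            'length of input sequence must be a multiple of 3 '
            f'for codon tokenization, but got {len(sequence)}'
        )
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


sequence = 'uagcuuauc'.upper()   # mirrors the uppercase conversion
kmers = split_kmers(sequence)    # 7 overlapping pieces
codons = split_codons(sequence)  # 3 non-overlapping pieces
```

The codon pieces are the 1st, 4th and 7th of the overlapping 3-mers, which matches the ids in the class examples: 83, 49 and 22 appear in both outputs.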

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
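The three tokenization modes in `_tokenize` differ only in how the normalized string is sliced. The following standalone sketch mirrors that logic without depending on MultiMolecule; the function name `split_rna` is hypothetical and exists only for illustration.

```python
def split_rna(text, nmers=1, codon=False, replace_T_with_U=True, do_upper_case=True):
    # Hypothetical helper mirroring RnaTokenizer._tokenize; not part of MultiMolecule.
    if do_upper_case:
        text = text.upper()
    if replace_T_with_U:
        text = text.replace('T', 'U')
    if codon:
        # Codons are non-overlapping triplets, so the length must be divisible by 3.
        if len(text) % 3 != 0:
            raise ValueError(f'length must be a multiple of 3, but got {len(text)}')
        return [text[i:i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # k-mers overlap: a window of size nmers slides one base at a time.
        return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


print(split_rna('acgt'))                   # ['A', 'C', 'G', 'U']
print(split_rna('uagcuuauc', nmers=3))     # 7 overlapping 3-mers
print(split_rna('uagcuuauc', codon=True))  # ['UAG', 'CUU', 'AUC']
```

A sequence of length L yields L - nmers + 1 overlapping k-mers but only L / 3 codons, which is consistent with the 7 and 3 non-special ids in the examples above.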
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig","title":"RnaErnieConfig","text":"

Bases: PreTrainedConfig

This is the configuration class for an RnaErnieModel. It is used to instantiate an RnaErnie model according to the specified arguments, which define the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of the RnaErnie Bruce-ywj/rnaernie architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int): Vocabulary size of the RnaErnie model. Defines the number of different tokens that can be represented by the input_ids passed when calling RnaErnieModel. Default: 26

hidden_size (int): Dimensionality of the encoder layers and the pooler layer. Default: 768

num_hidden_layers (int): Number of hidden layers in the Transformer encoder. Default: 12

num_attention_heads (int): Number of attention heads for each attention layer in the Transformer encoder. Default: 12

intermediate_size (int): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder. Default: 3072

hidden_dropout (float): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. Default: 0.1

attention_dropout (float): The dropout ratio for the attention probabilities. Default: 0.1

max_position_embeddings (int): The maximum sequence length that this model might ever be used with. This is typically set to something large just in case (e.g., 512 or 1024 or 2048). Default: 513

initializer_range (float): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. Default: 0.02

layer_norm_eps (float): The epsilon used by the layer normalization layers. Default: 1e-12

Examples:

Python Console Session
>>> from multimolecule import RnaErnieModel, RnaErnieConfig\n>>> # Initializing a rnaernie multimolecule/rnaernie style configuration\n>>> configuration = RnaErnieConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnaernie style configuration\n>>> model = RnaErnieModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnaernie/configuration_rnaernie.py Python
class RnaErnieConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a\n    [`RnaErnieModel`][multimolecule.models.RnaErnieModel]. It is used to instantiate a RnaErnie model according to the\n    specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a\n    similar configuration to that of the RnaErnie [Bruce-ywj/rnaernie](https://github.com/Bruce-ywj/rnaernie)\n    architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the RnaErnie model. Defines the number of different tokens that can be represented by\n            the `inputs_ids` passed when calling [`RnaErnieModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n\n    Examples:\n        >>> from multimolecule import RnaErnieModel, RnaErnieConfig\n        >>> # Initializing a rnaernie multimolecule/rnaernie style configuration\n        >>> configuration = RnaErnieConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/rnaernie style configuration\n        >>> model = RnaErnieModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"rnaernie\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        hidden_size: int = 768,\n        num_hidden_layers: int = 12,\n        num_attention_heads: int = 12,\n        intermediate_size: int = 3072,\n        hidden_act: str = \"relu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 513,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"absolute\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.type_vocab_size = 2\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n        
self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
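A few quantities follow directly from the defaults above. The sketch below uses a plain dataclass, `ConfigSketch` (a hypothetical stand-in, not the real `RnaErnieConfig`), and assumes the usual Transformer convention that `hidden_size` is split evenly across attention heads; the source shown here does not state that explicitly.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    # Defaults copied from RnaErnieConfig.__init__; the class itself is illustrative.
    vocab_size: int = 26
    hidden_size: int = 768
    num_attention_heads: int = 12
    intermediate_size: int = 3072
    max_position_embeddings: int = 513

    @property
    def head_dim(self):
        # Assumption: each head gets an equal slice of the hidden dimension.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError('hidden_size must be divisible by num_attention_heads')
        return self.hidden_size // self.num_attention_heads

    @property
    def ffn_ratio(self):
        # Width of the feed-forward layer relative to the hidden size.
        return self.intermediate_size // self.hidden_size


cfg = ConfigSketch()
print(cfg.head_dim)   # 64
print(cfg.ffn_ratio)  # 4
```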
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForContactPrediction","title":"RnaErnieForContactPrediction","text":"

Bases: RnaErniePreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaErnieConfig, RnaErnieForContactPrediction, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
class RnaErnieForContactPrediction(RnaErniePreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaErnieConfig, RnaErnieForContactPrediction, RnaTokenizer\n        >>> config = RnaErnieConfig()\n        >>> model = RnaErnieForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaErnieConfig):\n        super().__init__(config)\n        self.rnaernie = RnaErnieModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnaernie(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
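The example passes `labels` of shape `(1, 5, 5)` for a 5-nucleotide input. One way to build such a matrix from a list of base pairs is sketched below in pure Python; `contact_map` is a hypothetical helper, and the symmetric 0/1 encoding is an assumption about the label format rather than something the source specifies.

```python
def contact_map(length, pairs):
    # Hypothetical helper: build a symmetric 0/1 matrix from 0-based base pairs.
    matrix = [[0] * length for _ in range(length)]
    for i, j in pairs:
        if not (0 <= i < length and 0 <= j < length):
            raise IndexError(f'pair ({i}, {j}) is out of range for length {length}')
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


# Two illustrative pairs for a 5-nucleotide sequence.
labels = contact_map(5, [(0, 4), (1, 3)])
for row in labels:
    print(row)
```

Wrapping the result as `torch.tensor([labels])` adds the batch dimension and gives the `(1, 5, 5)` shape used in the example.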
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForMaskedLM","title":"RnaErnieForMaskedLM","text":"

Bases: RnaErniePreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaErnieConfig, RnaErnieForMaskedLM, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
class RnaErnieForMaskedLM(RnaErniePreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaErnieConfig, RnaErnieForMaskedLM, RnaTokenizer\n        >>> config = RnaErnieConfig()\n        >>> model = RnaErnieForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.bias\", \"lm_head.decoder.weight\"]\n\n    def __init__(self, config: RnaErnieConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `RnaErnieForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.rnaernie = RnaErnieModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.lm_head.decoder = new_embeddings\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        
**kwargs,\n    ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnaernie(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
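The logits have shape `(1, 7, 26)` because the 5-nucleotide input gains a `<cls>` and an `<eos>` token, and each position is scored over the 26-token vocabulary. The sketch below shows one common way to prepare masked inputs, using the token ids from the `RnaTokenizer` example (where `<mask>` is 4) and assuming the default vocabulary. The helper `mask_positions` is hypothetical, and the use of -100 as the ignored-label value is an assumption borrowed from the default `ignore_index` of PyTorch's cross-entropy loss; the source does not confirm which value the MultiMolecule head expects.

```python
MASK_ID = 4          # id of <mask> in the RnaTokenizer example
IGNORE_INDEX = -100  # assumption: default ignore_index of torch.nn.CrossEntropyLoss


def mask_positions(input_ids, positions):
    # Hypothetical helper: mask the given positions and keep labels only there.
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos in positions:
        labels[pos] = input_ids[pos]
        masked[pos] = MASK_ID
    return masked, labels


# 'ACGUN' tokenizes to <cls> A C G U N <eos> under the default vocabulary.
ids = [1, 6, 7, 8, 9, 10, 2]
masked, labels = mask_positions(ids, [2, 4])
print(masked)  # [1, 6, 4, 8, 4, 10, 2]
print(labels)  # [-100, -100, 7, -100, 9, -100, -100]
```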
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForSequencePrediction","title":"RnaErnieForSequencePrediction","text":"

Bases: RnaErniePreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaErnieConfig, RnaErnieForSequencePrediction, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
class RnaErnieForSequencePrediction(RnaErniePreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaErnieConfig, RnaErnieForSequencePrediction, RnaTokenizer\n        >>> config = RnaErnieConfig()\n        >>> model = RnaErnieForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__(config)\n        self.rnaernie = RnaErnieModel(config)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnaernie(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
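When `return_dict` is false, `forward` returns a plain tuple whose layout depends on whether a loss was computed. The sketch below reproduces only that packing step with placeholder strings; `pack_outputs` is a hypothetical name and the string values stand in for tensors.

```python
def pack_outputs(logits, loss, backbone_outputs):
    # Mirrors the non-dict branch: drop the first two backbone entries
    # (sequence output and pooled output), prepend logits, then loss if present.
    output = (logits,) + tuple(backbone_outputs[2:])
    return ((loss,) + output) if loss is not None else output


backbone = ('sequence_output', 'pooled_output', 'hidden_states', 'attentions')
print(pack_outputs('logits', None, backbone))
print(pack_outputs('logits', 'loss', backbone))
```

Because the loss is only present when `labels` is passed, callers that index the tuple positionally need to account for the one-position shift.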
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForTokenPrediction","title":"RnaErnieForTokenPrediction","text":"

Bases: RnaErniePreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaErnieConfig, RnaErnieForTokenPrediction, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
class RnaErnieForTokenPrediction(RnaErniePreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaErnieConfig, RnaErnieForTokenPrediction, RnaTokenizer\n        >>> config = RnaErnieConfig()\n        >>> model = RnaErnieForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaErnieConfig):\n        super().__init__(config)\n        self.rnaernie = RnaErnieModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnaernie(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
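The example input has 7 token positions (`<cls>`, five nucleotides, `<eos>`), yet the logits have shape `(1, 5, 1)`, so the special-token positions are evidently dropped before the loss is computed. How `TokenPredictionHead` does this is not shown in the source; the sketch below is one plausible way to align per-position values with per-nucleotide labels, using a hypothetical helper `strip_special` and the token ids from the `RnaTokenizer` example.

```python
CLS_ID, EOS_ID = 1, 2  # ids of <cls> and <eos> in the RnaTokenizer example


def strip_special(input_ids, per_token_values):
    # Hypothetical helper: keep values only where the token is not <cls> or <eos>.
    return [v for tok, v in zip(input_ids, per_token_values) if tok not in (CLS_ID, EOS_ID)]


ids = [1, 6, 7, 8, 9, 10, 2]                  # <cls> A C G U N <eos>
scores = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.0]  # one value per input position
print(strip_special(ids, scores))  # [0.1, 0.2, 0.3, 0.4, 0.5]
```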
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel","title":"RnaErnieModel","text":"

Bases: RnaErniePreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaErnieConfig, RnaErnieModel, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
class RnaErnieModel(RnaErniePreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaErnieConfig, RnaErnieModel, RnaTokenizer\n        >>> config = RnaErnieConfig()\n        >>> model = RnaErnieModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 768])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 768])\n    \"\"\"\n\n    def __init__(self, config: RnaErnieConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n\n        self.embeddings = RnaErnieEmbeddings(config)\n        self.encoder = RnaErnieEncoder(config)\n\n        self.pooler = RnaErniePooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. 
Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, 
input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            
encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

Name Type Description Default Tensor | None

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

None Tensor | None

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.

None Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None

Tuple of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head).

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

None bool | None

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

None Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErniePreTrainedModel","title":"RnaErniePreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/rnaernie/modeling_rnaernie.py Python
class RnaErniePreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = RnaErnieConfig\n    base_model_prefix = \"rnaernie\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"RnaErnieLayer\", \"RnaErnieEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n    def _set_gradient_checkpointing(self, module, value=False):\n        if isinstance(module, RnaErnieEncoder):\n            module.gradient_checkpointing = value\n
"},{"location":"models/rnafm/","title":"RNA-FM","text":"

Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.

"},{"location":"models/rnafm/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of the Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions by Jiayang Chen, Zhihang Hu, Siqi Sun, et al.

The OFFICIAL repository of RNA-FM is at ml4bio/RNA-FM.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team releasing RNA-FM did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/rnafm/#model-details","title":"Model Details","text":"

RNA-FM is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/rnafm/#variations","title":"Variations","text":""},{"location":"models/rnafm/#model-specification","title":"Model Specification","text":"Variants Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens RNA-FM 12 640 20 5120 99.52 25.68 12.83 1024 mRNA-FM 1280 239.25 61.43 30.7"},{"location":"models/rnafm/#links","title":"Links","text":""},{"location":"models/rnafm/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/rnafm/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnafm\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.2752501964569092,\n  'token': 21,\n  'token_str': '.',\n  'sequence': 'G G U C. C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.22108642756938934,\n  'token': 23,\n  'token_str': '*',\n  'sequence': 'G G U C * C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.18201279640197754,\n  'token': 25,\n  'token_str': 'I',\n  'sequence': 'G G U C I C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.10875876247882843,\n  'token': 9,\n  'token_str': 'U',\n  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.08898332715034485,\n  'token': 6,\n  'token_str': 'A',\n  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnafm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnafm/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, RnaFmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmModel.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnafm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaFmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmForSequencePrediction.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnafm/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaFmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmForTokenPrediction.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnafm/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaFmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmForContactPrediction.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnafm/#training-details","title":"Training Details","text":"

RNA-FM used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.

"},{"location":"models/rnafm/#training-data","title":"Training Data","text":"

The RNA-FM model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.

RNA-FM applied CD-HIT (CD-HIT-EST) with a cut-off at 100% sequence identity to remove redundancy from RNAcentral. The final dataset contains 23.7 million non-redundant RNA sequences.

RNA-FM preprocessed all tokens by replacing \u201cU\u201ds with \u201cT\u201ds.

Note that during model conversion, \u201cT\u201d is replaced with \u201cU\u201d. RnaTokenizer converts \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False.
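The normalization described above can be sketched in plain Python. This is a minimal illustration that mirrors the two string steps visible in the RnaTokenizer._tokenize source on this page (uppercasing, then T-to-U replacement); the function name normalize_rna and its keyword names are hypothetical and are not part of the multimolecule API.

```python
def normalize_rna(sequence, replace_t_with_u=True, do_upper_case=True):
    # Hypothetical helper: mirrors the string handling in RnaTokenizer._tokenize.
    if do_upper_case:
        sequence = sequence.upper()
    if replace_t_with_u:
        # DNA-style input is mapped onto the RNA alphabet.
        sequence = sequence.replace('T', 'U')
    return sequence


print(normalize_rna('acgt'))                          # ACGU
print(normalize_rna('acgt', replace_t_with_u=False))  # ACGT
```

With replacement disabled, the T is kept as-is, which matches the tokenizer example on this page where 'acgt' then maps its last character to the unknown-token id.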

"},{"location":"models/rnafm/#training-procedure","title":"Training Procedure","text":""},{"location":"models/rnafm/#preprocessing","title":"Preprocessing","text":"

RNA-FM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
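As a rough illustration of BERT-style masking, the sketch below selects each token independently with probability 0.15, then replaces 80% of the selected tokens with a mask token, 10% with a random token, and leaves 10% unchanged. The 80/10/10 split is the one described for BERT; that RNA-FM used exactly these ratios is an assumption here, and per-token independent selection is a simplification. The function name bert_style_mask and all parameter names are hypothetical.

```python
import random


def bert_style_mask(tokens, vocab, mask_token='<mask>', mask_prob=0.15, seed=0):
    # Hypothetical sketch of BERT-style masking; not the RNA-FM training code.
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None marks positions excluded from the loss
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # token not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token         # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


sequence = list('UAGCUUAUCAGACUGAUGUUGA')
masked, labels = bert_style_mask(sequence, vocab=['A', 'C', 'G', 'U'])
print(masked)
print(labels)
```

Only positions with a non-None label would contribute to the MLM loss; all other positions pass through unchanged.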

"},{"location":"models/rnafm/#pretraining","title":"PreTraining","text":"

The model was trained on 8 NVIDIA A100 GPUs with 80 GiB of memory each.

"},{"location":"models/rnafm/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article{chen2022interpretable,\n  title={Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions},\n  author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},\n  journal={arXiv preprint arXiv:2204.00300},\n  year={2022}\n}\n
"},{"location":"models/rnafm/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the RNA-FM paper for questions or comments on the paper/model.

"},{"location":"models/rnafm/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm","title":"multimolecule.models.rnafm","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

Name Type Description Default Alphabet | str | List[str] | None

alphabet to use for tokenization.

None int

Size of kmer to tokenize.

1 bool

Whether to tokenize into codons.

False bool

Whether to replace T with U.

True bool

Whether to convert input to uppercase.

True

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
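The difference between nmers=3 and codon=True in the examples above comes down to how the sequence is split before ids are looked up: overlapping windows versus non-overlapping triplets. The sketch below reproduces only that splitting step in plain Python, following the slicing in the RnaTokenizer._tokenize source on this page; the function names split_kmers and split_codons are hypothetical, and no vocabulary lookup is performed.

```python
def split_kmers(sequence, k):
    # Overlapping windows with stride 1: a length-n input yields n - k + 1 tokens.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def split_codons(sequence):
    # Non-overlapping triplets: the input length must be a multiple of 3.
    if len(sequence) % 3 != 0:
        raise ValueError(
            f'length of input sequence must be a multiple of 3 for codon tokenization, but got {len(sequence)}'
        )
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


print(split_kmers('UAGCUUAUC', 3))
# ['UAG', 'AGC', 'GCU', 'CUU', 'UUA', 'UAU', 'AUC']
print(split_codons('UAGCUUAUC'))
# ['UAG', 'CUU', 'AUC']
```

This accounts for the token counts in the examples above: seven ids for nmers=3 and three ids for codon=True, each wrapped by the leading and trailing special-token ids.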
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig","title":"RnaFmConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of an RnaFmModel. It is used to instantiate an RNA-FM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RNA-FM ml4bio/RNA-FM architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

Name Type Description Default int | None

Vocabulary size of the RNA-FM model. Defines the number of different tokens that can be represented by the input_ids passed when calling RnaFmModel. Defaults to 26 if codon=False else 131.

None bool

Whether to use codon tokenization.

False int

Dimensionality of the encoder layers and the pooler layer.

640 int

Number of hidden layers in the Transformer encoder.

12 int

Number of attention heads for each attention layer in the Transformer encoder.

20 int

Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.

5120 float

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

0.1 float

The dropout ratio for the attention probabilities.

0.1 int

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

1026 float

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

0.02 float

The epsilon used by the layer normalization layers.

1e-12 str

Type of position embedding. Choose one of \"absolute\", \"relative_key\", \"relative_key_query\", \"rotary\". For positional embeddings use \"absolute\". For more information on \"relative_key\", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

'absolute' bool

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

False bool

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

True bool

Whether to apply layer normalization after embeddings but before the main stem of the network.

True bool

When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.

False

Examples:

Python Console Session
>>> from multimolecule import RnaFmModel, RnaFmConfig\n>>> # Initializing a RNA-FM multimolecule/rnafm style configuration\n>>> configuration = RnaFmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnafm style configuration\n>>> model = RnaFmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnafm/configuration_rnafm.py Python
class RnaFmConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`RnaFmModel`][multimolecule.models.RnaFmModel].\n    It is used to instantiate a RNA-FM model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the RNA-FM\n    [ml4bio/RNA-FM](https://github.com/ml4bio/RNA-FM) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the RNA-FM model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`RnaFmModel`].\n            Defaults to 25 if `codon=False` else 131.\n        codon:\n            Whether to use codon tokenization.\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        position_embedding_type:\n            Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\"`, `\"rotary\"`.\n            For absolute positional embeddings, use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n            For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n        is_decoder:\n            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n        use_cache:\n            Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n            relevant if `config.is_decoder=True`.\n        emb_layer_norm_before:\n            Whether to apply layer normalization after embeddings but before the main stem of the network.\n        token_dropout:\n            When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n    Examples:\n        >>> from multimolecule import RnaFmModel, RnaFmConfig\n        >>> # Initializing a RNA-FM multimolecule/rnafm style configuration\n        >>> configuration = RnaFmConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/rnafm style configuration\n        >>> model = RnaFmModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"rnafm\"\n\n    def __init__(\n        self,\n        vocab_size: int | None = None,\n        codon: bool = False,\n        hidden_size: int = 640,\n        num_hidden_layers: int = 12,\n        num_attention_heads: int = 20,\n        intermediate_size: int = 5120,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1026,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"absolute\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        emb_layer_norm_before: bool = True,\n        token_dropout: bool = False,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n        if vocab_size is None:\n            vocab_size = 131 if codon else 26\n        self.vocab_size = vocab_size\n        self.codon = codon\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        
self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.emb_layer_norm_before = emb_layer_norm_before\n        self.token_dropout = token_dropout\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
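The way the constructor above fills in `vocab_size` when it is left as `None` can be sketched on its own. This is a minimal standalone illustration, not the library code: `resolve_vocab_size` is a hypothetical helper name, and the values 26 and 131 are taken from the `__init__` body shown above.

```python
from __future__ import annotations


def resolve_vocab_size(vocab_size: int | None = None, codon: bool = False) -> int:
    # Hypothetical helper mirroring the default logic in RnaFmConfig.__init__:
    # an explicit value always wins; otherwise the size depends on whether
    # codon tokenization is enabled.
    if vocab_size is None:
        return 131 if codon else 26
    return vocab_size


print(resolve_vocab_size())                           # nucleotide-level default
print(resolve_vocab_size(codon=True))                 # codon-level default
print(resolve_vocab_size(vocab_size=32, codon=True))  # explicit value is kept
```

The nucleotide-level default of 26 agrees with the last dimension of the masked-language-model logits in the examples on this page.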
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(codon)","title":"codon","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(position_embedding_type)","title":"position_embedding_type","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(is_decoder)","title":"is_decoder","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(use_cache)","title":"use_cache","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(emb_layer_norm_before)","title":"emb_layer_norm_before","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(token_dropout)","title":"token_dropout","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmFo
rContactPrediction","title":"RnaFmForContactPrediction","text":"

Bases: RnaFmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaFmConfig, RnaFmForContactPrediction, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmForContactPrediction(RnaFmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaFmConfig, RnaFmForContactPrediction, RnaTokenizer\n        >>> config = RnaFmConfig()\n        >>> model = RnaFmForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaFmConfig):\n        super().__init__(config)\n        self.rnafm = RnaFmModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnafm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
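When `return_dict` is false, the `forward` method above returns a plain tuple instead of an output object. The sketch below reproduces only that packing step with placeholder values. `pack_tuple_output` is a hypothetical name, and the strings stand in for tensors.

```python
from __future__ import annotations

from typing import Any


def pack_tuple_output(logits: Any, base_outputs: tuple, loss: Any = None) -> tuple:
    # Skip the first two backbone entries (sequence output and pooled output)
    # and keep whatever follows, such as hidden states and attentions.
    output = (logits,) + tuple(base_outputs[2:])
    # The loss is prepended only when one was computed, i.e. labels were given.
    return ((loss,) + output) if loss is not None else output


# Placeholder backbone outputs; real entries would be tensors.
backbone = ('sequence_output', 'pooled_output', 'hidden_states', 'attentions')
print(pack_tuple_output('logits', backbone))
print(pack_tuple_output('logits', backbone, loss=0.5))
```

Because the position of `logits` shifts by one when a loss is present, callers that index into the tuple need to know whether labels were passed. The dictionary-style output avoids that ambiguity.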
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForMaskedLM","title":"RnaFmForMaskedLM","text":"

Bases: RnaFmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaFmConfig, RnaFmForMaskedLM, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmForMaskedLM(RnaFmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaFmConfig, RnaFmForMaskedLM, RnaTokenizer\n        >>> config = RnaFmConfig()\n        >>> model = RnaFmForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: RnaFmConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `RnaFmForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.rnafm = RnaFmModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnafm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForPreTraining","title":"RnaFmForPreTraining","text":"

Bases: RnaFmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaFmConfig, RnaFmForPreTraining, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"contact_map\"].shape\ntorch.Size([1, 5, 5, 1])\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmForPreTraining(RnaFmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaFmConfig, RnaFmForPreTraining, RnaTokenizer\n        >>> config = RnaFmConfig()\n        >>> model = RnaFmForPreTraining(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<AddBackward0>)\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"contact_map\"].shape\n        torch.Size([1, 5, 5, 1])\n    \"\"\"\n\n    _tied_weights_keys = [\n        \"lm_head.decoder.weight\",\n        \"lm_head.decoder.bias\",\n        \"pretrain.predictions.decoder.weight\",\n        \"pretrain.predictions.decoder.bias\",\n        \"pretrain.predictions_ss.decoder.weight\",\n        \"pretrain.predictions_ss.decoder.bias\",\n    ]\n\n    def __init__(self, config: RnaFmConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `RnaFmForPreTraining` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.rnafm = RnaFmModel(config, add_pooling_layer=False)\n        self.pretrain = RnaFmPreTrainingHeads(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_output_embeddings(self):\n        return self.pretrain.predictions.decoder\n\n    def set_output_embeddings(self, embeddings):\n        self.pretrain.predictions.decoder = embeddings\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor 
| None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels_mlm: Tensor | None = None,\n        labels_contact: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | RnaFmForPreTrainingOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnafm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        total_loss, logits, contact_map = self.pretrain(\n            outputs, attention_mask, input_ids, labels_mlm=labels_mlm, labels_contact=labels_contact\n        )\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return RnaFmForPreTrainingOutput(\n            loss=total_loss,\n            logits=logits,\n            contact_map=contact_map,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForSequencePrediction","title":"RnaFmForSequencePrediction","text":"

Bases: RnaFmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaFmConfig, RnaFmForSequencePrediction, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmForSequencePrediction(RnaFmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaFmConfig, RnaFmForSequencePrediction, RnaTokenizer\n        >>> config = RnaFmConfig()\n        >>> model = RnaFmForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaFmConfig):\n        super().__init__(config)\n        self.rnafm = RnaFmModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnafm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForTokenPrediction","title":"RnaFmForTokenPrediction","text":"

Bases: RnaFmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmForTokenPrediction(RnaFmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n        >>> config = RnaFmConfig()\n        >>> model = RnaFmForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaFmConfig):\n        super().__init__(config)\n        self.rnafm = RnaFmModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnafm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
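The output shapes in the examples on this page follow a simple pattern: for the 5-nucleotide input `ACGUN`, the masked-language-model logits have 7 positions, while token-level and contact-level logits have 5. The sketch below encodes that arithmetic. It assumes the two extra positions are special tokens added by the tokenizer, which is inferred from the printed shapes rather than stated in the source. `expected_shapes` and its parameters are hypothetical names.

```python
def expected_shapes(sequence, vocab_size=26, num_special_tokens=2, num_labels=1, batch_size=1):
    # Assumption: the tokenizer adds `num_special_tokens` positions around the
    # sequence, and the token and contact heads report per-nucleotide outputs only.
    length = len(sequence)
    padded_length = length + num_special_tokens
    return {
        'mlm_logits': (batch_size, padded_length, vocab_size),
        'token_logits': (batch_size, length, num_labels),
        'contact_logits': (batch_size, length, length, num_labels),
    }


for name, shape in expected_shapes('ACGUN').items():
    print(name, shape)
```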
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel","title":"RnaFmModel","text":"

Bases: RnaFmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaFmConfig, RnaFmModel, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 640])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 640])\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmModel(RnaFmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaFmConfig, RnaFmModel, RnaTokenizer\n        >>> config = RnaFmConfig()\n        >>> model = RnaFmModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 640])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 640])\n    \"\"\"\n\n    def __init__(self, config: RnaFmConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = RnaFmEmbeddings(config)\n        self.encoder = RnaFmEncoder(config)\n        self.pooler = RnaFmPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.num_hidden_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        
else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape 
[num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
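When no `attention_mask` is passed, the `forward` method above derives one by comparing `input_ids` against the padding token id, and falls back to an all-ones mask when no padding id is configured. The following sketch shows the same rule on nested Python lists instead of tensors. `default_attention_mask` is a hypothetical name, the token ids are made up for illustration, and the cached-length term used in the all-ones branch of the real code is ignored here.

```python
from __future__ import annotations


def default_attention_mask(input_ids: list[list[int]], pad_token_id: int | None) -> list[list[int]]:
    # No padding id configured: every position is attended to.
    if pad_token_id is None:
        return [[1] * len(row) for row in input_ids]
    # Otherwise a position is attended to exactly when it is not padding.
    return [[int(token != pad_token_id) for token in row] for row in input_ids]


# Made-up ids: 0 plays the role of the padding token.
batch = [[1, 6, 7, 8, 2], [1, 6, 2, 0, 0]]
print(default_attention_mask(batch, pad_token_id=0))
print(default_attention_mask(batch, pad_token_id=None))
```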
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

encoder_hidden_states (Tensor | None, default: None)

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask (Tensor | None, default: None)

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.

past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default: None)

Tuple of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head).

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

use_cache (bool | None, default: None)

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        attention_mask=attention_mask,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmPreTrainedModel","title":"RnaFmPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/rnafm/modeling_rnafm.py Python
class RnaFmPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = RnaFmConfig\n    base_model_prefix = \"rnafm\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"RnaFmLayer\", \"RnaFmEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/rnamsm/","title":"RNA-MSM","text":"

Pre-trained model on non-coding RNA (ncRNA) with multi (homologous) sequence alignment using a masked language modeling (MLM) objective.

"},{"location":"models/rnamsm/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of the Multiple sequence alignment-based RNA language model and its application to structural inference by Yikun Zhang, Mei Lang, Jiuhong Jiang, Zhiqiang Gao, et al.

The OFFICIAL repository of RNA-MSM is at yikunpku/RNA-MSM.

Caution

The MultiMolecule team is aware of a potential risk in reproducing the results of RNA-MSM.

The original implementation of RNA-MSM used a custom tokenizer that, consistent with MSA Transformer, does not append an <eos> token to the end of the input sequence. This should not affect the performance of the model in most cases, but it can lead to unexpected behavior in some cases.

Please set eos_token=None explicitly in the tokenizer if you want the exact behavior of the original implementation.

See more at issue #10
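
As a rough, library-free illustration of what the caution above means, the sketch below encodes a sequence with and without a trailing <eos> token. The token ids are taken from the RnaTokenizer doctest in this documentation (<cls>=1, <eos>=2, A=6, C=7, G=8, U=9); the encode helper and VOCAB table are hypothetical and are not the MultiMolecule implementation.

```python
# Toy illustration only: a hypothetical encoder, not the MultiMolecule tokenizer.
# Ids follow the RnaTokenizer doctest: <cls>=1, <eos>=2, A=6, C=7, G=8, U=9.
VOCAB = {'<cls>': 1, '<eos>': 2, 'A': 6, 'C': 7, 'G': 8, 'U': 9}


def encode(sequence, eos_token='<eos>'):
    """Prepend <cls>, map each nucleotide to its id, optionally append <eos>."""
    ids = [VOCAB['<cls>']] + [VOCAB[base] for base in sequence.upper()]
    if eos_token is not None:
        ids.append(VOCAB[eos_token])
    return ids


with_eos = encode('acgu')                      # default behaviour described in this documentation
without_eos = encode('acgu', eos_token=None)   # behaviour of the original RNA-MSM tokenizer
print(with_eos)
print(without_eos)
```

The two encodings differ only in the final id, which is why most downstream results are unaffected while position-sensitive comparisons with the original implementation can differ.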

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team releasing RNA-MSM did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/rnamsm/#model-details","title":"Model Details","text":"

RNA-MSM is a BERT-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/rnamsm/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 10 768 12 3072 95.92 21.66 10.57 1024"},{"location":"models/rnamsm/#links","title":"Links","text":""},{"location":"models/rnamsm/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/rnamsm/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnamsm\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.25111356377601624,\n  'token': 9,\n  'token_str': 'U',\n  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.1200353354215622,\n  'token': 14,\n  'token_str': 'W',\n  'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.10132723301649094,\n  'token': 15,\n  'token_str': 'K',\n  'sequence': 'G G U C K C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.08383019268512726,\n  'token': 18,\n  'token_str': 'D',\n  'sequence': 'G G U C D C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.05737845227122307,\n  'token': 6,\n  'token_str': 'A',\n  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnamsm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnamsm/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, RnaMsmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmModel.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnamsm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaMsmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmForSequencePrediction.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnamsm/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaMsmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmForTokenPrediction.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnamsm/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, RnaMsmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmForContactPrediction.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
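
The example above uses random labels of shape (len(text), len(text)). As a minimal sketch of how a real label could be assembled, the snippet below builds a binary contact matrix from a list of base-pair indices. The contact_map helper and the pairs are hypothetical and carry no biological meaning; treating contacts as symmetric is an assumption made here, not something this model card specifies.

```python
def contact_map(length, pairs):
    """Build a length x length 0/1 matrix with a 1 wherever two positions pair."""
    contacts = [[0] * length for _ in range(length)]
    for i, j in pairs:
        contacts[i][j] = 1
        contacts[j][i] = 1  # assumption: contacts are symmetric
    return contacts


sequence = 'UAGCUUAUCAGACUGAUGUUGA'
pairs = [(0, 21), (1, 20), (2, 19)]  # hypothetical pairs, for illustration only
labels = contact_map(len(sequence), pairs)
print(len(labels), len(labels[0]))
```

The nested list can then be wrapped with torch.tensor to obtain a label with the same shape as the random one used above.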
"},{"location":"models/rnamsm/#training-details","title":"Training Details","text":"

RNA-MSM used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the tokens in the input are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.

"},{"location":"models/rnamsm/#training-data","title":"Training Data","text":"

The RNA-MSM model was pre-trained on Rfam. The Rfam database is a collection of RNA sequence families of structural RNAs including non-coding RNA genes as well as cis-regulatory elements. RNA-MSM used Rfam 14.7 which contains 4,069 RNA families.

To avoid potential overfitting in structural inference, RNA-MSM excluded families with experimentally determined structures, such as ribosomal RNAs, transfer RNAs, and small nuclear RNAs. The final dataset contains 3,932 RNA families. The median number of MSA sequences generated by RNAcmap3 for these families is 2,184.

To increase the number of homologous sequences, RNA-MSM used an automatic pipeline, RNAcmap3, for homolog search and sequence alignment. RNAcmap3 is a pipeline that combines the BLAST-N, INFERNAL, Easel, RNAfold and evolutionary coupling tools to generate homologous sequences.

RNA-MSM preprocessed all tokens by replacing \u201cT\u201ds with \u201cU\u201ds and substituting \u201cR\u201d, \u201cY\u201d, \u201cK\u201d, \u201cM\u201d, \u201cS\u201d, \u201cW\u201d, \u201cB\u201d, \u201cD\u201d, \u201cH\u201d, \u201cV\u201d, \u201cN\u201d with \u201cX\u201d.

Note that RnaTokenizer will convert \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False. RnaTokenizer does not perform other substitutions.
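
Because RnaTokenizer only handles the T to U conversion, reproducing the original preprocessing would require mapping the ambiguity codes to X yourself. The following is a minimal sketch of that substitution as described above; the preprocess helper is hypothetical, and upper-casing the input first is an assumption rather than a documented step of the RNA-MSM pipeline.

```python
# Ambiguity codes that RNA-MSM replaced with X, as listed in the text above.
AMBIGUOUS = set('RYKMSWBDHVN')


def preprocess(sequence):
    """Replace T with U, then map every ambiguity code to X."""
    sequence = sequence.upper().replace('T', 'U')  # upper-casing is an assumption
    return ''.join('X' if base in AMBIGUOUS else base for base in sequence)


print(preprocess('ACGTNRY'))
```

The T to U replacement runs before the ambiguity mapping, so a T is never mistaken for an ambiguity code.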

"},{"location":"models/rnamsm/#training-procedure","title":"Training Procedure","text":""},{"location":"models/rnamsm/#preprocessing","title":"Preprocessing","text":"

RNA-MSM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
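
For orientation, here is a minimal sketch of the masking scheme published for BERT: about 15% of tokens are selected, and of those 80% become the mask token, 10% become a random token, and 10% are left unchanged. Whether RNA-MSM used exactly these ratios is an assumption here. The mask_tokens helper is hypothetical; the ids come from the RnaTokenizer doctest in this documentation, and -100 is the label value that PyTorch cross-entropy ignores by default.

```python
import random

MASK_ID = 4                        # <mask> id in the RnaTokenizer doctest
SPECIAL_IDS = {0, 1, 2, 3, 4, 5}   # <pad>, <cls>, <eos>, <unk>, <mask>, <null>
NUCLEOTIDE_IDS = [6, 7, 8, 9]      # A, C, G, U
IGNORE_INDEX = -100                # positions with this label do not contribute to the loss


def mask_tokens(input_ids, mask_prob=0.15, seed=0):
    """Return (masked_ids, labels) following the BERT-style 80/10/10 scheme."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        if token in SPECIAL_IDS or rng.random() >= mask_prob:
            continue  # special tokens and unselected positions stay untouched
        labels[i] = token  # the model has to recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                      # 80%: replace with <mask>
        elif roll < 0.9:
            masked[i] = rng.choice(NUCLEOTIDE_IDS)   # 10%: replace with a random nucleotide
        # remaining 10%: keep the original token
    return masked, labels


input_ids = [1, 6, 7, 8, 9, 6, 7, 8, 9, 2]
masked, labels = mask_tokens(input_ids)
print(masked)
print(labels)
```

Only positions whose label is not -100 are scored, so the model is trained solely on the selected tokens.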

"},{"location":"models/rnamsm/#pretraining","title":"PreTraining","text":"

The model was trained on 8 NVIDIA V100 GPUs, each with 32 GiB of memory.

"},{"location":"models/rnamsm/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article{zhang2023multiple,\n    author = {Zhang, Yikun and Lang, Mei and Jiang, Jiuhong and Gao, Zhiqiang and Xu, Fan and Litfin, Thomas and Chen, Ke and Singh, Jaswinder and Huang, Xiansong and Song, Guoli and Tian, Yonghong and Zhan, Jian and Chen, Jie and Zhou, Yaoqi},\n    title = \"{Multiple sequence alignment-based RNA language model and its application to structural inference}\",\n    journal = {Nucleic Acids Research},\n    volume = {52},\n    number = {1},\n    pages = {e3-e3},\n    year = {2023},\n    month = {11},\n    abstract = \"{Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because\u00a0unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, as it can provide significantly more homologous sequences than manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding with its embedding in base pair and solvent-accessible surface area prediction. 
We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.}\",\n    issn = {0305-1048},\n    doi = {10.1093/nar/gkad1031},\n    url = {https://doi.org/10.1093/nar/gkad1031},\n    eprint = {https://academic.oup.com/nar/article-pdf/52/1/e3/55443207/gkad1031.pdf},\n}\n
"},{"location":"models/rnamsm/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the RNA-MSM paper for questions or comments on the paper/model.

"},{"location":"models/rnamsm/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm","title":"multimolecule.models.rnamsm","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default: None)

Alphabet to use for tokenization.

nmers (int, default: 1)

Size of kmer to tokenize.

codon (bool, default: False)

Whether to tokenize into codons.

replace_T_with_U (bool, default: True)

Whether to replace T with U.

do_upper_case (bool, default: True)

Whether to convert input to uppercase.

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
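
The two splitting modes in _tokenize above differ in stride: nmers slides a window one nucleotide at a time, while codon cuts the sequence into non-overlapping triplets. The standalone sketch below mirrors that logic; the function names sliding_kmers and codons are illustrative and not part of the library.

```python
def sliding_kmers(text, k):
    """Overlapping k-mers with stride 1, as in the nmers branch."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def codons(text):
    """Non-overlapping triplets, as in the codon branch."""
    if len(text) % 3 != 0:
        raise ValueError(
            f'length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}'
        )
    return [text[i:i + 3] for i in range(0, len(text), 3)]


print(sliding_kmers('UAGCUUAUC', 3))
print(codons('UAGCUUAUC'))
```

For a 9-nucleotide input this yields 7 overlapping 3-mers and 3 codons, which matches the number of ids between the <cls> and <eos> tokens in the doctest examples.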
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig","title":"RnaMsmConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of an RnaMsmModel. It is used to instantiate an RnaMsm model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the yikunpku/RNA-MSM architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int, default: 26)

Vocabulary size of the RnaMsm model. Defines the number of different tokens that can be represented by the input_ids passed when calling RnaMsmModel.

hidden_size (int, default: 768)

Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, default: 10)

Number of hidden layers in the Transformer encoder.

num_attention_heads (int, default: 12)

Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, default: 3072)

Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.

hidden_dropout (float, default: 0.1)

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, default: 0.1)

The dropout ratio for the attention probabilities.

max_position_embeddings (int, default: 1024)

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, default: 0.02)

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, default: 1e-12)

The epsilon used by the layer normalization layers.

Examples:

Python Console Session
>>> from multimolecule import RnaMsmModel, RnaMsmConfig\n>>> # Initializing a RNA-MSM multimolecule/rnamsm style configuration\n>>> configuration = RnaMsmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnamsm style configuration\n>>> model = RnaMsmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnamsm/configuration_rnamsm.py Python
class RnaMsmConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`RnaMsmModel`][multimolecule.models.RnaMsmModel].\n    It is used to instantiate a RnaMsm model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaMsm\n    [yikunpku/RNA-MSM](https://github.com/yikunpku/RNA-MSM) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the RnaMsm model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`RnaMsmModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n\n    Examples:\n        >>> from multimolecule import RnaMsmModel, RnaMsmConfig\n        >>> # Initializing a RNA-MSM multimolecule/rnamsm style configuration\n        >>> configuration = RnaMsmConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/rnamsm style configuration\n        >>> model = RnaMsmModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"rnamsm\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        hidden_size: int = 768,\n        num_hidden_layers: int = 10,\n        num_attention_heads: int = 12,\n        intermediate_size: int = 3072,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1024,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"absolute\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        max_tokens_per_msa: int = 2**14,\n        layer_type: str = \"standard\",\n        attention_type: str = \"standard\",\n        embed_positions_msa: bool = True,\n        attention_bias: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = 
intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.max_tokens_per_msa = max_tokens_per_msa\n        self.layer_type = layer_type\n        self.attention_type = attention_type\n        self.embed_positions_msa = embed_positions_msa\n        self.attention_bias = attention_bias\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForContactPrediction","title":"RnaMsmForContactPrediction","text":"

Bases: RnaMsmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaMsmConfig, RnaMsmForContactPrediction, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmForContactPrediction(RnaMsmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaMsmConfig, RnaMsmForContactPrediction, RnaTokenizer\n        >>> config = RnaMsmConfig()\n        >>> model = RnaMsmForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaMsmConfig):\n        super().__init__(config)\n        self.rnamsm = RnaMsmModel(config, add_pooling_layer=True)\n        head_config = HeadConfig(output_name=\"row_attentions\")\n        self.contact_head = ContactPredictionHead(config, head_config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaMsmContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnamsm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return RnaMsmContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            col_attentions=outputs.col_attentions,\n            row_attentions=outputs.row_attentions,\n        )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForMaskedLM","title":"RnaMsmForMaskedLM","text":"

Bases: RnaMsmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaMsmConfig, RnaMsmForMaskedLM, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmForMaskedLM(RnaMsmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaMsmConfig, RnaMsmForMaskedLM, RnaTokenizer\n        >>> config = RnaMsmConfig()\n        >>> model = RnaMsmForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: RnaMsmConfig):\n        super().__init__(config)\n        self.rnamsm = RnaMsmModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config, weight=self.rnamsm.embeddings.word_embeddings.weight)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaMsmForMaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnamsm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return RnaMsmForMaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            col_attentions=outputs.col_attentions,\n            row_attentions=outputs.row_attentions,\n        )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForPreTraining","title":"RnaMsmForPreTraining","text":"

Bases: RnaMsmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaMsmConfig, RnaMsmForPreTraining, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"contact_map\"].shape\ntorch.Size([1, 5, 5, 1])\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmForPreTraining(RnaMsmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaMsmConfig, RnaMsmForPreTraining, RnaTokenizer\n        >>> config = RnaMsmConfig()\n        >>> model = RnaMsmForPreTraining(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<AddBackward0>)\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"contact_map\"].shape\n        torch.Size([1, 5, 5, 1])\n    \"\"\"\n\n    _tied_weights_keys = [\n        \"lm_head.decoder.weight\",\n        \"lm_head.decoder.bias\",\n        \"pretrain.predictions.decoder.weight\",\n        \"pretrain.predictions.decoder.bias\",\n        \"pretrain.predictions_ss.decoder.weight\",\n        \"pretrain.predictions_ss.decoder.bias\",\n    ]\n\n    def __init__(self, config: RnaMsmConfig):\n        super().__init__(config)\n        self.rnamsm = RnaMsmModel(config, add_pooling_layer=False)\n        self.pretrain = RnaMsmPreTrainingHeads(config, weight=self.rnamsm.embeddings.word_embeddings.weight)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels_mlm: Tensor | None = None,\n        labels_contact: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaMsmForPreTrainingOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnamsm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        total_loss, logits, contact_map = self.pretrain(\n            outputs, attention_mask, input_ids, labels_mlm=labels_mlm, labels_contact=labels_contact\n        )\n\n        if not return_dict:\n            output = (logits, contact_map) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return RnaMsmForPreTrainingOutput(\n            loss=total_loss,\n            logits=logits,\n            contact_map=contact_map,\n            hidden_states=outputs.hidden_states,\n            col_attentions=outputs.col_attentions,\n            row_attentions=outputs.row_attentions,\n        )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForSequencePrediction","title":"RnaMsmForSequencePrediction","text":"

Bases: RnaMsmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaMsmConfig, RnaMsmForSequencePrediction, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmForSequencePrediction(RnaMsmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaMsmConfig, RnaMsmForSequencePrediction, RnaTokenizer\n        >>> config = RnaMsmConfig()\n        >>> model = RnaMsmForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaMsmConfig):\n        super().__init__(config)\n        self.rnamsm = RnaMsmModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool = False,\n        output_hidden_states: bool = False,\n        return_dict: bool = True,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaMsmSequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnamsm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return RnaMsmSequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            col_attentions=outputs.col_attentions,\n            row_attentions=outputs.row_attentions,\n        )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForTokenPrediction","title":"RnaMsmForTokenPrediction","text":"

Bases: RnaMsmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import RnaMsmConfig, RnaMsmForTokenPrediction, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmForTokenPrediction(RnaMsmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaMsmConfig, RnaMsmForTokenPrediction, RnaTokenizer\n        >>> config = RnaMsmConfig()\n        >>> model = RnaMsmForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: RnaMsmConfig):\n        super().__init__(config)\n        self.rnamsm = RnaMsmModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool = False,\n        output_hidden_states: bool = False,\n        return_dict: bool = True,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaMsmTokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.rnamsm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return RnaMsmTokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            col_attentions=outputs.col_attentions,\n            row_attentions=outputs.row_attentions,\n        )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmModel","title":"RnaMsmModel","text":"

Bases: RnaMsmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import RnaMsmConfig, RnaMsmModel, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmModel(RnaMsmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import RnaMsmConfig, RnaMsmModel, RnaTokenizer\n        >>> config = RnaMsmConfig()\n        >>> model = RnaMsmModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 768])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 768])\n    \"\"\"\n\n    def __init__(self, config: RnaMsmConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = RnaMsmEmbeddings(config)\n        self.encoder = RnaMsmEncoder(config)\n        self.pooler = RnaMsmPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| RnaMsmModelOutputWithPooling:\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        elif inputs_embeds is None:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id) if self.pad_token_id is not None else torch.ones_like(input_ids)\n            )\n\n        unsqueeze_input = input_ids.ndim == 2\n        if unsqueeze_input:\n            input_ids = input_ids.unsqueeze(1)\n        if attention_mask.ndim == 2:\n            attention_mask = attention_mask.unsqueeze(1)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            inputs_embeds=inputs_embeds,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            
attention_mask=attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        if unsqueeze_input:\n            sequence_output = sequence_output.squeeze(1)\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return RnaMsmModelOutputWithPooling(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            hidden_states=encoder_outputs.hidden_states,\n            col_attentions=encoder_outputs.col_attentions,\n            row_attentions=encoder_outputs.row_attentions,\n        )\n
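The forward pass above builds a default attention mask from the padding token when none is given, and it appears to treat a 2-D batch of single sequences as MSAs of depth 1 by adding an axis before the encoder and removing it afterwards. The following is a minimal sketch of those two steps using plain Python lists instead of tensors. The helper names, the example token ids, and the choice of 0 as the padding id are assumptions made for illustration and are not part of multimolecule.

```python
def default_attention_mask(input_ids, pad_token_id=None):
    # Stand-in for input_ids.ne(pad_token_id); 1 and 0 play the role of True and False.
    # When no pad id is configured, every position is attended to (like torch.ones_like).
    if pad_token_id is None:
        return [[1 for _ in row] for row in input_ids]
    return [[int(tok != pad_token_id) for tok in row] for row in input_ids]


def add_msa_axis(batch):
    # Stand-in for tensor.unsqueeze(1): (batch, seq_len) -> (batch, 1, seq_len).
    return [[row] for row in batch]


# Hypothetical token ids for two sequences; the second one is padded with 0.
input_ids = [[1, 6, 7, 8, 2], [1, 6, 2, 0, 0]]
mask = default_attention_mask(input_ids, pad_token_id=0)
msa_ids = add_msa_axis(input_ids)
msa_mask = add_msa_axis(mask)
```

In this sketch each sequence becomes an alignment with a single row, which is why the real model can squeeze the extra axis out of the encoder output before pooling.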
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmPreTrainedModel","title":"RnaMsmPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/rnamsm/modeling_rnamsm.py Python
class RnaMsmPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = RnaMsmConfig\n    base_model_prefix = \"rnamsm\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"RnaMsmLayer\", \"RnaMsmAxialLayer\", \"RnaMsmPkmLayer\", \"RnaMsmEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm) and module.elementwise_affine:\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/splicebert/","title":"SpliceBERT","text":"

Pre-trained model on messenger RNA precursor (pre-mRNA) using a masked language modeling (MLM) objective.

"},{"location":"models/splicebert/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction by Ken Chen et al.

The OFFICIAL repository of SpliceBERT is at chenkenbio/SpliceBERT.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team that released SpliceBERT did not write a model card for it, so this model card was written by the MultiMolecule team.

"},{"location":"models/splicebert/#model-details","title":"Model Details","text":"

SpliceBERT is a BERT-style model pre-trained on a large corpus of messenger RNA precursor sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/splicebert/#variations","title":"Variations","text":""},{"location":"models/splicebert/#model-specification","title":"Model Specification","text":"Variants Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens splicebert 6 512 16 2048 19.72 5.04 2.52 1024 splicebert.510 19.45 510 splicebert-human.510"},{"location":"models/splicebert/#links","title":"Links","text":""},{"location":"models/splicebert/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/splicebert/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/splicebert\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.340412974357605,\n  'token': 9,\n  'token_str': 'U',\n  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.13882005214691162,\n  'token': 12,\n  'token_str': 'Y',\n  'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.056610625237226486,\n  'token': 7,\n  'token_str': 'C',\n  'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.05455885827541351,\n  'token': 19,\n  'token_str': 'H',\n  'sequence': 'G G U C H C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.05356108024716377,\n  'token': 14,\n  'token_str': 'W',\n  'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/splicebert/#downstream-use","title":"Downstream Use","text":""},{"location":"models/splicebert/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, SpliceBertModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertModel.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/splicebert/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, SpliceBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForSequencePrediction.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/splicebert/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, SpliceBertForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForTokenPrediction.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/splicebert/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, SpliceBertForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForContactPrediction.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/splicebert/#training-details","title":"Training Details","text":"

SpliceBERT used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
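
The masking step described above can be sketched in plain Python. This is a minimal illustration rather than the SpliceBERT training code: the helper name `mask_tokens`, the fixed seed, and the choice to always substitute `<mask>` at the selected positions are assumptions made for the example.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token='<mask>', seed=0):
    '''Return (masked_tokens, labels) for one sequence.

    labels holds the original token at masked positions and None elsewhere,
    so a loss would only be computed where the model must predict.
    '''
    rng = random.Random(seed)
    # Mask at least one token so short sequences still yield a training signal.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


sequence = list('UAGCUUAUCAGACUGAUGUUGA')
masked, labels = mask_tokens(sequence)
print(' '.join(masked))
```

The labels list keeps the original nucleotide only where a mask was placed, which mirrors the idea that the model is scored on the masked positions alone.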

"},{"location":"models/splicebert/#training-data","title":"Training Data","text":"

The SpliceBERT model was pre-trained on messenger RNA precursor (pre-mRNA) sequences from the UCSC Genome Browser. The UCSC Genome Browser provides visualization, analysis, and download of comprehensive vertebrate genome data with aligned annotation tracks (known genes, predicted genes, ESTs, mRNAs, CpG islands, etc.).

SpliceBERT collected reference genomes and gene annotations from the UCSC Genome Browser for 72 vertebrate species. It applied bedtools getfasta to extract pre-mRNA sequences from the reference genomes based on the gene annotations. The pre-mRNA sequences are then used to pre-train SpliceBERT. The pre-training data contains 2 million pre-mRNA sequences with a total length of 65 billion nucleotides.

Note: RnaTokenizer converts \u201cT\u201ds to \u201cU\u201ds for you; you can disable this behaviour by passing replace_T_with_U=False.

"},{"location":"models/splicebert/#training-procedure","title":"Training Procedure","text":""},{"location":"models/splicebert/#preprocessing","title":"Preprocessing","text":"

SpliceBERT used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:

"},{"location":"models/splicebert/#pretraining","title":"PreTraining","text":"

The model was trained on 8 NVIDIA V100 GPUs.

SpliceBERT was trained in a two-stage process:

  1. Pre-train with sequences of a fixed length of 510 nucleotides.
  2. Pre-train with sequences of a variable length between 64 and 1024 nucleotides.

The intermediate model after the first stage is available as multimolecule/splicebert.510.
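
The two-stage schedule above amounts to changing how crop lengths are chosen. The sketch below assumes crops start at a uniformly random offset and, in stage two, have a uniformly random length; the source does not describe the actual sampling scheme, and `sample_crop` is a hypothetical helper.

```python
import random


def sample_crop(sequence, stage, rng):
    '''Crop a pre-mRNA sequence according to the training stage.

    Stage 1 uses a fixed length of 510 nucleotides; stage 2 draws a length
    between 64 and 1024 (uniform sampling is an assumption of this sketch).
    '''
    if stage == 1:
        length = 510
    elif stage == 2:
        length = rng.randint(64, 1024)  # inclusive on both ends
    else:
        raise ValueError(f'unknown stage: {stage}')
    length = min(length, len(sequence))  # sequences shorter than the crop are kept whole
    start = rng.randint(0, len(sequence) - length)
    return sequence[start:start + length]


rng = random.Random(0)
pre_mrna = ''.join(rng.choice('ACGU') for _ in range(5000))
stage_one = sample_crop(pre_mrna, stage=1, rng=rng)
stage_two = [sample_crop(pre_mrna, stage=2, rng=rng) for _ in range(100)]
print(len(stage_one), min(map(len, stage_two)), max(map(len, stage_two)))
```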

The SpliceBERT authors also pre-trained a model on human data only to validate the contribution of multi-species pre-training. The intermediate model after the first stage of that run is available as multimolecule/splicebert-human.510.

"},{"location":"models/splicebert/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article {chen2023self,\n    author = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},\n    title = {Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction},\n    elocation-id = {2023.01.31.526427},\n    year = {2023},\n    doi = {10.1101/2023.01.31.526427},\n    publisher = {Cold Spring Harbor Laboratory},\n    abstract = {RNA splicing is an important post-transcriptional process of gene expression in eukaryotic cells. Predicting RNA splicing from primary sequences can facilitate the interpretation of genomic variants. In this study, we developed a novel self-supervised pre-trained language model, SpliceBERT, to improve sequence-based RNA splicing prediction. Pre-training on pre-mRNA sequences from vertebrates enables SpliceBERT to capture evolutionary conservation information and characterize the unique property of splice sites. SpliceBERT also improves zero-shot prediction of variant effects on splicing by considering sequence context information, and achieves superior performance for predicting branchpoint in the human genome and splice sites across species. Our study highlighted the importance of pre-training genomic language models on a diverse range of species and suggested that pre-trained language models were promising for deciphering the sequence logic of RNA splicing.},\n    URL = {https://www.biorxiv.org/content/early/2023/05/09/2023.01.31.526427},\n    eprint = {https://www.biorxiv.org/content/early/2023/05/09/2023.01.31.526427.full.pdf},\n    journal = {bioRxiv}\n}\n
"},{"location":"models/splicebert/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the SpliceBERT paper for questions or comments on the paper/model.

"},{"location":"models/splicebert/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert","title":"multimolecule.models.splicebert","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None): alphabet to use for tokenization. Default: None

nmers (int): Size of kmer to tokenize. Default: 1

codon (bool): Whether to tokenize into codons. Default: False

replace_T_with_U (bool): Whether to replace T with U. Default: True

do_upper_case (bool): Whether to convert input to uppercase. Default: True

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
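The splitting logic in `_tokenize` above can be exercised without loading a vocabulary. The function below is a standalone restatement of that logic for illustration only; `split_rna` is a hypothetical name, and it returns string tokens rather than the input ids the real tokenizer produces.

```python
def split_rna(text, nmers=1, codon=False, replace_T_with_U=True, do_upper_case=True):
    '''Mirror the normalisation and splitting steps of RnaTokenizer._tokenize.'''
    if do_upper_case:
        text = text.upper()
    if replace_T_with_U:
        text = text.replace('T', 'U')
    if codon:
        # Codons are non-overlapping triplets, so the length must divide by 3.
        if len(text) % 3 != 0:
            raise ValueError(
                f'length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}'
            )
        return [text[i:i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # k-mers overlap: a window of size nmers slides one nucleotide at a time.
        return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


print(split_rna('acgt'))                   # ['A', 'C', 'G', 'U']
print(split_rna('uagcuuauc', nmers=3))     # 7 overlapping 3-mers
print(split_rna('uagcuuauc', codon=True))  # ['UAG', 'CUU', 'AUC']
```

Overlapping 3-mers give 7 tokens for a 9-nucleotide input while codon mode gives 3, which matches the number of non-special ids in the documented examples.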
"},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig","title":"SpliceBertConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a SpliceBertModel. It is used to instantiate a SpliceBert model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SpliceBert biomed-AI/SpliceBERT architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int): Vocabulary size of the SpliceBert model. Defines the number of different tokens that can be represented by the input_ids passed when calling SpliceBertModel. Default: 26

hidden_size (int): Dimensionality of the encoder layers and the pooler layer. Default: 512

num_hidden_layers (int): Number of hidden layers in the Transformer encoder. Default: 6

num_attention_heads (int): Number of attention heads for each attention layer in the Transformer encoder. Default: 16

intermediate_size (int): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder. Default: 2048

hidden_dropout (float): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. Default: 0.1

attention_dropout (float): The dropout ratio for the attention probabilities. Default: 0.1

max_position_embeddings (int): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Default: 1026

initializer_range (float): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. Default: 0.02

layer_norm_eps (float): The epsilon used by the layer normalization layers. Default: 1e-12
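
A few quantities follow directly from the default values in this parameter list. The sketch below only does arithmetic on those numbers; reading the two extra positions in max_position_embeddings as room for the `<cls>` and `<eos>` tokens is an inference from the tokenizer examples, not something the source states.

```python
# Default values copied from the SpliceBertConfig signature.
defaults = {
    'hidden_size': 512,
    'num_attention_heads': 16,
    'intermediate_size': 2048,
    'max_position_embeddings': 1026,
}

# Each attention head works on an equal slice of the hidden vector.
head_dim, remainder = divmod(defaults['hidden_size'], defaults['num_attention_heads'])

# Feed-forward expansion factor relative to the hidden size.
ffn_ratio = defaults['intermediate_size'] // defaults['hidden_size']

# Assumption: two positions are reserved for the <cls> and <eos> tokens.
max_nucleotides = defaults['max_position_embeddings'] - 2

print(head_dim, ffn_ratio, max_nucleotides)
```

When overriding hidden_size or num_attention_heads, a non-zero remainder here is a quick sign that the hidden vector cannot be split evenly across heads.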

Examples:

Python Console Session
>>> from multimolecule import SpliceBertModel, SpliceBertConfig\n>>> # Initializing a SpliceBERT multimolecule/splicebert style configuration\n>>> configuration = SpliceBertConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/splicebert style configuration\n>>> model = SpliceBertModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/splicebert/configuration_splicebert.py Python
class SpliceBertConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a\n    [`SpliceBertModel`][multimolecule.models.SpliceBertModel]. It is used to instantiate a SpliceBert model according\n    to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will\n    yield a similar configuration to that of the SpliceBert\n    [biomed-AI/SpliceBERT](https://github.com/biomed-AI/SpliceBERT) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the SpliceBert model. Defines the number of different tokens that can be represented by\n            the `inputs_ids` passed when calling [`SpliceBertModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n\n    Examples:\n        >>> from multimolecule import SpliceBertModel, SpliceBertConfig\n        >>> # Initializing a SpliceBERT multimolecule/splicebert style configuration\n        >>> configuration = SpliceBertConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/splicebert style configuration\n        >>> model = SpliceBertModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"splicebert\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        hidden_size: int = 512,\n        num_hidden_layers: int = 6,\n        num_attention_heads: int = 16,\n        intermediate_size: int = 2048,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1026,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"absolute\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.type_vocab_size = 2\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = hidden_act\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n   
     self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForContactPrediction","title":"SpliceBertForContactPrediction","text":"

Bases: SpliceBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import SpliceBertConfig, SpliceBertForContactPrediction, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
class SpliceBertForContactPrediction(SpliceBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import SpliceBertConfig, SpliceBertForContactPrediction, RnaTokenizer\n        >>> config = SpliceBertConfig()\n        >>> model = SpliceBertForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: SpliceBertConfig):\n        super().__init__(config)\n        self.splicebert = SpliceBertModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.splicebert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForMaskedLM","title":"SpliceBertForMaskedLM","text":"

Bases: SpliceBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import SpliceBertConfig, SpliceBertForMaskedLM, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
class SpliceBertForMaskedLM(SpliceBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import SpliceBertConfig, SpliceBertForMaskedLM, RnaTokenizer\n        >>> config = SpliceBertConfig()\n        >>> model = SpliceBertForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.bias\", \"lm_head.decoder.weight\"]\n\n    def __init__(self, config: SpliceBertConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `SpliceBertForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.splicebert = SpliceBertModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.lm_head.decoder = new_embeddings\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None 
= None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.splicebert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForSequencePrediction","title":"SpliceBertForSequencePrediction","text":"

Bases: SpliceBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import SpliceBertConfig, SpliceBertForSequencePrediction, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
class SpliceBertForSequencePrediction(SpliceBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import SpliceBertConfig, SpliceBertForSequencePrediction, RnaTokenizer\n        >>> config = SpliceBertConfig()\n        >>> model = SpliceBertForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: SpliceBertConfig):\n        super().__init__(config)\n        self.splicebert = SpliceBertModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.splicebert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForTokenPrediction","title":"SpliceBertForTokenPrediction","text":"

Bases: SpliceBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import SpliceBertConfig, SpliceBertForTokenPrediction, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
class SpliceBertForTokenPrediction(SpliceBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import SpliceBertConfig, SpliceBertForTokenPrediction, RnaTokenizer\n        >>> config = SpliceBertConfig()\n        >>> model = SpliceBertForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: SpliceBertConfig):\n        super().__init__(config)\n        self.splicebert = SpliceBertModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.splicebert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel","title":"SpliceBertModel","text":"

Bases: SpliceBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import SpliceBertConfig, SpliceBertModel, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 512])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 512])\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
class SpliceBertModel(SpliceBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import SpliceBertConfig, SpliceBertModel, RnaTokenizer\n        >>> config = SpliceBertConfig()\n        >>> model = SpliceBertModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 512])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 512])\n    \"\"\"\n\n    def __init__(self, config: SpliceBertConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = SpliceBertEmbeddings(config)\n        self.encoder = SpliceBertEncoder(config)\n        self.pooler = SpliceBertPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. 
Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, 
input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            
encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

encoder_hidden_states (Tensor | None, default: None)

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask (Tensor | None, default: None)

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]:

1 for tokens that are not masked,

0 for tokens that are masked.

past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default: None)

Tuple of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

use_cache (bool | None, default: None)

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertPreTrainedModel","title":"SpliceBertPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/splicebert/modeling_splicebert.py Python
class SpliceBertPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = SpliceBertConfig\n    base_model_prefix = \"splicebert\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"SpliceBertLayer\", \"SpliceBertEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n\n    def _set_gradient_checkpointing(self, module, value=False):\n        if isinstance(module, SpliceBertEncoder):\n            module.gradient_checkpointing = value\n
"},{"location":"models/utrbert/","title":"3UTRBERT","text":"

Model pre-trained on 3\u2019 untranslated regions (3\u2019UTRs) using a masked language modeling (MLM) objective.

"},{"location":"models/utrbert/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of Deciphering 3\u2019 UTR mediated gene regulation using interpretable deep representation learning by Yuning Yang, Gen Li, et al.

The OFFICIAL repository of 3UTRBERT is at yangyn533/3UTRBERT.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team releasing 3UTRBERT did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/utrbert/#model-details","title":"Model Details","text":"

3UTRBERT is a BERT-style model pre-trained on a large corpus of 3\u2019 untranslated regions (3\u2019UTRs) in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/utrbert/#variations","title":"Variations","text":""},{"location":"models/utrbert/#model-specification","title":"Model Specification","text":"Variants: Num Layers, Hidden Size, Num Heads, Intermediate Size, Num Parameters (M), FLOPs (G), MACs (G), Max Num Tokens
UTRBERT-3mer: 12, 768, 12, 3072, 86.14, 22.36, 11.17, 512
UTRBERT-4mer: 12, 768, 12, 3072, 86.53, 22.36, 11.17, 512
UTRBERT-5mer: 12, 768, 12, 3072, 88.45, 22.36, 11.17, 512
UTRBERT-6mer: 12, 768, 12, 3072, 98.05, 22.36, 11.17, 512"},{"location":"models/utrbert/#links","title":"Links","text":""},{"location":"models/utrbert/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/utrbert/#direct-use","title":"Direct Use","text":"

Note: Default transformers pipeline does not support K-mer tokenization.

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/utrbert-3mer\")\n>>> unmasker(\"gguc<mask><mask><mask>cugguuagaccagaucugagccu\")[1]\n\n[{'score': 0.40745577216148376,\n  'token': 47,\n  'token_str': 'CUC',\n  'sequence': '<cls> GGU GUC <mask> CUC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.40001827478408813,\n  'token': 32,\n  'token_str': 'CAC',\n  'sequence': '<cls> GGU GUC <mask> CAC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.14566268026828766,\n  'token': 37,\n  'token_str': 'CCC',\n  'sequence': '<cls> GGU GUC <mask> CCC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.04422207176685333,\n  'token': 42,\n  'token_str': 'CGC',\n  'sequence': '<cls> GGU GUC <mask> CGC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.0008025980787351727,\n  'token': 34,\n  'token_str': 'CAU',\n  'sequence': '<cls> GGU GUC <mask> CAU <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'}]\n
"},{"location":"models/utrbert/#downstream-use","title":"Downstream Use","text":""},{"location":"models/utrbert/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, UtrBertModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertModel.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/utrbert/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, UtrBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertForSequencePrediction.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrbert/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, UtrBertForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertForTokenPrediction.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrbert/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, UtrBertForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertForContactPrediction.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrbert/#training-details","title":"Training Details","text":"

3UTRBERT used Masked Language Modeling (MLM) as the pre-training objective: given a sequence, 15% of the input tokens are randomly masked, the entire masked sequence is run through the model, and the model has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
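
As a rough illustration of the objective described above, the following minimal sketch picks about 15% of the token positions and hides them, keeping the original tokens as prediction targets. The helper names choose_mask_positions and apply_mask are hypothetical and not part of multimolecule, and the sketch masks a fixed number of positions, which may differ from how the original training code samples them.

```python
import random


def choose_mask_positions(num_tokens, mask_rate=0.15, seed=0):
    # Pick a random subset of positions; at least one token is always masked.
    rng = random.Random(seed)
    num_to_mask = max(1, round(num_tokens * mask_rate))
    return sorted(rng.sample(range(num_tokens), num_to_mask))


def apply_mask(tokens, positions, mask_token='<mask>'):
    # Replace the chosen positions and remember the originals as labels.
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, labels


tokens = list('UAGCUUAUCAGACUGAUGUU')  # 20 single-nucleotide tokens
positions = choose_mask_positions(len(tokens))
masked, labels = apply_mask(tokens, positions)
print(masked)
print(labels)
```

The model only sees masked; the loss is computed against the entries stored in labels.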

"},{"location":"models/utrbert/#training-data","title":"Training Data","text":"

The 3UTRBERT model was pre-trained on human mRNA transcript sequences from GENCODE. GENCODE aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. The GENCODE release 40 used by this work contains 61,544 genes and 246,624 transcripts.

3UTRBERT collected the human mRNA transcript sequences from GENCODE, including 108,573 unique mRNA transcripts. Only the longest transcript of each gene was used in the pre-training process. 3UTRBERT only used the 3\u2019 untranslated regions (3\u2019UTRs) of the mRNA transcripts for pre-training, to avoid codon constraints in the CDS region and to reduce the added complexity of modeling entire mRNA transcripts. The average length of the 3\u2019UTRs was 1,227 nucleotides, while the median length was 631 nucleotides. Each 3\u2019UTR sequence was cut into non-overlapping patches of 510 nucleotides. The remaining sequences were padded to the same length.
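
The patching step described above can be sketched as follows. This is an illustrative assumption of how non-overlapping 510-nucleotide patches with padding might be produced, not the original preprocessing code; split_into_patches and the '<pad>' placeholder are hypothetical names.

```python
def split_into_patches(sequence, patch_size=510, pad_token='<pad>'):
    # Cut the sequence into consecutive, non-overlapping windows.
    patches = []
    for start in range(0, len(sequence), patch_size):
        patch = list(sequence[start:start + patch_size])
        # Pad a short final window so every patch has the same length.
        patch += [pad_token] * (patch_size - len(patch))
        patches.append(patch)
    return patches


utr = 'ACGU' * 300  # hypothetical 1,200-nucleotide 3'UTR
patches = split_into_patches(utr)
print(len(patches), [len(p) for p in patches])
```

For a 1,200-nucleotide input this yields two full patches and one patch holding the remaining 180 nucleotides followed by padding.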

Note: RnaTokenizer will convert \u201cT\u201ds to \u201cU\u201ds for you; you can disable this behaviour by passing replace_T_with_U=False.

"},{"location":"models/utrbert/#training-procedure","title":"Training Procedure","text":""},{"location":"models/utrbert/#preprocessing","title":"Preprocessing","text":"

3UTRBERT used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:

Since 3UTRBERT uses a k-mer tokenizer, it masks every k-mer that contains a masked nucleotide, rather than individual nucleotides, to avoid information leakage.

For example, if k is 3, the sequence \"UAGCGUAU\" will be tokenized as [\"UAG\", \"AGC\", \"GCG\", \"CGU\", \"GUA\", \"UAU\"]. If the nucleotide \"C\" is masked, every token that contains it will also be masked, resulting in [\"UAG\", \"<mask>\", \"<mask>\", \"<mask>\", \"GUA\", \"UAU\"].
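
The example above can be reproduced with a short pure-Python sketch. The functions kmer_tokenize and mask_nucleotide are hypothetical helpers written for illustration; they are not the multimolecule or 3UTRBERT implementation.

```python
def kmer_tokenize(sequence, k=3):
    # Overlapping k-mers with stride 1: token i covers nucleotides i..i+k-1.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mask_nucleotide(sequence, position, k=3, mask_token='<mask>'):
    # Mask every k-mer that contains the nucleotide at the given position.
    tokens = kmer_tokenize(sequence, k)
    first = max(0, position - k + 1)
    last = min(len(tokens) - 1, position)
    for i in range(first, last + 1):
        tokens[i] = mask_token
    return tokens


tokens = kmer_tokenize('UAGCGUAU')
masked = mask_nucleotide('UAGCGUAU', position=3)  # position 3 is the 'C'
print(tokens)
print(masked)
```

Masking only the middle token would leave the hidden nucleotide visible in its neighbours, which is the leakage the full-window masking avoids.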

"},{"location":"models/utrbert/#pretraining","title":"PreTraining","text":"

The model was trained on 4 NVIDIA Quadro RTX 6000 GPUs with 24 GiB of memory each.

"},{"location":"models/utrbert/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article {yang2023deciphering,\n    author = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Li, Xiangtao and Zhang, Zhaolei},\n    title = {Deciphering 3{\\textquoteright} UTR mediated gene regulation using interpretable deep representation learning},\n    elocation-id = {2023.09.08.556883},\n    year = {2023},\n    doi = {10.1101/2023.09.08.556883},\n    publisher = {Cold Spring Harbor Laboratory},\n    abstract = {The 3{\\textquoteright}untranslated regions (3{\\textquoteright}UTRs) of messenger RNAs contain many important cis-regulatory elements that are under functional and evolutionary constraints. We hypothesize that these constraints are similar to grammars and syntaxes in human languages and can be modeled by advanced natural language models such as Transformers, which has been very effective in modeling protein sequence and structures. Here we describe 3UTRBERT, which implements an attention-based language model, i.e., Bidirectional Encoder Representations from Transformers (BERT). 3UTRBERT was pre-trained on aggregated 3{\\textquoteright}UTR sequences of human mRNAs in a task-agnostic manner; the pre-trained model was then fine-tuned for specific downstream tasks such as predicting RBP binding sites, m6A RNA modification sites, and predicting RNA sub-cellular localizations. Benchmark results showed that 3UTRBERT generally outperformed other contemporary methods in each of these tasks. We also showed that the self-attention mechanism within 3UTRBERT allows direct visualization of the semantic relationship between sequence elements.Competing Interest StatementThe authors have declared no competing interest.},\n    URL = {https://www.biorxiv.org/content/early/2023/09/12/2023.09.08.556883},\n    eprint = {https://www.biorxiv.org/content/early/2023/09/12/2023.09.08.556883.full.pdf},\n    journal = {bioRxiv}\n}\n
"},{"location":"models/utrbert/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the 3UTRBERT paper for questions or comments on the paper/model.

"},{"location":"models/utrbert/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert","title":"multimolecule.models.utrbert","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None): alphabet to use for tokenization. Default: None

nmers (int): Size of kmer to tokenize. Default: 1

codon (bool): Whether to tokenize into codons. Default: False

replace_T_with_U (bool): Whether to replace T with U. Default: True

do_upper_case (bool): Whether to convert input to uppercase. Default: True

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
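
The splitting rules in `_tokenize` above can be tried without installing the library. The following is a minimal sketch in plain Python; `split_rna` is a hypothetical helper name, not part of MultiMolecule, and it reproduces only the string splitting, not the vocabulary lookup or the special tokens.

```python
def split_rna(text, nmers=1, codon=False, replace_T_with_U=True):
    # Hypothetical helper: mirrors only the string-splitting steps of
    # RnaTokenizer._tokenize shown above (uppercasing is always applied here).
    text = text.upper()
    if replace_T_with_U:
        text = text.replace('T', 'U')
    if codon:
        # Codon mode: non-overlapping windows of 3, so the length must divide by 3.
        if len(text) % 3 != 0:
            raise ValueError(f'length must be a multiple of 3, but got {len(text)}')
        return [text[i:i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # k-mer mode: overlapping windows with stride 1, giving len(text) - nmers + 1 tokens.
        return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


print(split_rna('uagcuuauc', nmers=3))
print(split_rna('uagcuuauc', codon=True))
```

For the nine-nucleotide input, overlapping 3-mers give seven tokens and codon mode gives three, which matches the number of ids between the leading and trailing special tokens in the examples above.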
"},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig","title":"UtrBertConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a UtrBertModel. It is used to instantiate a 3UTRBERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the 3UTRBERT yangyn533/3UTRBERT architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int | None): Vocabulary size of the UTRBERT model. Defines the number of different tokens that can be represented by the input_ids passed when calling UtrBertModel. Default: None

nmers (int | None): kmer size of the UTRBERT model. Defines the vocabulary size of the model. Default: None

hidden_size (int): Dimensionality of the encoder layers and the pooler layer. Default: 768

num_hidden_layers (int): Number of hidden layers in the Transformer encoder. Default: 12

num_attention_heads (int): Number of attention heads for each attention layer in the Transformer encoder. Default: 12

intermediate_size (int): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder. Default: 3072

hidden_act (str): The non-linear activation function (function or string) in the encoder and pooler. If string, \"gelu\", \"relu\", \"silu\" and \"gelu_new\" are supported. Default: 'gelu'

hidden_dropout (float): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. Default: 0.1

attention_dropout (float): The dropout ratio for the attention probabilities. Default: 0.1

max_position_embeddings (int): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Default: 512

initializer_range (float): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. Default: 0.02

layer_norm_eps (float): The epsilon used by the layer normalization layers. Default: 1e-12

position_embedding_type (str): Type of position embedding. Choose one of \"absolute\", \"relative_key\", \"relative_key_query\". For positional embeddings use \"absolute\". For more information on \"relative_key\", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.). Default: 'absolute'

is_decoder (bool): Whether the model is used as a decoder or not. If False, the model is used as an encoder. Default: False

use_cache (bool): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True. Default: True

Examples:

Python Console Session
>>> from multimolecule import UtrBertConfig, UtrBertModel\n>>> # Initializing a UtrBERT multimolecule/utrbert style configuration\n>>> configuration = UtrBertConfig(vocab_size=26, nmers=1)\n>>> # Initializing a model (with random weights) from the multimolecule/utrbert style configuration\n>>> model = UtrBertModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/utrbert/configuration_utrbert.py Python
class UtrBertConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`UtrBertModel`][multimolecule.models.UtrBertModel].\n    It is used to instantiate a 3UTRBERT model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the 3UTRBERT\n    [yangyn533/3UTRBERT](https://github.com/yangyn533/3UTRBERT) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the UTRBERT model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`BertModel`].\n        nmers:\n            kmer size of the UTRBERT model. Defines the vocabulary size of the model.\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_act:\n            The non-linear activation function (function or string) in the encoder and pooler. 
If string, `\"gelu\"`,\n            `\"relu\"`, `\"silu\"` and `\"gelu_new\"` are supported.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        position_embedding_type:\n            Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\"`. For\n            positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n            For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n        is_decoder:\n            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n        use_cache:\n            Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n            relevant if `config.is_decoder=True`.\n\n    Examples:\n        >>> from multimolecule import UtrBertConfig, UtrBertModel\n        >>> # Initializing a UtrBERT multimolecule/utrbert style configuration\n        >>> configuration = UtrBertConfig(vocab_size=26, nmers=1)\n        >>> # Initializing a model (with random weights) from the multimolecule/utrbert style configuration\n        >>> model = UtrBertModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"utrbert\"\n\n    def __init__(\n        self,\n        vocab_size: int | None = None,\n        nmers: int | None = None,\n        hidden_size: int = 768,\n        num_hidden_layers: int = 12,\n        num_attention_heads: int = 12,\n        intermediate_size: int = 3072,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 512,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"absolute\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.type_vocab_size = 2\n        self.nmers = nmers\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.hidden_act = hidden_act\n        self.intermediate_size = intermediate_size\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        
self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
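
To see how the default sizes above relate to each other, here is a small sketch in plain Python. It assumes the usual BERT-style convention that `hidden_size` is split evenly across the attention heads; the source above does not show whether or where UtrBERT enforces this, and `derived_sizes` is a hypothetical helper, not a MultiMolecule function.

```python
def derived_sizes(hidden_size=768, num_attention_heads=12, intermediate_size=3072):
    # Assumption: BERT-style attention gives each head hidden_size / num_attention_heads
    # channels, so the division must be exact.
    if hidden_size % num_attention_heads != 0:
        raise ValueError('hidden_size must be divisible by num_attention_heads')
    head_size = hidden_size // num_attention_heads
    # Expansion factor of the feed-forward (intermediate) layer relative to hidden_size.
    ffn_ratio = intermediate_size / hidden_size
    return head_size, ffn_ratio


print(derived_sizes())
```

With the defaults listed above, each head would work on 64 channels and the feed-forward layer would be four times as wide as the hidden state.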
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(nmers)","title":"nmers","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(hidden_act)","title":"hidden_act","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(position_embedding_type)","title":"position_embedding_type","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(is_decoder)","title":"is_decoder","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(use_cache)","title":"use_cache","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForContactPrediction","title":"UtrBertForCont
actPrediction","text":"

Bases: UtrBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrBertConfig, UtrBertForContactPrediction, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=1)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertForContactPrediction(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
class UtrBertForContactPrediction(UtrBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrBertConfig, UtrBertForContactPrediction, RnaTokenizer\n        >>> tokenizer = RnaTokenizer(nmers=1)\n        >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n        >>> model = UtrBertForContactPrediction(config)\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: UtrBertConfig):\n        super().__init__(config)\n        self.utrbert = UtrBertModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrbert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForMaskedLM","title":"UtrBertForMaskedLM","text":"

Bases: UtrBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrBertConfig, UtrBertForMaskedLM, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=2)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertForMaskedLM(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 6, 31])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
class UtrBertForMaskedLM(UtrBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrBertConfig, UtrBertForMaskedLM, RnaTokenizer\n        >>> tokenizer = RnaTokenizer(nmers=2)\n        >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n        >>> model = UtrBertForMaskedLM(config)\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 6, 31])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: UtrBertConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `BertForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.utrbert = UtrBertModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_output_embeddings(self):\n        return self.lm_head.decoder\n\n    def set_output_embeddings(self, new_embeddings):\n        self.lm_head.decoder = new_embeddings\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    
) -> Tuple[Tensor, ...] | MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrbert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForSequencePrediction","title":"UtrBertForSequencePrediction","text":"

Bases: UtrBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrBertConfig, UtrBertForSequencePrediction, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=4)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertForSequencePrediction(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
class UtrBertForSequencePrediction(UtrBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrBertConfig, UtrBertForSequencePrediction, RnaTokenizer\n        >>> tokenizer = RnaTokenizer(nmers=4)\n        >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n        >>> model = UtrBertForSequencePrediction(config)\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: UtrBertConfig):\n        super().__init__(config)\n        self.utrbert = UtrBertModel(config)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrbert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
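
Each `forward` method in this module ends with the same return convention: a plain tuple when `return_dict` is false, with the loss prepended only if labels were given, and a structured output otherwise. The sketch below isolates that convention in plain Python; `pack_outputs` and the dictionary it returns are hypothetical stand-ins for the library's output classes, which are not reproduced here.

```python
def pack_outputs(logits, loss, extra, return_dict):
    # Hypothetical helper illustrating the return convention only.
    # `extra` stands in for the trailing backbone outputs (outputs[2:] in the source).
    if not return_dict:
        output = (logits,) + tuple(extra)
        # The loss is placed first, and only when it was actually computed.
        return ((loss,) + output) if loss is not None else output
    # Stand-in for a structured output such as SequencePredictorOutput.
    return {'loss': loss, 'logits': logits}


print(pack_outputs('logits', None, ('hidden',), return_dict=False))
print(pack_outputs('logits', 0.5, ('hidden',), return_dict=False))
```

One consequence of this design is that positional indexing into the tuple form shifts by one depending on whether labels were passed, which is a reason to prefer the keyed form when both cases occur in the same code path.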
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForTokenPrediction","title":"UtrBertForTokenPrediction","text":"

Bases: UtrBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrBertConfig, UtrBertForTokenPrediction, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=2)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size, nmers=2)\n>>> model = UtrBertForTokenPrediction(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
class UtrBertForTokenPrediction(UtrBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrBertConfig, UtrBertForTokenPrediction, RnaTokenizer\n        >>> tokenizer = RnaTokenizer(nmers=2)\n        >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size, nmers=2)\n        >>> model = UtrBertForTokenPrediction(config)\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: UtrBertConfig):\n        super().__init__(config)\n        self.num_labels = config.num_labels\n        self.utrbert = UtrBertModel(config, add_pooling_layer=False)\n        self.token_head = TokenKMerHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrbert(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel","title":"UtrBertModel","text":"

Bases: UtrBertPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrBertConfig, UtrBertModel, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=1)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertModel(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
class UtrBertModel(UtrBertPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrBertConfig, UtrBertModel, RnaTokenizer\n        >>> tokenizer = RnaTokenizer(nmers=1)\n        >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n        >>> model = UtrBertModel(config)\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 768])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 768])\n    \"\"\"\n\n    def __init__(self, config: UtrBertConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = UtrBertEmbeddings(config)\n        self.encoder = UtrBertEncoder(config)\n        self.pooler = UtrBertPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        
else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape 
[num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

encoder_hidden_states (Tensor | None, default: None)

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask (Tensor | None, default: None)

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.

past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default: None)

Tuple of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

use_cache (bool | None, default: None)

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertPreTrainedModel","title":"UtrBertPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/utrbert/modeling_utrbert.py Python
class UtrBertPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = UtrBertConfig\n    base_model_prefix = \"utrbert\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"UtrBertLayer\", \"UtrBertEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"models/utrlm/","title":"UTR-LM","text":"

Pre-trained model on 5\u2019 untranslated region (5\u2019UTR) using masked language modeling (MLM), Secondary Structure (SS), and Minimum Free Energy (MFE) objectives.

"},{"location":"models/utrlm/#statement","title":"Statement","text":"

A 5\u2019 UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.

Machine learning has been at the forefront of the movement for free and open access to research.

We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.

The MultiMolecule team is committed to the principles of open access and open science.

We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.

Please consider signing the Statement on Nature Machine Intelligence.

"},{"location":"models/utrlm/#disclaimer","title":"Disclaimer","text":"

This is an UNOFFICIAL implementation of A 5\u2019 UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions by Yanyi Chu, Dan Yu, et al.

The OFFICIAL repository of UTR-LM is at a96123155/UTR-LM.

Warning

The MultiMolecule team is unable to confirm that the provided model and checkpoints are producing the same intermediate representations as the original implementation. This is because the proposed method is published in a Closed Access / Author-Fee journal.

The team releasing UTR-LM did not write a model card for this model, so this model card has been written by the MultiMolecule team.

"},{"location":"models/utrlm/#model-details","title":"Model Details","text":"

UTR-LM is a BERT-style model pre-trained on a large corpus of 5\u2019 untranslated regions (5\u2019UTRs) in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those sequences. Please refer to the Training Details section for more information on the training process.

"},{"location":"models/utrlm/#variations","title":"Variations","text":""},{"location":"models/utrlm/#model-specification","title":"Model Specification","text":"Variants Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens UTR-LM MRL 6 128 16 512 1.21 0.35 0.18 1022 UTR-LM TE_EL"},{"location":"models/utrlm/#links","title":"Links","text":""},{"location":"models/utrlm/#usage","title":"Usage","text":"

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule\n
"},{"location":"models/utrlm/#direct-use","title":"Direct Use","text":"

You can use this model directly with a pipeline for masked language modeling:

Python
>>> import multimolecule  # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/utrlm-te_el\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.07707168161869049,\n  'token': 23,\n  'token_str': '*',\n  'sequence': 'G G U C * C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07588472962379456,\n  'token': 5,\n  'token_str': '<null>',\n  'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07178673148155212,\n  'token': 9,\n  'token_str': 'U',\n  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.06414645165205002,\n  'token': 10,\n  'token_str': 'N',\n  'sequence': 'G G U C N C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.06385370343923569,\n  'token': 12,\n  'token_str': 'Y',\n  'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/utrlm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/utrlm/#extract-features","title":"Extract Features","text":"

Here is how to use this model to get the features of a given sequence in PyTorch:

Python
from multimolecule import RnaTokenizer, UtrLmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmModel.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/utrlm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, UtrLmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForSequencePrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrlm/#token-classification-regression","title":"Token Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, UtrLmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForTokenPrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrlm/#contact-classification-regression","title":"Contact Classification / Regression","text":"

Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

Python
import torch\nfrom multimolecule import RnaTokenizer, UtrLmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForContactPrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrlm/#training-details","title":"Training Details","text":"

UTR-LM used a mixed training strategy with one self-supervised task and two supervised tasks, where the labels of both supervised tasks are calculated using ViennaRNA.

  1. Masked Language Modeling (MLM): given a sequence, 15% of the input tokens are randomly masked; the entire masked sequence is then run through the model, which has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
  2. Secondary Structure (SS): predicting the secondary structure of the <mask> token in the MLM task.
  3. Minimum Free Energy (MFE): predicting the minimum free energy of the 5\u2019 UTR sequence.
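
The MLM objective in the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the UTR-LM training code: the helper name mask_tokens is hypothetical, each token is masked independently with probability 0.15 (so the masked fraction is 15% in expectation), and every selected token is simply replaced by the mask token, which is a simplification of the BERT-like procedure.

```python
import random

MASK_TOKEN = '<mask>'


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    # Hypothetical helper: returns (masked, labels). labels holds the original
    # token at masked positions and None everywhere else, so a loss would only
    # be computed where a token was hidden.
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


tokens = list('GGUCUCUCUGGUUAGACCAGAUCUGAGCCU')
masked, labels = mask_tokens(tokens, seed=0)
print(' '.join(masked))
```

In the mixed strategy described above, the secondary-structure target would be attached to the same masked positions, while the minimum free energy is a single value for the whole sequence.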
"},{"location":"models/utrlm/#training-data","title":"Training Data","text":"

The UTR-LM model was pre-trained on 5\u2019 UTR sequences from three sources:

UTR-LM preprocessed the 5\u2019 UTR sequences in a 4-step pipeline:

  1. removed all coding sequences (CDS) and non-5\u2019 UTR fragments from the raw sequences
  2. identified and removed duplicate sequences
  3. truncated the sequences to fit within a range of 30 to 1022 bp
  4. filtered out incorrect and low-quality sequences
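
Steps 2 and 3 of the pipeline above can be sketched as follows. This is a rough illustration, not the authors' preprocessing code: the function name preprocess_utrs is hypothetical, and it assumes that sequences shorter than 30 nt are dropped and that longer ones keep their first 1022 nt, neither of which is stated here. Steps 1 and 4 need annotation and quality information and are left out.

```python
def preprocess_utrs(sequences, min_len=30, max_len=1022):
    # Hypothetical sketch of deduplication plus length handling.
    seen = set()
    kept = []
    for seq in sequences:
        seq = seq.upper()
        if seq in seen:
            continue  # step 2: skip exact duplicates (case-insensitive)
        seen.add(seq)
        if len(seq) < min_len:
            continue  # assumption: too-short sequences are discarded
        kept.append(seq[:max_len])  # assumption: keep the first max_len nt
    return kept


raw = ['A' * 40, 'a' * 40, 'C' * 10, 'G' * 2000]
print([len(s) for s in preprocess_utrs(raw)])  # [40, 1022]
```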

Note: RnaTokenizer will convert \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False.
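
The conversion described in the note can be illustrated with a standalone sketch of the string handling that RnaTokenizer performs before looking up token ids. The function tokenize_rna is a hypothetical stand-in that mirrors the `_tokenize` method shown in the RnaTokenizer source on this page; it returns string tokens only, does not add special tokens, and does not map tokens to ids.

```python
def tokenize_rna(text, nmers=1, codon=False, replace_T_with_U=True, do_upper_case=True):
    # Hypothetical stand-in for the string handling in RnaTokenizer._tokenize.
    if do_upper_case:
        text = text.upper()
    if replace_T_with_U:
        text = text.replace('T', 'U')
    if codon:
        # Codons are non-overlapping triplets, so the length must divide by 3.
        if len(text) % 3 != 0:
            raise ValueError(
                f'length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}'
            )
        return [text[i:i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # k-mers are overlapping windows that slide one nucleotide at a time.
        return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


print(tokenize_rna('acgt'))                          # ['A', 'C', 'G', 'U']
print(tokenize_rna('acgt', replace_T_with_U=False))  # ['A', 'C', 'G', 'T']
print(tokenize_rna('uagcuuauc', codon=True))         # ['UAG', 'CUU', 'AUC']
```

With nmers=3 the same nine-nucleotide input yields seven overlapping tokens, which matches the seven non-special ids in the RnaTokenizer examples on this page.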

"},{"location":"models/utrlm/#training-procedure","title":"Training Procedure","text":""},{"location":"models/utrlm/#preprocessing","title":"Preprocessing","text":"

UTR-LM used masked language modeling (MLM) as one of the pre-training objectives. The masking procedure is similar to the one used in BERT:

"},{"location":"models/utrlm/#pretraining","title":"PreTraining","text":"

The model was trained on two clusters:

  1. 4 NVIDIA V100 GPUs with 16 GiB of memory each.
  2. 4 NVIDIA P100 GPUs with 32 GiB of memory each.
"},{"location":"models/utrlm/#citation","title":"Citation","text":"

BibTeX:

BibTeX
@article {chu2023a,\n    author = {Chu, Yanyi and Yu, Dan and Li, Yupeng and Huang, Kaixuan and Shen, Yue and Cong, Le and Zhang, Jason and Wang, Mengdi},\n    title = {A 5{\\textquoteright} UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions},\n    elocation-id = {2023.10.11.561938},\n    year = {2023},\n    doi = {10.1101/2023.10.11.561938},\n    publisher = {Cold Spring Harbor Laboratory},\n    abstract = {The 5{\\textquoteright} UTR, a regulatory region at the beginning of an mRNA molecule, plays a crucial role in regulating the translation process and impacts the protein expression level. Language models have showcased their effectiveness in decoding the functions of protein and genome sequences. Here, we introduced a language model for 5{\\textquoteright} UTR, which we refer to as the UTR-LM. The UTR-LM is pre-trained on endogenous 5{\\textquoteright} UTRs from multiple species and is further augmented with supervised information including secondary structure and minimum free energy. We fine-tuned the UTR-LM in a variety of downstream tasks. The model outperformed the best-known benchmark by up to 42\\% for predicting the Mean Ribosome Loading, and by up to 60\\% for predicting the Translation Efficiency and the mRNA Expression Level. The model also applies to identifying unannotated Internal Ribosome Entry Sites within the untranslated region and improves the AUPR from 0.37 to 0.52 compared to the best baseline. Further, we designed a library of 211 novel 5{\\textquoteright} UTRs with high predicted values of translation efficiency and evaluated them via a wet-lab assay. 
Experiment results confirmed that our top designs achieved a 32.5\\% increase in protein production level relative to well-established 5{\\textquoteright} UTR optimized for therapeutics.Competing Interest StatementThe authors have declared no competing interest.},\n    URL = {https://www.biorxiv.org/content/early/2023/10/14/2023.10.11.561938},\n    eprint = {https://www.biorxiv.org/content/early/2023/10/14/2023.10.11.561938.full.pdf},\n    journal = {bioRxiv}\n}\n
"},{"location":"models/utrlm/#contact","title":"Contact","text":"

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the UTR-LM paper for questions or comments on the paper/model.

"},{"location":"models/utrlm/#license","title":"License","text":"

This model is licensed under the AGPL-3.0 License.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm","title":"multimolecule.models.utrlm","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer","title":"RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

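alphabet (Alphabet | str | List[str] | None, default: None)

Alphabet to use for tokenization. If None, the standard RNA alphabet is used. If a string, it should correspond to the name of a predefined alphabet: standard, extended, streamline, or nucleobase. If an Alphabet or a list of characters, that specific alphabet is used.

nmers (int, default: 1)

Size of kmer to tokenize.

codon (bool, default: False)

Whether to tokenize into codons.

replace_T_with_U (bool, default: True)

Whether to replace T with U.

do_upper_case (bool, default: True)

Whether to convert input to uppercase.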

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
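The splitting rules in `_tokenize` above can be sketched in plain Python, without the library. This is a minimal illustration that assumes only the behaviour visible in the source: `split_rna` is a hypothetical helper, it always uppercases its input, and it returns string tokens rather than vocabulary ids.

```python
def split_rna(text, nmers=1, codon=False, replace_t_with_u=True):
    # Hypothetical stand-alone helper mirroring the branches of _tokenize.
    text = text.upper()
    if replace_t_with_u:
        text = text.replace('T', 'U')
    if codon:
        if len(text) % 3 != 0:
            raise ValueError('length must be a multiple of 3, got %d' % len(text))
        # Codon mode: non-overlapping triplets.
        return [text[i:i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # k-mer mode: overlapping windows that advance one base at a time.
        return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


print(split_rna('uagcuuauc', nmers=3))
print(split_rna('uagcuuauc', codon=True))
```

For the nine-base input, k-mer mode yields seven overlapping tokens and codon mode yields three. This matches the lengths of the id lists in the docstring once the two surrounding special tokens are counted.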
"},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(codon)","title":"codon","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig","title":"UtrLmConfig","text":"

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a UtrLmModel. It is used to instantiate a UTR-LM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the UTR-LM a96123155/UTR-LM architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

vocab_size (int, default 26): Vocabulary size of the UTR-LM model. Defines the number of different tokens that can be represented by the input_ids passed when calling UtrLmModel.

hidden_size (int, default 128): Dimensionality of the encoder layers and the pooler layer.

num_hidden_layers (int, default 6): Number of hidden layers in the Transformer encoder.

num_attention_heads (int, default 16): Number of attention heads for each attention layer in the Transformer encoder.

intermediate_size (int, default 512): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.

hidden_dropout (float, default 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_dropout (float, default 0.1): The dropout ratio for the attention probabilities.

max_position_embeddings (int, default 1026): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

initializer_range (float, default 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (float, default 1e-12): The epsilon used by the layer normalization layers.

position_embedding_type (str, default 'rotary'): Type of position embedding. Choose one of \"absolute\", \"relative_key\", \"relative_key_query\", \"rotary\". For positional embeddings use \"absolute\". For more information on \"relative_key\", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

is_decoder (bool, default False): Whether the model is used as a decoder or not. If False, the model is used as an encoder.

use_cache (bool, default True): Whether or not the model should return the last key/value attentions (not used by all models). Only relevant if config.is_decoder=True.

emb_layer_norm_before (bool, default False): Whether to apply layer normalization after embeddings but before the main stem of the network.

token_dropout (bool, default False): When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
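One relationship between the defaults listed above is worth checking by hand. Assuming the conventional multi-head attention layout, in which the hidden size is split evenly across heads (the attention code itself is not shown here), the defaults of 128 and 16 give a per-head width of 8. The sketch below uses a hypothetical `ToyConfig` stand-in, not the library class.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    # Hypothetical stand-in holding two of the documented defaults.
    hidden_size: int = 128
    num_attention_heads: int = 16

    def head_dim(self):
        # Assumes the usual even split of the hidden size across heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError('hidden_size must be divisible by num_attention_heads')
        return self.hidden_size // self.num_attention_heads


print(ToyConfig().head_dim())
```

A check like this is useful when overriding only one of the two values, since a combination that does not divide evenly is a common source of shape errors under this convention.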

Examples:

Python Console Session
>>> from multimolecule import UtrLmModel, UtrLmConfig\n>>> # Initializing a UTR-LM multimolecule/utrlm style configuration\n>>> configuration = UtrLmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/utrlm style configuration\n>>> model = UtrLmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/utrlm/configuration_utrlm.py Python
class UtrLmConfig(PreTrainedConfig):\n    r\"\"\"\n    This is the configuration class to store the configuration of a [`UtrLmModel`][multimolecule.models.UtrLmModel].\n    It is used to instantiate a UTR-LM model according to the specified arguments, defining the model architecture.\n    Instantiating a configuration with the defaults will yield a similar configuration to that of the UTR-LM\n    [a96123155/UTR-LM](https://github.com/a96123155/UTR-LM) architecture.\n\n    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n    for more information.\n\n    Args:\n        vocab_size:\n            Vocabulary size of the UTR-LM model. Defines the number of different tokens that can be represented by the\n            `inputs_ids` passed when calling [`UtrLmModel`].\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n        num_hidden_layers:\n            Number of hidden layers in the Transformer encoder.\n        num_attention_heads:\n            Number of attention heads for each attention layer in the Transformer encoder.\n        intermediate_size:\n            Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n        hidden_dropout:\n            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n        attention_dropout:\n            The dropout ratio for the attention probabilities.\n        max_position_embeddings:\n            The maximum sequence length that this model might ever be used with. 
Typically set this to something large\n            just in case (e.g., 512 or 1024 or 2048).\n        initializer_range:\n            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        position_embedding_type:\n            Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n            For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n            For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n        is_decoder:\n            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n        use_cache:\n            Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n            relevant if `config.is_decoder=True`.\n        emb_layer_norm_before:\n            Whether to apply layer normalization after embeddings but before the main stem of the network.\n        token_dropout:\n            When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n    Examples:\n        >>> from multimolecule import UtrLmModel, UtrLmConfig\n        >>> # Initializing a UTR-LM multimolecule/utrlm style configuration\n        >>> configuration = UtrLmConfig()\n        >>> # Initializing a model (with random weights) from the multimolecule/utrlm style configuration\n        >>> model = UtrLmModel(configuration)\n        >>> # Accessing the model configuration\n        >>> configuration = model.config\n    \"\"\"\n\n    model_type = \"utrlm\"\n\n    def __init__(\n        self,\n        vocab_size: int = 26,\n        hidden_size: int = 128,\n        num_hidden_layers: int = 6,\n        num_attention_heads: int = 16,\n        intermediate_size: int = 512,\n        hidden_act: str = \"gelu\",\n        hidden_dropout: float = 0.1,\n        attention_dropout: float = 0.1,\n        max_position_embeddings: int = 1026,\n        initializer_range: float = 0.02,\n        layer_norm_eps: float = 1e-12,\n        position_embedding_type: str = \"rotary\",\n        is_decoder: bool = False,\n        use_cache: bool = True,\n        emb_layer_norm_before: bool = False,\n        token_dropout: bool = False,\n        head: HeadConfig | None = None,\n        lm_head: MaskedLMHeadConfig | None = None,\n        ss_head: HeadConfig | None = None,\n        mfe_head: HeadConfig | None = None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n\n        self.vocab_size = vocab_size\n        self.hidden_size = hidden_size\n        self.num_hidden_layers = num_hidden_layers\n        self.num_attention_heads = num_attention_heads\n        self.intermediate_size = intermediate_size\n        self.hidden_act = 
hidden_act\n        self.hidden_dropout = hidden_dropout\n        self.attention_dropout = attention_dropout\n        self.max_position_embeddings = max_position_embeddings\n        self.initializer_range = initializer_range\n        self.layer_norm_eps = layer_norm_eps\n        self.position_embedding_type = position_embedding_type\n        self.is_decoder = is_decoder\n        self.use_cache = use_cache\n        self.emb_layer_norm_before = emb_layer_norm_before\n        self.token_dropout = token_dropout\n        self.head = HeadConfig(**head) if head is not None else None\n        self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n        self.ss_head = HeadConfig(**ss_head) if ss_head is not None else None\n        self.mfe_head = HeadConfig(**mfe_head) if mfe_head is not None else None\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(vocab_size)","title":"vocab_size","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(hidden_size)","title":"hidden_size","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(num_hidden_layers)","title":"num_hidden_layers","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(num_attention_heads)","title":"num_attention_heads","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(intermediate_size)","title":"intermediate_size","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(hidden_dropout)","title":"hidden_dropout","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(attention_dropout)","title":"attention_dropout","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(max_position_embeddings)","title":"max_position_embeddings","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(initializer_range)","title":"initializer_range","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(position_embedding_type)","title":"position_embedding_type","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(is_decoder)","title":"is_decoder","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(use_cache)","title":"use_cache","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(emb_layer_norm_before)","title":"emb_layer_norm_before","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(token_dropout)","title":"token_dropout","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForContactPrediction","title":"UtrLmForContactPrediction","text":"

Bases: UtrLmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import UtrLmConfig, UtrLmForContactPrediction, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmForContactPrediction(UtrLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrLmConfig, UtrLmForContactPrediction, RnaTokenizer\n        >>> config = UtrLmConfig()\n        >>> model = UtrLmForContactPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: UtrLmConfig):\n        super().__init__(config)\n        self.utrlm = UtrLmModel(config, add_pooling_layer=True)\n        self.contact_head = ContactPredictionHead(config)\n        self.head_config = self.contact_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| ContactPredictorOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrlm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.contact_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return ContactPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
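When `return_dict` is false, the forward method above returns a plain tuple instead of an output object. The sketch below isolates that assembly step. `to_tuple` is a hypothetical helper, and `backbone_outputs` stands in for the tuple returned by the base model, whose first two entries are assumed to be the last hidden state and the pooled output.

```python
def to_tuple(logits, backbone_outputs, loss=None):
    # Hypothetical helper reproducing the tuple branch of forward().
    # The first two backbone entries are dropped; the rest are kept.
    output = (logits,) + tuple(backbone_outputs[2:])
    # The loss is prepended only when labels produced one.
    return ((loss,) + output) if loss is not None else output


backbone = ('last_hidden_state', 'pooler_output', 'hidden_states', 'attentions')
print(to_tuple('logits', backbone))
print(to_tuple('logits', backbone, loss=0.5))
```

The position of `logits` in the result therefore depends on whether a loss is present, which is worth remembering when indexing tuple outputs by position.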
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForMaskedLM","title":"UtrLmForMaskedLM","text":"

Bases: UtrLmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrLmConfig, UtrLmForMaskedLM, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmForMaskedLM(UtrLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n        >>> config = UtrLmConfig()\n        >>> model = UtrLmForMaskedLM(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=input[\"input_ids\"])\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<NllLossBackward0>)\n    \"\"\"\n\n    _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n    def __init__(self, config: UtrLmConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `UtrLmForMaskedLM` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.utrlm = UtrLmModel(config, add_pooling_layer=False)\n        self.lm_head = MaskedLMHead(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| MaskedLMOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrlm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.lm_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return MaskedLMOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForPreTraining","title":"UtrLmForPreTraining","text":"

Bases: UtrLmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrLmConfig, UtrLmForPreTraining, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"contact_map\"].shape\ntorch.Size([1, 5, 5, 1])\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmForPreTraining(UtrLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n        >>> config = UtrLmConfig()\n        >>> model = UtrLmForPreTraining(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<AddBackward0>)\n        >>> output[\"logits\"].shape\n        torch.Size([1, 7, 26])\n        >>> output[\"contact_map\"].shape\n        torch.Size([1, 5, 5, 1])\n    \"\"\"\n\n    _tied_weights_keys = [\n        \"lm_head.decoder.weight\",\n        \"lm_head.decoder.bias\",\n        \"pretrain.predictions.decoder.weight\",\n        \"pretrain.predictions.decoder.bias\",\n        \"pretrain.predictions_ss.decoder.weight\",\n        \"pretrain.predictions_ss.decoder.bias\",\n    ]\n\n    def __init__(self, config: UtrLmConfig):\n        super().__init__(config)\n        if config.is_decoder:\n            logger.warning(\n                \"If you want to use `UtrLmForPreTraining` make sure `config.is_decoder=False` for \"\n                \"bi-directional self-attention.\"\n            )\n        self.utrlm = UtrLmModel(config, add_pooling_layer=False)\n        self.pretrain = UtrLmPreTrainingHeads(config)\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_output_embeddings(self):\n        return self.pretrain.predictions.decoder\n\n    def set_output_embeddings(self, embeddings):\n        self.pretrain.predictions.decoder = embeddings\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = 
None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        labels_mlm: Tensor | None = None,\n        labels_contact: Tensor | None = None,\n        labels_ss: Tensor | None = None,\n        labels_mfe: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | UtrLmForPreTrainingOutput:\n        if output_attentions is False:\n            warn(\"output_attentions must be True for contact classification and will be ignored.\")\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrlm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_attention_mask,\n            output_attentions=True,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        total_loss, logits, contact_map, secondary_structure, minimum_free_energy = self.pretrain(\n            outputs,\n            attention_mask,\n            input_ids,\n            labels_mlm=labels_mlm,\n            labels_contact=labels_contact,\n            labels_ss=labels_ss,\n            labels_mfe=labels_mfe,\n        )\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((total_loss,) + output) if total_loss is not None else output\n\n        return UtrLmForPreTrainingOutput(\n            loss=total_loss,\n            logits=logits,\n            contact_map=contact_map,\n            secondary_structure=secondary_structure,\n            minimum_free_energy=minimum_free_energy,\n           
 hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForSequencePrediction","title":"UtrLmForSequencePrediction","text":"

Bases: UtrLmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import UtrLmConfig, UtrLmForSequencePrediction, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmForSequencePrediction(UtrLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n        >>> config = UtrLmConfig()\n        >>> model = UtrLmForSequencePrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.tensor([[1]]))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: UtrLmConfig):\n        super().__init__(config)\n        self.utrlm = UtrLmModel(config, add_pooling_layer=True)\n        self.sequence_head = SequencePredictionHead(config)\n        self.head_config = self.sequence_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| SequencePredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrlm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.sequence_head(outputs, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return SequencePredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForTokenPrediction","title":"UtrLmForTokenPrediction","text":"

Bases: UtrLmPreTrainedModel

Examples:

Python Console Session
>>> import torch\n>>> from multimolecule import UtrLmConfig, UtrLmForTokenPrediction, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmForTokenPrediction(UtrLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n        >>> config = UtrLmConfig()\n        >>> model = UtrLmForTokenPrediction(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n        >>> output[\"logits\"].shape\n        torch.Size([1, 5, 1])\n        >>> output[\"loss\"]  # doctest:+ELLIPSIS\n        tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n    \"\"\"\n\n    def __init__(self, config: UtrLmConfig):\n        super().__init__(config)\n        self.utrlm = UtrLmModel(config, add_pooling_layer=True)\n        self.token_head = TokenPredictionHead(config)\n        self.head_config = self.token_head.config\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        labels: Tensor | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] 
| TokenPredictorOutput:\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n        outputs = self.utrlm(\n            input_ids,\n            attention_mask=attention_mask,\n            position_ids=position_ids,\n            head_mask=head_mask,\n            inputs_embeds=inputs_embeds,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n            **kwargs,\n        )\n        output = self.token_head(outputs, attention_mask, input_ids, labels)\n        logits, loss = output.logits, output.loss\n\n        if not return_dict:\n            output = (logits,) + outputs[2:]\n            return ((loss,) + output) if loss is not None else output\n\n        return TokenPredictorOutput(\n            loss=loss,\n            logits=logits,\n            hidden_states=outputs.hidden_states,\n            attentions=outputs.attentions,\n        )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel","title":"UtrLmModel","text":"

Bases: UtrLmPreTrainedModel

Examples:

Python Console Session
>>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 128])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 128])\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmModel(UtrLmPreTrainedModel):\n    \"\"\"\n    Examples:\n        >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n        >>> config = UtrLmConfig()\n        >>> model = UtrLmModel(config)\n        >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n        >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n        >>> output = model(**input)\n        >>> output[\"last_hidden_state\"].shape\n        torch.Size([1, 7, 128])\n        >>> output[\"pooler_output\"].shape\n        torch.Size([1, 128])\n    \"\"\"\n\n    def __init__(self, config: UtrLmConfig, add_pooling_layer: bool = True):\n        super().__init__(config)\n        self.pad_token_id = config.pad_token_id\n        self.embeddings = UtrLmEmbeddings(config)\n        self.encoder = UtrLmEncoder(config)\n        self.pooler = UtrLmPooler(config) if add_pooling_layer else None\n\n        # Initialize weights and apply final processing\n        self.post_init()\n\n    def get_input_embeddings(self):\n        return self.embeddings.word_embeddings\n\n    def set_input_embeddings(self, value):\n        self.embeddings.word_embeddings = value\n\n    def _prune_heads(self, heads_to_prune):\n        \"\"\"\n        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n        class PreTrainedModel\n        \"\"\"\n        for layer, heads in heads_to_prune.items():\n            self.encoder.layer[layer].attention.prune_heads(heads)\n\n    def forward(\n        self,\n        input_ids: Tensor | NestedTensor,\n        attention_mask: Tensor | None = None,\n        position_ids: Tensor | None = None,\n        head_mask: Tensor | None = None,\n        inputs_embeds: Tensor | NestedTensor | None = None,\n        encoder_hidden_states: Tensor | None = None,\n        encoder_attention_mask: Tensor | None = None,\n        past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n        use_cache: bool | None = None,\n        output_attentions: bool | None = None,\n        output_hidden_states: bool | None = None,\n        return_dict: bool | None = None,\n        **kwargs,\n    ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n        r\"\"\"\n        Args:\n            encoder_hidden_states:\n                Shape: `(batch_size, sequence_length, hidden_size)`\n\n                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n                the model is configured as a decoder.\n            encoder_attention_mask:\n                Shape: `(batch_size, sequence_length)`\n\n                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n                in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n                - 1 for tokens that are **not masked**,\n                - 0 for tokens that are **masked**.\n            past_key_values:\n                Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n                Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n                decoding.\n\n                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n            use_cache:\n                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n                (see `past_key_values`).\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n                f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n                \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n            )\n        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n        output_hidden_states = (\n            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n        )\n        return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n        if self.config.is_decoder:\n            use_cache = use_cache if use_cache is not None else self.config.use_cache\n        else:\n            use_cache = False\n\n        if isinstance(input_ids, NestedTensor):\n            input_ids, attention_mask = input_ids.tensor, input_ids.mask\n        if input_ids is not None and inputs_embeds is not None:\n            raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n        if input_ids is not None:\n            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n            input_shape = input_ids.size()\n        elif inputs_embeds is not None:\n            input_shape = inputs_embeds.size()[:-1]\n        
else:\n            raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n        batch_size, seq_length = input_shape\n        device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n        # past_key_values_length\n        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n        if attention_mask is None:\n            attention_mask = (\n                input_ids.ne(self.pad_token_id)\n                if self.pad_token_id is not None\n                else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n            )\n\n        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n        # ourselves in which case we just need to make it broadcastable to all heads.\n        extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n        # If a 2D or 3D attention mask is provided for the cross-attention\n        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n        if self.config.is_decoder and encoder_hidden_states is not None:\n            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n            if encoder_attention_mask is None:\n                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n        else:\n            encoder_extended_attention_mask = None\n\n        # Prepare head mask if needed\n        # 1.0 in head_mask indicate we keep the head\n        # attention_probs has shape bsz x n_heads x N x N\n        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n        # and head_mask is converted to shape 
[num_hidden_layers x batch x num_heads x seq_length x seq_length]\n        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n        embedding_output = self.embeddings(\n            input_ids=input_ids,\n            position_ids=position_ids,\n            attention_mask=attention_mask,\n            inputs_embeds=inputs_embeds,\n            past_key_values_length=past_key_values_length,\n        )\n        encoder_outputs = self.encoder(\n            embedding_output,\n            attention_mask=extended_attention_mask,\n            head_mask=head_mask,\n            encoder_hidden_states=encoder_hidden_states,\n            encoder_attention_mask=encoder_extended_attention_mask,\n            past_key_values=past_key_values,\n            use_cache=use_cache,\n            output_attentions=output_attentions,\n            output_hidden_states=output_hidden_states,\n            return_dict=return_dict,\n        )\n        sequence_output = encoder_outputs[0]\n        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n        if not return_dict:\n            return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n        return BaseModelOutputWithPoolingAndCrossAttentions(\n            last_hidden_state=sequence_output,\n            pooler_output=pooled_output,\n            past_key_values=encoder_outputs.past_key_values,\n            hidden_states=encoder_outputs.hidden_states,\n            attentions=encoder_outputs.attentions,\n            cross_attentions=encoder_outputs.cross_attentions,\n        )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward","title":"forward","text":"Python
forward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n

Parameters:

Name Type Description Default

encoder_hidden_states Tensor | None

Shape: (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

None

encoder_attention_mask Tensor | None

Shape: (batch_size, sequence_length)

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.

None

past_key_values Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None

Tuple of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)

Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

None

use_cache bool | None

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

None

Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
def forward(\n    self,\n    input_ids: Tensor | NestedTensor,\n    attention_mask: Tensor | None = None,\n    position_ids: Tensor | None = None,\n    head_mask: Tensor | None = None,\n    inputs_embeds: Tensor | NestedTensor | None = None,\n    encoder_hidden_states: Tensor | None = None,\n    encoder_attention_mask: Tensor | None = None,\n    past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n    use_cache: bool | None = None,\n    output_attentions: bool | None = None,\n    output_hidden_states: bool | None = None,\n    return_dict: bool | None = None,\n    **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n    r\"\"\"\n    Args:\n        encoder_hidden_states:\n            Shape: `(batch_size, sequence_length, hidden_size)`\n\n            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n            the model is configured as a decoder.\n        encoder_attention_mask:\n            Shape: `(batch_size, sequence_length)`\n\n            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n            in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n            - 1 for tokens that are **not masked**,\n            - 0 for tokens that are **masked**.\n        past_key_values:\n            Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n            Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n            decoding.\n\n            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n        use_cache:\n            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n            (see `past_key_values`).\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n            f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n            \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n        )\n    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n    output_hidden_states = (\n        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n    )\n    return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n    if self.config.is_decoder:\n        use_cache = use_cache if use_cache is not None else self.config.use_cache\n    else:\n        use_cache = False\n\n    if isinstance(input_ids, NestedTensor):\n        input_ids, attention_mask = input_ids.tensor, input_ids.mask\n    if input_ids is not None and inputs_embeds is not None:\n        raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n    if input_ids is not None:\n        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n        input_shape = input_ids.size()\n    elif inputs_embeds is not None:\n        input_shape = inputs_embeds.size()[:-1]\n    else:\n        raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n    batch_size, seq_length = 
input_shape\n    device = input_ids.device if input_ids is not None else inputs_embeds.device  # type: ignore[union-attr]\n\n    # past_key_values_length\n    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n    if attention_mask is None:\n        attention_mask = (\n            input_ids.ne(self.pad_token_id)\n            if self.pad_token_id is not None\n            else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n        )\n\n    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n    # ourselves in which case we just need to make it broadcastable to all heads.\n    extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n    # If a 2D or 3D attention mask is provided for the cross-attention\n    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n    if self.config.is_decoder and encoder_hidden_states is not None:\n        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n        if encoder_attention_mask is None:\n            encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n    else:\n        encoder_extended_attention_mask = None\n\n    # Prepare head mask if needed\n    # 1.0 in head_mask indicate we keep the head\n    # attention_probs has shape bsz x n_heads x N x N\n    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n    embedding_output = self.embeddings(\n        input_ids=input_ids,\n        
position_ids=position_ids,\n        attention_mask=attention_mask,\n        inputs_embeds=inputs_embeds,\n        past_key_values_length=past_key_values_length,\n    )\n    encoder_outputs = self.encoder(\n        embedding_output,\n        attention_mask=extended_attention_mask,\n        head_mask=head_mask,\n        encoder_hidden_states=encoder_hidden_states,\n        encoder_attention_mask=encoder_extended_attention_mask,\n        past_key_values=past_key_values,\n        use_cache=use_cache,\n        output_attentions=output_attentions,\n        output_hidden_states=output_hidden_states,\n        return_dict=return_dict,\n    )\n    sequence_output = encoder_outputs[0]\n    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n    if not return_dict:\n        return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n    return BaseModelOutputWithPoolingAndCrossAttentions(\n        last_hidden_state=sequence_output,\n        pooler_output=pooled_output,\n        past_key_values=encoder_outputs.past_key_values,\n        hidden_states=encoder_outputs.hidden_states,\n        attentions=encoder_outputs.attentions,\n        cross_attentions=encoder_outputs.cross_attentions,\n    )\n
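
When no attention_mask is passed, the forward method above derives one by comparing input_ids against pad_token_id, and falls back to an all-ones mask when no padding token is configured. The following is a minimal pure-Python sketch of that rule only; the function name default_attention_mask and the token ids are illustrative assumptions, and the sketch ignores the past_key_values_length term that the real fallback adds to the mask length.

```python
def default_attention_mask(input_ids, pad_token_id=None):
    # Hypothetical helper: 1 marks a real token, 0 marks padding.
    if pad_token_id is None:
        # No padding token configured: attend to every position.
        return [[1] * len(seq) for seq in input_ids]
    return [[int(tok != pad_token_id) for tok in seq] for seq in input_ids]


# Two sequences padded to the same length; the ids are made up for illustration.
batch = [[1, 6, 7, 2, 0, 0], [1, 6, 7, 8, 9, 2]]
mask = default_attention_mask(batch, pad_token_id=0)
print(mask)
```

Passing an explicit attention_mask remains the safer choice whenever a padding id could also occur as a legitimate token.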
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(encoder_hidden_states)","title":"encoder_hidden_states","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(encoder_attention_mask)","title":"encoder_attention_mask","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(past_key_values)","title":"past_key_values","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(use_cache)","title":"use_cache","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmPreTrainedModel","title":"UtrLmPreTrainedModel","text":"

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/utrlm/modeling_utrlm.py Python
class UtrLmPreTrainedModel(PreTrainedModel):\n    \"\"\"\n    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n    models.\n    \"\"\"\n\n    config_class = UtrLmConfig\n    base_model_prefix = \"utrlm\"\n    supports_gradient_checkpointing = True\n    _no_split_modules = [\"UtrLmLayer\", \"UtrLmEmbeddings\"]\n\n    # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n    def _init_weights(self, module: nn.Module):\n        \"\"\"Initialize the weights\"\"\"\n        if isinstance(module, nn.Linear):\n            # Slightly different from the TF version which uses truncated_normal for initialization\n            # cf https://github.com/pytorch/pytorch/pull/5617\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.bias is not None:\n                module.bias.data.zero_()\n        elif isinstance(module, nn.Embedding):\n            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n            if module.padding_idx is not None:\n                module.weight.data[module.padding_idx].zero_()\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n
"},{"location":"module/","title":"module","text":"

module provides a collection of pre-defined modules for users to implement their own architectures.

MultiMolecule is built upon the Hugging Face ecosystem, embracing a similar design philosophy: Don\u2019t Repeat Yourself. We follow the single model file policy, where each model under the models package contains one and only one modeling.py file that describes the network design.

The module package is intended for simple, reusable modules that are consistent across multiple models. This approach minimizes code duplication and promotes clean, maintainable code.

"},{"location":"module/#key-features","title":"Key Features","text":""},{"location":"module/#modules","title":"Modules","text":""},{"location":"module/embeddings/","title":"embeddings","text":"

embeddings provide a collection of pre-defined positional embeddings.

"},{"location":"module/embeddings/#multimolecule.module.embeddings","title":"multimolecule.module.embeddings","text":""},{"location":"module/embeddings/#multimolecule.module.embeddings.RotaryEmbedding","title":"RotaryEmbedding","text":"

Bases: Module

Rotary position embeddings based on those in RoFormer.

Queries and keys are transformed by rotation matrices which depend on their relative positions.

Cache

The inverse frequency buffer is cached and updated only when the sequence length changes or the device changes.

Sequence Length

Rotary Embedding is independent of the sequence length and can be used for any sequence length.

Source code in multimolecule/module/embeddings/rotary.py Python
@PositionEmbeddingRegistry.register(\"rotary\")\n@PositionEmbeddingRegistryHF.register(\"rotary\")\nclass RotaryEmbedding(nn.Module):\n    \"\"\"\n    Rotary position embeddings based on those in\n    [RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer).\n\n    Query and keys are transformed by rotation\n    matrices which depend on their relative positions.\n\n    Tip: **Cache**\n        The inverse frequency buffer is cached and updated only when the sequence length changes or the device changes.\n\n    Success: **Sequence Length**\n        Rotary Embedding is irrespective of the sequence length and can be used for any sequence length.\n    \"\"\"\n\n    def __init__(self, embedding_dim: int):\n        super().__init__()\n        # Generate and save the inverse frequency buffer (non trainable)\n        inv_freq = 1.0 / (10000 ** (torch.arange(0, embedding_dim, 2, dtype=torch.int64).float() / embedding_dim))\n        self.register_buffer(\"inv_freq\", inv_freq)\n\n        self._seq_len_cached = None\n        self._cos_cached = None\n        self._sin_cached = None\n\n    def forward(self, q: Tensor, k: Tensor) -> Tuple[Tensor, Tensor]:\n        self._update_cos_sin_tables(k, seq_dimension=-2)\n\n        return (self.apply_rotary_pos_emb(q), self.apply_rotary_pos_emb(k))\n\n    def _update_cos_sin_tables(self, x, seq_dimension=2):\n        seq_len = x.shape[seq_dimension]\n\n        # Reset the tables if the sequence length has changed,\n        # or if we're on a new device (possibly due to tracing for instance)\n        if seq_len != self._seq_len_cached or self._cos_cached.device != x.device:\n            self._seq_len_cached = seq_len\n            t = torch.arange(x.shape[seq_dimension], device=x.device).type_as(self.inv_freq)\n            freqs = torch.outer(t, self.inv_freq)\n            emb = torch.cat((freqs, freqs), dim=-1).to(x.device)\n\n            self._cos_cached = emb.cos()[None, None, :, :]\n            self._sin_cached = 
emb.sin()[None, None, :, :]\n\n        return self._cos_cached, self._sin_cached\n\n    def apply_rotary_pos_emb(self, x):\n        cos = self._cos_cached[:, :, : x.shape[-2], :]\n        sin = self._sin_cached[:, :, : x.shape[-2], :]\n\n        return (x * cos) + (self.rotate_half(x) * sin)\n\n    @staticmethod\n    def rotate_half(x):\n        x1, x2 = x.chunk(2, dim=-1)\n        return torch.cat((-x2, x1), dim=-1)\n
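
The class above rotates each pair (x[j], x[j + dim/2]) by an angle proportional to the token position, which is why attention scores end up depending on relative positions. Below is a minimal pure-Python sketch of that property for a single vector; the helper names rotary and dot and the sample vectors are assumptions made for illustration, not part of the library.

```python
import math


def rotary(x, position, base=10000.0):
    # Rotate the pair (x[j], x[j + half]) by position * inv_freq[j],
    # mirroring x * cos + rotate_half(x) * sin from the class above.
    dim = len(x)
    half = dim // 2
    out = [0.0] * dim
    for j in range(half):
        angle = position * (1.0 / base ** (2 * j / dim))
        cos, sin = math.cos(angle), math.sin(angle)
        out[j] = x[j] * cos - x[j + half] * sin
        out[j + half] = x[j + half] * cos + x[j] * sin
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.3, -1.2, 0.5, 0.8]
k = [1.0, 0.4, -0.7, 0.2]

# Both pairs of positions are 3 apart, so the two scores should agree.
score_a = dot(rotary(q, 5), rotary(k, 2))
score_b = dot(rotary(q, 13), rotary(k, 10))
print(abs(score_a - score_b) < 1e-9)
```

Because only the offset between positions matters, no learned table has to be resized when longer sequences arrive.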
"},{"location":"module/embeddings/#multimolecule.module.embeddings.SinusoidalEmbedding","title":"SinusoidalEmbedding","text":"

Bases: Embedding

Sinusoidal positional embeddings for inputs with any length.

Freezing

The embeddings are frozen and cannot be trained. They will not be saved in the model\u2019s state_dict.

Padding Idx

Padding symbols are ignored if the padding_idx is specified.

Sequence Length

These embeddings get automatically extended in forward if more positions are needed.

Source code in multimolecule/module/embeddings/sinusoidal.py Python
@PositionEmbeddingRegistry.register(\"sinusoidal\")\n@PositionEmbeddingRegistryHF.register(\"sinusoidal\")\nclass SinusoidalEmbedding(nn.Embedding):\n    r\"\"\"\n    Sinusoidal positional embeddings for inputs with any length.\n\n    Note: **Freezing**\n        The embeddings are frozen and cannot be trained.\n        They will not be saved in the model's state_dict.\n\n    Tip: **Padding Idx**\n        Padding symbols are ignored if the padding_idx is specified.\n\n    Success: **Sequence Length**\n        These embeddings get automatically extended in forward if more positions is needed.\n    \"\"\"\n\n    _is_hf_initialized = True\n\n    def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: int | None = None, bias: int = 0):\n        weight = self.get_embedding(num_embeddings, embedding_dim, padding_idx)\n        super().__init__(num_embeddings, embedding_dim, padding_idx, _weight=weight.detach(), _freeze=True)\n        self.bias = bias\n\n    def update_weight(self, num_embeddings: int, embedding_dim: int, padding_idx: int | None = None):\n        weight = self.get_embedding(num_embeddings, embedding_dim, padding_idx).to(\n            dtype=self.weight.dtype, device=self.weight.device  # type: ignore[has-type]\n        )\n        self.weight = nn.Parameter(weight.detach(), requires_grad=False)\n\n    @staticmethod\n    def get_embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None) -> Tensor:\n        \"\"\"\n        Build sinusoidal embeddings.\n\n        This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of\n        \"Attention Is All You Need\".\n        \"\"\"\n        half_dim = embedding_dim // 2\n        emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -(math.log(10000) / (half_dim - 1)))\n        emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)\n        emb = torch.cat([torch.sin(emb), torch.cos(emb)], 
dim=1).view(num_embeddings, -1)\n        if embedding_dim % 2 == 1:\n            # zero pad\n            emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)\n        if padding_idx is not None:\n            emb[padding_idx, :] = 0\n        return emb\n\n    @staticmethod\n    def get_position_ids(tensor, padding_idx: int | None = None):\n        \"\"\"\n        Replace non-padding symbols with their position numbers.\n\n        Position numbers begin at padding_idx+1. Padding symbols are ignored.\n        \"\"\"\n        # The series of casts and type-conversions here are carefully\n        # balanced to both work with ONNX export and XLA. In particular XLA\n        # prefers ints, cumsum defaults to output longs, and ONNX doesn't know\n        # how to handle the dtype kwarg in cumsum.\n        if padding_idx is None:\n            return torch.cumsum(tensor.new_ones(tensor.size(1)).long(), dim=0) - 1\n        mask = tensor.ne(padding_idx).int()\n        return (torch.cumsum(mask, dim=1).type_as(mask) * mask).long() + padding_idx\n\n    def forward(self, input_ids: Tensor) -> Tensor:\n        _, seq_len = input_ids.shape[:2]\n        # expand embeddings if needed\n        max_pos = seq_len + self.bias + 1\n        if self.padding_idx is not None:\n            max_pos += self.padding_idx\n        if max_pos > self.weight.size(0):\n            self.update_weight(max_pos, self.embedding_dim, self.padding_idx)\n        # Need to shift the position ids by the padding index\n        position_ids = self.get_position_ids(input_ids, self.padding_idx) + self.bias\n        return super().forward(position_ids)\n\n    def state_dict(self, destination=None, prefix=\"\", keep_vars=False):\n        return {}\n\n    def load_state_dict(self, *args, state_dict, strict=True):\n        return\n\n    def _load_from_state_dict(\n        self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n    ):\n        return\n
"},{"location":"module/embeddings/#multimolecule.module.embeddings.SinusoidalEmbedding.get_embedding","title":"get_embedding staticmethod","text":"Python
get_embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None) -> Tensor\n

Build sinusoidal embeddings.

This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of \u201cAttention Is All You Need\u201d.

Source code in multimolecule/module/embeddings/sinusoidal.py Python
@staticmethod\ndef get_embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None) -> Tensor:\n    \"\"\"\n    Build sinusoidal embeddings.\n\n    This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of\n    \"Attention Is All You Need\".\n    \"\"\"\n    half_dim = embedding_dim // 2\n    emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -(math.log(10000) / (half_dim - 1)))\n    emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)\n    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)\n    if embedding_dim % 2 == 1:\n        # zero pad\n        emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)\n    if padding_idx is not None:\n        emb[padding_idx, :] = 0\n    return emb\n
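
To make the layout of the table concrete, here is a pure-Python sketch that follows the same steps as get_embedding: all sine columns first, then all cosine columns, an optional zero column for odd dimensions, and a zeroed padding row. The function name sinusoidal_table is an illustrative assumption, and, like the original, the sketch assumes embedding_dim is at least 4 so that half_dim - 1 is non-zero.

```python
import math


def sinusoidal_table(num_embeddings, embedding_dim, padding_idx=None):
    half_dim = embedding_dim // 2
    # Geometric progression of frequencies from 1 down to 1/10000.
    scale = math.log(10000) / (half_dim - 1)
    freqs = [math.exp(-scale * i) for i in range(half_dim)]
    table = []
    for pos in range(num_embeddings):
        row = [math.sin(pos * f) for f in freqs] + [math.cos(pos * f) for f in freqs]
        if embedding_dim % 2 == 1:
            row.append(0.0)  # zero pad odd dimensions
        table.append(row)
    if padding_idx is not None:
        table[padding_idx] = [0.0] * embedding_dim
    return table


table = sinusoidal_table(num_embeddings=4, embedding_dim=6, padding_idx=1)
print(table[0])  # position 0: sines are 0, cosines are 1
```

Grouping sines and cosines into two halves, rather than interleaving them, is the tensor2tensor convention the docstring refers to.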
"},{"location":"module/embeddings/#multimolecule.module.embeddings.SinusoidalEmbedding.get_position_ids","title":"get_position_ids staticmethod","text":"Python
get_position_ids(tensor, padding_idx: int | None = None)\n

Replace non-padding symbols with their position numbers.

Position numbers begin at padding_idx+1. Padding symbols are ignored.

Source code in multimolecule/module/embeddings/sinusoidal.py Python
@staticmethod\ndef get_position_ids(tensor, padding_idx: int | None = None):\n    \"\"\"\n    Replace non-padding symbols with their position numbers.\n\n    Position numbers begin at padding_idx+1. Padding symbols are ignored.\n    \"\"\"\n    # The series of casts and type-conversions here are carefully\n    # balanced to both work with ONNX export and XLA. In particular XLA\n    # prefers ints, cumsum defaults to output longs, and ONNX doesn't know\n    # how to handle the dtype kwarg in cumsum.\n    if padding_idx is None:\n        return torch.cumsum(tensor.new_ones(tensor.size(1)).long(), dim=0) - 1\n    mask = tensor.ne(padding_idx).int()\n    return (torch.cumsum(mask, dim=1).type_as(mask) * mask).long() + padding_idx\n
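
The cumulative-sum trick above can be traced by hand on a single row. The sketch below is a pure-Python rendering for one sequence; the function name position_ids and the token ids are assumptions for illustration only.

```python
def position_ids(row, padding_idx=None):
    # Without a padding index, positions are simply 0, 1, 2, ...
    if padding_idx is None:
        return list(range(len(row)))
    ids, running = [], 0
    for tok in row:
        keep = int(tok != padding_idx)  # mask: 1 for real tokens, 0 for padding
        running += keep                 # cumulative sum of the mask
        ids.append(running * keep + padding_idx)
    return ids


ids = position_ids([5, 6, 7, 1, 1], padding_idx=1)
print(ids)  # [2, 3, 4, 1, 1]
```

Real tokens are numbered from padding_idx + 1 upward, while every padding slot maps to padding_idx itself, whose embedding row is zeroed in get_embedding.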
"},{"location":"module/heads/","title":"heads","text":"

heads provide a collection of pre-defined prediction heads.

heads take in either a ModelOutput, a dict, or a tuple as input. They automatically look for the model output required for prediction and process it accordingly.

Some prediction heads may require additional information, such as the attention_mask or the input_ids, like ContactPredictionHead. These additional arguments can be passed in as arguments/keyword arguments.

Note that heads use the same ModelOutput conventions as Transformers. If the model output is a tuple, we consider the first element as the pooler_output, the second element as the last_hidden_state, and the last element as the attention_map. It is the user\u2019s responsibility to ensure that the model output is correctly formatted.

If the model output is a ModelOutput or a dict, the heads will look for the HeadConfig.output_name from the model output. You can specify the output_name in the HeadConfig to ensure that the heads can correctly locate the required tensor.
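
The lookup rules described above can be sketched in a few lines. This is not the library code: the function locate_output and the placeholder values are hypothetical, and the sketch relies on the assumption that a ModelOutput behaves like a dict, so one branch covers both mapping cases.

```python
def locate_output(outputs, output_name):
    # Mapping-style outputs: look the tensor up by name.
    if isinstance(outputs, dict):
        return outputs[output_name]
    # Tuple outputs: positions follow the convention stated in the text.
    tuple_positions = {'pooler_output': 0, 'last_hidden_state': 1, 'attention_map': -1}
    return outputs[tuple_positions[output_name]]


# Strings stand in for tensors to keep the sketch dependency-free.
as_dict = {'pooler_output': 'pooled', 'last_hidden_state': 'hidden'}
as_tuple = ('pooled', 'hidden', 'attn')

print(locate_output(as_dict, 'last_hidden_state'))
print(locate_output(as_tuple, 'attention_map'))
```

With tuple outputs nothing can be validated by name, which is why the text leaves correct ordering to the caller.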

"},{"location":"module/heads/#multimolecule.module.heads.config","title":"multimolecule.module.heads.config","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig","title":"HeadConfig","text":"

Bases: BaseHeadConfig

Configuration class for a prediction head.

Parameters:

Name Type Description Default

num_labels Optional[int]

Number of labels to use in the last layer added to the model, typically for a classification task.

The head should look for Config.num_labels if this is None.

None

problem_type Optional[str]

Problem type for XxxForYyyPrediction models. Can be one of \"binary\", \"regression\", \"multiclass\" or \"multilabel\".

The head should look for Config.problem_type if this is None.

None

hidden_size Optional[int]

Dimensionality of the encoder layers and the pooler layer.

The head should look for Config.hidden_size if this is None.

None

dropout float

The dropout ratio for the hidden states.

0.0

transform Optional[str]

The transform operation applied to hidden states.

None

transform_act Optional[str]

The activation function of transform applied to hidden states.

\"gelu\"

bias bool

Whether to apply bias to the final prediction layer.

True

act Optional[str]

The activation function of the final prediction output.

None

layer_norm_eps float

The epsilon used by the layer normalization layers.

1e-12

output_name Optional[str]

The name of the tensor required in model outputs.

If this is None, the default output name of the corresponding head is used.

None

type Optional[str]

The type of the head in the model.

This is used by MultiMoleculeModel to construct heads.

None

Source code in multimolecule/module/heads/config.py Python
class HeadConfig(BaseHeadConfig):\n    r\"\"\"\n    Configuration class for a prediction head.\n\n    Args:\n        num_labels:\n            Number of labels to use in the last layer added to the model, typically for a classification task.\n\n            Head should look for [`Config.num_labels`][multimolecule.PreTrainedConfig] if is `None`.\n        problem_type:\n            Problem type for `XxxForYyyPrediction` models. Can be one of `\"binary\"`, `\"regression\"`,\n            `\"multiclass\"` or `\"multilabel\"`.\n\n            Head should look for [`Config.problem_type`][multimolecule.PreTrainedConfig] if is `None`.\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n\n            Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n        dropout:\n            The dropout ratio for the hidden states.\n        transform:\n            The transform operation applied to hidden states.\n        transform_act:\n            The activation function of transform applied to hidden states.\n        bias:\n            Whether to apply bias to the final prediction layer.\n        act:\n            The activation function of the final prediction output.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        output_name:\n            The name of the tensor required in model outputs.\n\n            If is `None`, will use the default output name of the corresponding head.\n        type:\n            The type of the head in the model.\n\n            This is used by [`MultiMoleculeModel`][multimolecule.MultiMoleculeModel] to construct heads.\n    \"\"\"\n\n    num_labels: Optional[int] = None\n    problem_type: Optional[str] = None\n    hidden_size: Optional[int] = None\n    dropout: float = 0.0\n    transform: Optional[str] = None\n    transform_act: Optional[str] = \"gelu\"\n    bias: bool = True\n    act: Optional[str] = None\n    layer_norm_eps: 
float = 1e-12\n    output_name: Optional[str] = None\n    type: Optional[str] = None\n
"},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(num_labels)","title":"num_labels","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(problem_type)","title":"problem_type","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(hidden_size)","title":"hidden_size","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(dropout)","title":"dropout","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(transform)","title":"transform","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(transform_act)","title":"transform_act","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(bias)","title":"bias","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(act)","title":"act","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(type)","title":"type","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig","title":"MaskedLMHeadConfig","text":"

Bases: BaseHeadConfig

Configuration class for a Masked Language Modeling head.

Parameters:

hidden_size (Optional[int], default None): Dimensionality of the encoder layers and the pooler layer. If None, the head should fall back to Config.hidden_size.

dropout (float, default 0.0): The dropout ratio for the hidden states.

transform (Optional[str], default 'nonlinear'): The transform operation applied to hidden states.

transform_act (Optional[str], default 'gelu'): The activation function of the transform applied to hidden states.

bias (bool, default True): Whether to apply bias to the final prediction layer.

act (Optional[str], default None): The activation function of the final prediction output.

layer_norm_eps (float, default 1e-12): The epsilon used by the layer normalization layers.

output_name (Optional[str], default None): The name of the tensor required in model outputs. If None, the default output name of the corresponding head is used.

Source code in multimolecule/module/heads/config.py Python
class MaskedLMHeadConfig(BaseHeadConfig):\n    r\"\"\"\n    Configuration class for a Masked Language Modeling head.\n\n    Args:\n        hidden_size:\n            Dimensionality of the encoder layers and the pooler layer.\n\n            Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n        dropout:\n            The dropout ratio for the hidden states.\n        transform:\n            The transform operation applied to hidden states.\n        transform_act:\n            The activation function of transform applied to hidden states.\n        bias:\n            Whether to apply bias to the final prediction layer.\n        act:\n            The activation function of the final prediction output.\n        layer_norm_eps:\n            The epsilon used by the layer normalization layers.\n        output_name:\n            The name of the tensor required in model outputs.\n\n            If is `None`, will use the default output name of the corresponding head.\n    \"\"\"\n\n    hidden_size: Optional[int] = None\n    dropout: float = 0.0\n    transform: Optional[str] = \"nonlinear\"\n    transform_act: Optional[str] = \"gelu\"\n    bias: bool = True\n    act: Optional[str] = None\n    layer_norm_eps: float = 1e-12\n    output_name: Optional[str] = None\n
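The fallback described for hidden_size (use the model-level value when the head leaves it as None) can be sketched as below. This is a minimal illustration, not the library's implementation: ModelConfigSketch, MaskedLMHeadConfigSketch, and resolve_hidden_size are hypothetical names, and only the field names and defaults are taken from the source above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfigSketch:
    # Hypothetical stand-in for the model-level PreTrainedConfig.
    hidden_size: int = 768


@dataclass
class MaskedLMHeadConfigSketch:
    # Field names and defaults mirror MaskedLMHeadConfig in the source above.
    hidden_size: Optional[int] = None
    dropout: float = 0.0
    transform: Optional[str] = 'nonlinear'
    transform_act: Optional[str] = 'gelu'
    bias: bool = True
    act: Optional[str] = None
    layer_norm_eps: float = 1e-12
    output_name: Optional[str] = None


def resolve_hidden_size(head_config, model_config):
    # Assumed fallback rule: an unset head value defers to the model config.
    if head_config.hidden_size is None:
        return model_config.hidden_size
    return head_config.hidden_size
```

An explicit head-level value wins over the model-level one; only None triggers the fallback.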
"},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(hidden_size)","title":"hidden_size","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(dropout)","title":"dropout","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(transform)","title":"transform","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(transform_act)","title":"transform_act","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(bias)","title":"bias","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(act)","title":"act","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(layer_norm_eps)","title":"layer_norm_eps","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence","title":"multimolecule.module.heads.sequence","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead","title":"SequencePredictionHead","text":"

Bases: PredictionHead

Head for sequence-level tasks.

Parameters:

config (PreTrainedConfig, required): The configuration object for the model.

head_config (HeadConfig | None, default None): The configuration object for the head. If None, the configuration from config is used.

Source code in multimolecule/module/heads/sequence.py Python
@HeadRegistry.register(\"sequence\")\nclass SequencePredictionHead(PredictionHead):\n    r\"\"\"\n    Head for tasks in sequence-level.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    output_name: str = \"pooler_output\"\n    r\"\"\"The default output to use for the head.\"\"\"\n\n    def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n        super().__init__(config, head_config)\n        if head_config is not None and head_config.output_name is not None:\n            self.output_name = head_config.output_name\n\n    def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n        self,\n        outputs: ModelOutput | Tuple[Tensor, ...],\n        labels: Tensor | None = None,\n        output_name: str | None = None,\n        **kwargs,\n    ) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the SequencePredictionHead.\n\n        Args:\n            outputs: The outputs of the model.\n            labels: The labels for the head.\n            output_name: The name of the output to use.\n                Defaults to `self.output_name`.\n        \"\"\"\n        if isinstance(outputs, (Mapping, ModelOutput)):\n            output = outputs[output_name or self.output_name]\n        elif isinstance(outputs, tuple):\n            output = outputs[1]\n        return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.output_name","title":"output_name class-attribute instance-attribute","text":"Python
output_name: str = 'pooler_output'\n

The default output to use for the head.

"},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward","title":"forward","text":"Python
forward(outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n

Forward pass of the SequencePredictionHead.

Parameters:

outputs (ModelOutput | Tuple[Tensor, ...], required): The outputs of the model.

labels (Tensor | None, default None): The labels for the head.

output_name (str | None, default None): The name of the output to use. Defaults to self.output_name.

Source code in multimolecule/module/heads/sequence.py Python
def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n    self,\n    outputs: ModelOutput | Tuple[Tensor, ...],\n    labels: Tensor | None = None,\n    output_name: str | None = None,\n    **kwargs,\n) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the SequencePredictionHead.\n\n    Args:\n        outputs: The outputs of the model.\n        labels: The labels for the head.\n        output_name: The name of the output to use.\n            Defaults to `self.output_name`.\n    \"\"\"\n    if isinstance(outputs, (Mapping, ModelOutput)):\n        output = outputs[output_name or self.output_name]\n    elif isinstance(outputs, tuple):\n        output = outputs[1]\n    return super().forward(output, labels, **kwargs)\n
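The output selection at the top of forward can be sketched without any tensors. The helper below is hypothetical (select_output and its tuple_index parameter are not part of the library); it only mirrors the branching shown above: mapping-like outputs are indexed by name, tuples by position. Note that the sequence head as listed has no else branch, so an unsupported type would surface as an unbound local variable; the sketch raises a ValueError instead, as the token head does.

```python
from collections.abc import Mapping


def select_output(outputs, default_name='pooler_output', output_name=None, tuple_index=1):
    # Mapping-like outputs are indexed by name; an explicit name wins over the default.
    if isinstance(outputs, Mapping):
        return outputs[output_name or default_name]
    # Tuple outputs are indexed by position (1 for the pooled output in the source above).
    if isinstance(outputs, tuple):
        return outputs[tuple_index]
    # Assumption: fail loudly on anything else, as TokenPredictionHead.forward does.
    raise ValueError(f'Unsupported type for outputs: {type(outputs)}')
```

Passing output_name at call time overrides the head default for that call only.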
"},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward(outputs)","title":"outputs","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.token","title":"multimolecule.module.heads.token","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead","title":"TokenPredictionHead","text":"

Bases: PredictionHead

Head for token-level tasks.

Parameters:

config (PreTrainedConfig, required): The configuration object for the model.

head_config (HeadConfig | None, default None): The configuration object for the head. If None, the configuration from config is used.

Source code in multimolecule/module/heads/token.py Python
@HeadRegistry.token.register(\"single\", default=True)\n@TokenHeadRegistryHF.register(\"single\", default=True)\nclass TokenPredictionHead(PredictionHead):\n    r\"\"\"\n    Head for tasks in token-level.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    output_name: str = \"last_hidden_state\"\n    r\"\"\"The default output to use for the head.\"\"\"\n\n    def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n        super().__init__(config, head_config)\n        if head_config is not None and head_config.output_name is not None:\n            self.output_name = head_config.output_name\n\n    def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n        self,\n        outputs: ModelOutput | Tuple[Tensor, ...],\n        attention_mask: Tensor | None = None,\n        input_ids: NestedTensor | Tensor | None = None,\n        labels: Tensor | None = None,\n        output_name: str | None = None,\n        **kwargs,\n    ) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the TokenPredictionHead.\n\n        Args:\n            outputs: The outputs of the model.\n            attention_mask: The attention mask for the inputs.\n            input_ids: The input ids for the inputs.\n            labels: The labels for the head.\n            output_name: The name of the output to use.\n                Defaults to `self.output_name`.\n        \"\"\"\n        if isinstance(outputs, (Mapping, ModelOutput)):\n            output = outputs[output_name or self.output_name]\n        elif isinstance(outputs, tuple):\n            output = outputs[0]\n        else:\n            raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n        if attention_mask is None:\n            attention_mask = self._get_attention_mask(input_ids)\n        output = 
output * attention_mask.unsqueeze(-1)\n        output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n        return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.output_name","title":"output_name class-attribute instance-attribute","text":"Python
output_name: str = 'last_hidden_state'\n

The default output to use for the head.

"},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward","title":"forward","text":"Python
forward(outputs: ModelOutput | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n

Forward pass of the TokenPredictionHead.

Parameters:

outputs (ModelOutput | Tuple[Tensor, ...], required): The outputs of the model.

attention_mask (Tensor | None, default None): The attention mask for the inputs.

input_ids (NestedTensor | Tensor | None, default None): The input ids for the inputs.

labels (Tensor | None, default None): The labels for the head.

output_name (str | None, default None): The name of the output to use. Defaults to self.output_name.

Source code in multimolecule/module/heads/token.py Python
def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n    self,\n    outputs: ModelOutput | Tuple[Tensor, ...],\n    attention_mask: Tensor | None = None,\n    input_ids: NestedTensor | Tensor | None = None,\n    labels: Tensor | None = None,\n    output_name: str | None = None,\n    **kwargs,\n) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the TokenPredictionHead.\n\n    Args:\n        outputs: The outputs of the model.\n        attention_mask: The attention mask for the inputs.\n        input_ids: The input ids for the inputs.\n        labels: The labels for the head.\n        output_name: The name of the output to use.\n            Defaults to `self.output_name`.\n    \"\"\"\n    if isinstance(outputs, (Mapping, ModelOutput)):\n        output = outputs[output_name or self.output_name]\n    elif isinstance(outputs, tuple):\n        output = outputs[0]\n    else:\n        raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n    if attention_mask is None:\n        attention_mask = self._get_attention_mask(input_ids)\n    output = output * attention_mask.unsqueeze(-1)\n    output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n    return super().forward(output, labels, **kwargs)\n
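The two preprocessing steps in this forward pass (zeroing padded positions, then dropping special tokens) can be sketched for a single sequence with plain lists. The _remove_special_tokens helper is not shown in this section, so the bos/eos handling below is an assumption modelled on the bos/eos handling visible in ContactPredictionHead.forward; mask_and_strip and its flags are hypothetical names.

```python
def mask_and_strip(hidden, attention_mask, has_bos=True, has_eos=True):
    # hidden: one sequence as a list of per-token vectors; attention_mask: list of 0/1.
    # Step 1: zero out padded positions (the output * attention_mask.unsqueeze(-1) step).
    masked = [[value * keep for value in vector] for vector, keep in zip(hidden, attention_mask)]
    mask = list(attention_mask)
    # Step 2 (assumed): drop the leading bos/cls position.
    if has_bos:
        masked, mask = masked[1:], mask[1:]
    # Step 3 (assumed): zero the eos position, then drop the final column.
    if has_eos:
        last = sum(mask) - 1  # index of the last attended token, taken to be eos
        if last >= 0:
            masked[last] = [0.0] * len(masked[last])
            mask[last] = 0
        masked, mask = masked[:-1], mask[:-1]
    return masked, mask
```

For a padded input laid out as bos, a, b, eos, pad, the sketch returns the vectors for a and b followed by one zeroed position.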
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(outputs)","title":"outputs","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(attention_mask)","title":"attention_mask","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(input_ids)","title":"input_ids","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead","title":"TokenKMerHead","text":"

Bases: PredictionHead

Head for token-level tasks with k-mer inputs.

Parameters:

config (PreTrainedConfig, required): The configuration object for the model.

head_config (HeadConfig | None, default None): The configuration object for the head. If None, the configuration from config is used.

Source code in multimolecule/module/heads/token.py Python
@HeadRegistry.register(\"token.kmer\")\n@TokenHeadRegistryHF.register(\"kmer\")\nclass TokenKMerHead(PredictionHead):\n    r\"\"\"\n    Head for tasks in token-level with kmer inputs.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    output_name: str = \"last_hidden_state\"\n    r\"\"\"The default output to use for the head.\"\"\"\n\n    def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n        super().__init__(config, head_config)\n        self.nmers = config.nmers\n        if head_config is not None and head_config.output_name is not None:\n            self.output_name = head_config.output_name\n        # Do not pass bos_token_id and eos_token_id to unfold_kmer_embeddings\n        # As they will be removed in preprocess\n        self.unfold_kmer_embeddings = partial(unfold_kmer_embeddings, nmers=self.nmers)\n\n    def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n        self,\n        outputs: ModelOutput | Tuple[Tensor, ...],\n        attention_mask: Tensor | None = None,\n        input_ids: NestedTensor | Tensor | None = None,\n        labels: Tensor | None = None,\n        output_name: str | None = None,\n        **kwargs,\n    ) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the TokenKMerHead.\n\n        Args:\n            outputs: The outputs of the model.\n            attention_mask: The attention mask for the inputs.\n            input_ids: The input ids for the inputs.\n            labels: The labels for the head.\n            output_name: The name of the output to use.\n                Defaults to `self.output_name`.\n        \"\"\"\n        if isinstance(outputs, (Mapping, ModelOutput)):\n            output = outputs[output_name or self.output_name]\n        elif isinstance(outputs, tuple):\n            output = 
outputs[0]\n        else:\n            raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n        if attention_mask is None:\n            attention_mask = self._get_attention_mask(input_ids)\n        output = output * attention_mask.unsqueeze(-1)\n        output, attention_mask, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n        output = self.unfold_kmer_embeddings(output, attention_mask)\n        return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.output_name","title":"output_name class-attribute instance-attribute","text":"Python
output_name: str = 'last_hidden_state'\n

The default output to use for the head.

"},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward","title":"forward","text":"Python
forward(outputs: ModelOutput | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n

Forward pass of the TokenKMerHead.

Parameters:

outputs (ModelOutput | Tuple[Tensor, ...], required): The outputs of the model.

attention_mask (Tensor | None, default None): The attention mask for the inputs.

input_ids (NestedTensor | Tensor | None, default None): The input ids for the inputs.

labels (Tensor | None, default None): The labels for the head.

output_name (str | None, default None): The name of the output to use. Defaults to self.output_name.

Source code in multimolecule/module/heads/token.py Python
def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n    self,\n    outputs: ModelOutput | Tuple[Tensor, ...],\n    attention_mask: Tensor | None = None,\n    input_ids: NestedTensor | Tensor | None = None,\n    labels: Tensor | None = None,\n    output_name: str | None = None,\n    **kwargs,\n) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the TokenKMerHead.\n\n    Args:\n        outputs: The outputs of the model.\n        attention_mask: The attention mask for the inputs.\n        input_ids: The input ids for the inputs.\n        labels: The labels for the head.\n        output_name: The name of the output to use.\n            Defaults to `self.output_name`.\n    \"\"\"\n    if isinstance(outputs, (Mapping, ModelOutput)):\n        output = outputs[output_name or self.output_name]\n    elif isinstance(outputs, tuple):\n        output = outputs[0]\n    else:\n        raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n    if attention_mask is None:\n        attention_mask = self._get_attention_mask(input_ids)\n    output = output * attention_mask.unsqueeze(-1)\n    output, attention_mask, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n    output = self.unfold_kmer_embeddings(output, attention_mask)\n    return super().forward(output, labels, **kwargs)\n
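The extra step in this head, unfold_kmer_embeddings, is not listed in this section. A minimal sketch of one plausible behaviour follows, assuming overlapping k-mers with stride 1 and a sliding average: each nucleotide position receives the mean of the k-mer token embeddings that cover it. The function name unfold_kmers and the list-based representation are illustrative only.

```python
def unfold_kmers(token_embeddings, nmers):
    # token_embeddings: one vector per overlapping k-mer token (stride 1 assumed).
    num_tokens = len(token_embeddings)
    seq_length = num_tokens + nmers - 1  # number of nucleotides covered
    dim = len(token_embeddings[0])
    unfolded = []
    for position in range(seq_length):
        # k-mer tokens first..last are the ones that contain this nucleotide.
        first = max(0, position - nmers + 1)
        last = min(position, num_tokens - 1)
        covering = token_embeddings[first:last + 1]
        unfolded.append([sum(vector[d] for vector in covering) / len(covering) for d in range(dim)])
    return unfolded
```

With nmers set to 2, two k-mer tokens unfold to three nucleotide positions, and the middle position averages both tokens.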
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(outputs)","title":"outputs","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(attention_mask)","title":"attention_mask","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(input_ids)","title":"input_ids","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.contact","title":"multimolecule.module.heads.contact","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead","title":"ContactPredictionHead","text":"

Bases: PredictionHead

Head for contact-level tasks.

Performs symmetrization and average product correction.

Parameters:

config (PreTrainedConfig, required): The configuration object for the model.

head_config (HeadConfig | None, default None): The configuration object for the head. If None, the configuration from config is used.

Source code in multimolecule/module/heads/contact.py Python
@HeadRegistry.contact.register(\"attention\")\nclass ContactPredictionHead(PredictionHead):\n    r\"\"\"\n    Head for tasks in contact-level.\n\n    Performs symmetrization, and average product correct.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    output_name: str = \"attentions\"\n    r\"\"\"The default output to use for the head.\"\"\"\n\n    requires_attention: bool = True\n\n    def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n        super().__init__(config, head_config)\n        self.config.hidden_size = config.num_hidden_layers * config.num_attention_heads\n        num_layers = self.config.get(\"num_layers\", 16)\n        num_channels = self.config.get(\"num_channels\", self.config.hidden_size // 10)  # type: ignore[operator]\n        block = self.config.get(\"block\", \"auto\")\n        self.decoder = ResNet(\n            num_layers=num_layers,\n            hidden_size=self.config.hidden_size,  # type: ignore[arg-type]\n            block=block,\n            num_channels=num_channels,\n            num_labels=self.num_labels,\n        )\n        if head_config is not None and head_config.output_name is not None:\n            self.output_name = head_config.output_name\n\n    def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n        self,\n        outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n        attention_mask: Tensor | None = None,\n        input_ids: NestedTensor | Tensor | None = None,\n        labels: Tensor | None = None,\n        output_name: str | None = None,\n        **kwargs,\n    ) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the ContactPredictionHead.\n\n        Args:\n            outputs: The outputs of the model.\n            attention_mask: The attention mask for the inputs.\n            
input_ids: The input ids for the inputs.\n            labels: The labels for the head.\n            output_name: The name of the output to use.\n                Defaults to `self.output_name`.\n        \"\"\"\n\n        if isinstance(outputs, (Mapping, ModelOutput)):\n            output = outputs[output_name or self.output_name]\n        elif isinstance(outputs, tuple):\n            output = outputs[-1]\n        attentions = torch.stack(output, 1)\n\n        # In the original model, attentions for padding tokens are completely zeroed out.\n        # This makes no difference most of the time because the other tokens won't attend to them,\n        # but it does for the contact prediction task, which takes attentions as input,\n        # so we have to mimic that here.\n        if attention_mask is None:\n            attention_mask = self._get_attention_mask(input_ids)\n        attention_mask = attention_mask.unsqueeze(1) * attention_mask.unsqueeze(2)\n        attentions = attentions * attention_mask[:, None, None, :, :]\n\n        # remove cls token attentions\n        if self.bos_token_id is not None:\n            attentions = attentions[..., 1:, 1:]\n            attention_mask = attention_mask[..., 1:]\n            if input_ids is not None:\n                input_ids = input_ids[..., 1:]\n        # remove eos token attentions\n        if self.eos_token_id is not None:\n            if input_ids is not None:\n                eos_mask = input_ids.ne(self.eos_token_id).to(attentions)\n            else:\n                last_valid_indices = attention_mask.sum(dim=-1)\n                seq_length = attention_mask.size(-1)\n                eos_mask = torch.arange(seq_length, device=attentions.device).unsqueeze(0) == last_valid_indices\n            eos_mask = eos_mask.unsqueeze(1) * eos_mask.unsqueeze(2)\n            attentions = attentions * eos_mask[:, None, None, :, :]\n            attentions = attentions[..., :-1, :-1]\n\n        # features: batch x channels x input_ids 
x input_ids (symmetric)\n        batch_size, layers, heads, seqlen, _ = attentions.size()\n        attentions = attentions.view(batch_size, layers * heads, seqlen, seqlen)\n        attentions = attentions.to(self.decoder.proj.weight.device)\n        attentions = average_product_correct(symmetrize(attentions))\n        attentions = attentions.permute(0, 2, 3, 1).squeeze(3)\n\n        return super().forward(attentions, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.output_name","title":"output_name class-attribute instance-attribute","text":"Python
output_name: str = 'attentions'\n

The default output to use for the head.

"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward","title":"forward","text":"Python
forward(outputs: ModelOutput | Mapping | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n

Forward pass of the ContactPredictionHead.

Parameters:

outputs (ModelOutput | Mapping | Tuple[Tensor, ...], required): The outputs of the model.

attention_mask (Tensor | None, default None): The attention mask for the inputs.

input_ids (NestedTensor | Tensor | None, default None): The input ids for the inputs.

labels (Tensor | None, default None): The labels for the head.

output_name (str | None, default None): The name of the output to use. Defaults to self.output_name.

Source code in multimolecule/module/heads/contact.py Python
def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n    self,\n    outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n    attention_mask: Tensor | None = None,\n    input_ids: NestedTensor | Tensor | None = None,\n    labels: Tensor | None = None,\n    output_name: str | None = None,\n    **kwargs,\n) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the ContactPredictionHead.\n\n    Args:\n        outputs: The outputs of the model.\n        attention_mask: The attention mask for the inputs.\n        input_ids: The input ids for the inputs.\n        labels: The labels for the head.\n        output_name: The name of the output to use.\n            Defaults to `self.output_name`.\n    \"\"\"\n\n    if isinstance(outputs, (Mapping, ModelOutput)):\n        output = outputs[output_name or self.output_name]\n    elif isinstance(outputs, tuple):\n        output = outputs[-1]\n    attentions = torch.stack(output, 1)\n\n    # In the original model, attentions for padding tokens are completely zeroed out.\n    # This makes no difference most of the time because the other tokens won't attend to them,\n    # but it does for the contact prediction task, which takes attentions as input,\n    # so we have to mimic that here.\n    if attention_mask is None:\n        attention_mask = self._get_attention_mask(input_ids)\n    attention_mask = attention_mask.unsqueeze(1) * attention_mask.unsqueeze(2)\n    attentions = attentions * attention_mask[:, None, None, :, :]\n\n    # remove cls token attentions\n    if self.bos_token_id is not None:\n        attentions = attentions[..., 1:, 1:]\n        attention_mask = attention_mask[..., 1:]\n        if input_ids is not None:\n            input_ids = input_ids[..., 1:]\n    # remove eos token attentions\n    if self.eos_token_id is not None:\n        if input_ids is not None:\n            eos_mask = input_ids.ne(self.eos_token_id).to(attentions)\n        else:\n            last_valid_indices = 
attention_mask.sum(dim=-1)\n            seq_length = attention_mask.size(-1)\n            eos_mask = torch.arange(seq_length, device=attentions.device).unsqueeze(0) == last_valid_indices\n        eos_mask = eos_mask.unsqueeze(1) * eos_mask.unsqueeze(2)\n        attentions = attentions * eos_mask[:, None, None, :, :]\n        attentions = attentions[..., :-1, :-1]\n\n    # features: batch x channels x input_ids x input_ids (symmetric)\n    batch_size, layers, heads, seqlen, _ = attentions.size()\n    attentions = attentions.view(batch_size, layers * heads, seqlen, seqlen)\n    attentions = attentions.to(self.decoder.proj.weight.device)\n    attentions = average_product_correct(symmetrize(attentions))\n    attentions = attentions.permute(0, 2, 3, 1).squeeze(3)\n\n    return super().forward(attentions, labels, **kwargs)\n
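The last two operations on the attention maps, symmetrization and average product correction (APC), can be sketched on a single square matrix with plain lists. The library helpers symmetrize and average_product_correct are not listed in this section, so the definitions below are assumptions that follow the usual formulation: add the matrix to its transpose, then subtract the product of row and column sums divided by the total sum. The function names are hypothetical.

```python
def symmetrize_sketch(matrix):
    # Make the map symmetric by adding its transpose.
    n = len(matrix)
    return [[matrix[i][j] + matrix[j][i] for j in range(n)] for i in range(n)]


def apc_sketch(matrix):
    # Average product correction: subtract the background expected from row and column sums.
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    total = sum(row_sums)
    return [[matrix[i][j] - row_sums[i] * col_sums[j] / total for j in range(n)] for i in range(n)]
```

In the head itself the same operations would act on the last two dimensions of a batch x channels x length x length tensor rather than on one matrix.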
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(outputs)","title":"outputs","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(attention_mask)","title":"attention_mask","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(input_ids)","title":"input_ids","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead","title":"ContactLogitsHead","text":"

Bases: PredictionHead

Head for contact-level tasks.

Builds a symmetric contact map from the pairwise product of token representations.

Parameters:

config (PreTrainedConfig, required): The configuration object for the model.

head_config (HeadConfig | None, default None): The configuration object for the head. If None, the configuration from config is used.

Source code in multimolecule/module/heads/contact.py Python
@HeadRegistry.contact.register(\"logits\")\nclass ContactLogitsHead(PredictionHead):\n    r\"\"\"\n    Head for tasks in contact-level.\n\n    Performs symmetrization, and average product correct.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    output_name: str = \"last_hidden_state\"\n    r\"\"\"The default output to use for the head.\"\"\"\n\n    requires_attention: bool = False\n\n    def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n        super().__init__(config, head_config)\n        num_layers = self.config.get(\"num_layers\", 16)\n        num_channels = self.config.get(\"num_channels\", self.config.hidden_size // 10)  # type: ignore[operator]\n        block = self.config.get(\"block\", \"auto\")\n        self.decoder = ResNet(\n            num_layers=num_layers,\n            hidden_size=self.config.hidden_size,  # type: ignore[arg-type]\n            block=block,\n            num_channels=num_channels,\n            num_labels=self.num_labels,\n        )\n        if head_config is not None and head_config.output_name is not None:\n            self.output_name = head_config.output_name\n\n    def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n        self,\n        outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n        attention_mask: Tensor | None = None,\n        input_ids: NestedTensor | Tensor | None = None,\n        labels: Tensor | None = None,\n        output_name: str | None = None,\n        **kwargs,\n    ) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the ContactPredictionHead.\n\n        Args:\n            outputs: The outputs of the model.\n            attention_mask: The attention mask for the inputs.\n            input_ids: The input ids for the inputs.\n            labels: The labels for the head.\n     
       output_name: The name of the output to use.\n                Defaults to `self.output_name`.\n        \"\"\"\n        if isinstance(outputs, (Mapping, ModelOutput)):\n            output = outputs[output_name or self.output_name]\n        elif isinstance(outputs, tuple):\n            output = outputs[0]\n        else:\n            raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n        if attention_mask is None:\n            attention_mask = self._get_attention_mask(input_ids)\n        output = output * attention_mask.unsqueeze(-1)\n        output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n        # make symmetric contact map\n        contact_map = output.unsqueeze(1) * output.unsqueeze(2)\n\n        return super().forward(contact_map, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.output_name","title":"output_name class-attribute instance-attribute","text":"Python
output_name: str = 'last_hidden_state'\n

The default output to use for the head.

"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward","title":"forward","text":"Python
forward(outputs: ModelOutput | Mapping | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n

Forward pass of the ContactLogitsHead.

Parameters:

outputs (ModelOutput | Mapping | Tuple[Tensor, ...], required): The outputs of the model.

attention_mask (Tensor | None, default None): The attention mask for the inputs.

input_ids (NestedTensor | Tensor | None, default None): The input ids for the inputs.

labels (Tensor | None, default None): The labels for the head.

output_name (str | None, default None): The name of the output to use. Defaults to self.output_name.

Source code in multimolecule/module/heads/contact.py Python
def forward(  # type: ignore[override]  # pylint: disable=arguments-renamed\n    self,\n    outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n    attention_mask: Tensor | None = None,\n    input_ids: NestedTensor | Tensor | None = None,\n    labels: Tensor | None = None,\n    output_name: str | None = None,\n    **kwargs,\n) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the ContactLogitsHead.\n\n    Args:\n        outputs: The outputs of the model.\n        attention_mask: The attention mask for the inputs.\n        input_ids: The input ids for the inputs.\n        labels: The labels for the head.\n        output_name: The name of the output to use.\n            Defaults to `self.output_name`.\n    \"\"\"\n    if isinstance(outputs, (Mapping, ModelOutput)):\n        output = outputs[output_name or self.output_name]\n    elif isinstance(outputs, tuple):\n        output = outputs[0]\n    else:\n        raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n    if attention_mask is None:\n        attention_mask = self._get_attention_mask(input_ids)\n    output = output * attention_mask.unsqueeze(-1)\n    output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n    # make symmetric contact map\n    contact_map = output.unsqueeze(1) * output.unsqueeze(2)\n\n    return super().forward(contact_map, labels, **kwargs)\n
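The pairwise product at the end of `forward` can be illustrated without a tensor library. The sketch below is a plain-Python analogue for a single sequence with no batch axis; `pairwise_product` is a hypothetical helper written for this illustration and is not part of MultiMolecule.

```python
def pairwise_product(embeddings):
    """Return an (L, L, H) nested list with entry [i][j][h] = e[i][h] * e[j][h].

    `embeddings` is a list of L vectors of size H (one sequence, no batch axis).
    """
    return [
        [[a * b for a, b in zip(row_i, row_j)] for row_j in embeddings]
        for row_i in embeddings
    ]


# Three residues with hidden size 2.
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
contact_map = pairwise_product(embeddings)

# Swapping the two residue indices gives the same vector, so the map is symmetric.
assert contact_map[0][2] == contact_map[2][0]
```

Because multiplication is commutative, the resulting map is symmetric in the two residue indices by construction, which is what the `# make symmetric contact map` comment refers to.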
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(outputs)","title":"outputs","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(attention_mask)","title":"attention_mask","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(input_ids)","title":"input_ids","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.symmetrize","title":"symmetrize","text":"Python
symmetrize(x)\n

Make layer symmetric in final two dimensions, used for contact prediction.

Source code in multimolecule/module/heads/contact.py Python
def symmetrize(x):\n    \"Make layer symmetric in final two dimensions, used for contact prediction.\"\n    return x + x.transpose(-1, -2)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.average_product_correct","title":"average_product_correct","text":"Python
average_product_correct(x)\n

Perform average product correction (APC), used for contact prediction.

Source code in multimolecule/module/heads/contact.py Python
def average_product_correct(x):\n    \"Perform average product correction (APC), used for contact prediction.\"\n    a1 = x.sum(-1, keepdims=True)\n    a2 = x.sum(-2, keepdims=True)\n    a12 = x.sum((-1, -2), keepdims=True)\n\n    avg = a1 * a2\n    avg.div_(a12)  # in-place to reduce memory\n    normalized = x - avg\n    return normalized\n
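To make the arithmetic of `symmetrize` and `average_product_correct` concrete, here is a minimal sketch on a 2x2 matrix using plain Python lists instead of tensors. The function names match the ones above, but these list-based versions are illustrative re-implementations, not the library code.

```python
def symmetrize(x):
    """Return x + x^T for a square matrix given as a list of rows."""
    n = len(x)
    return [[x[i][j] + x[j][i] for j in range(n)] for i in range(n)]


def average_product_correct(x):
    """Subtract (row sum * column sum / total sum) from every entry."""
    row_sums = [sum(row) for row in x]
    col_sums = [sum(col) for col in zip(*x)]
    total = sum(row_sums)
    return [
        [x[i][j] - row_sums[i] * col_sums[j] / total for j in range(len(x[i]))]
        for i in range(len(x))
    ]


scores = [[0.0, 1.0], [3.0, 0.0]]
sym = symmetrize(scores)                  # [[0.0, 4.0], [4.0, 0.0]]
corrected = average_product_correct(sym)  # [[-2.0, 2.0], [2.0, -2.0]]
```

After the correction, every row and every column sums to zero, since the subtracted term removes the background expected from the row and column totals.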
"},{"location":"module/heads/#multimolecule.module.heads.pretrain","title":"multimolecule.module.heads.pretrain","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead","title":"MaskedLMHead","text":"

Bases: Module

Head for masked language modeling.

Parameters:

config (PreTrainedConfig, required): The configuration object for the model.

head_config (MaskedLMHeadConfig | None, default None): The configuration object for the head. If None, will use configuration from the config.

Source code in multimolecule/module/heads/pretrain.py Python
@HeadRegistry.register(\"masked_lm\")\nclass MaskedLMHead(nn.Module):\n    r\"\"\"\n    Head for masked language modeling.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    output_name: str = \"last_hidden_state\"\n    r\"\"\"The default output to use for the head.\"\"\"\n\n    def __init__(\n        self, config: PreTrainedConfig, weight: Tensor | None = None, head_config: MaskedLMHeadConfig | None = None\n    ):\n        super().__init__()\n        if head_config is None:\n            head_config = (config.lm_head if hasattr(config, \"lm_head\") else config.head) or MaskedLMHeadConfig()\n        self.config: MaskedLMHeadConfig = head_config\n        if self.config.hidden_size is None:\n            self.config.hidden_size = config.hidden_size\n        self.num_labels = config.vocab_size\n        self.dropout = nn.Dropout(self.config.dropout)\n        self.transform = HeadTransformRegistryHF.build(self.config)\n        self.decoder = nn.Linear(self.config.hidden_size, self.num_labels, bias=False)\n        if weight is not None:\n            self.decoder.weight = weight\n        if self.config.bias:\n            self.bias = nn.Parameter(torch.zeros(self.num_labels))\n            self.decoder.bias = self.bias\n        self.activation = ACT2FN[self.config.act] if self.config.act is not None else None\n        if head_config is not None and head_config.output_name is not None:\n            self.output_name = head_config.output_name\n\n    def forward(\n        self, outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None\n    ) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the MaskedLMHead.\n\n        Args:\n            outputs: The outputs of the model.\n            labels: The labels for the head.\n            output_name: The name of the output to 
use.\n                Defaults to `self.output_name`.\n        \"\"\"\n        if isinstance(outputs, (Mapping, ModelOutput)):\n            output = outputs[output_name or self.output_name]\n        elif isinstance(outputs, tuple):\n            output = outputs[0]\n        else:\n            raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n        output = self.dropout(output)\n        output = self.transform(output)\n        output = self.decoder(output)\n        if self.activation is not None:\n            output = self.activation(output)\n        if labels is not None:\n            if isinstance(labels, NestedTensor):\n                if isinstance(output, Tensor):\n                    output = labels.nested_like(output, strict=False)\n                return HeadOutput(output, F.cross_entropy(output.concat, labels.concat))\n            return HeadOutput(output, F.cross_entropy(output.view(-1, self.num_labels), labels.view(-1)))\n        return HeadOutput(output)\n
"},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.output_name","title":"output_name class-attribute instance-attribute","text":"Python
output_name: str = 'last_hidden_state'\n

The default output to use for the head.

"},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward","title":"forward","text":"Python
forward(outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None) -> HeadOutput\n

Forward pass of the MaskedLMHead.

Parameters:

outputs (ModelOutput | Tuple[Tensor, ...], required): The outputs of the model.

labels (Tensor | None, default None): The labels for the head.

output_name (str | None, default None): The name of the output to use. Defaults to self.output_name.

Source code in multimolecule/module/heads/pretrain.py Python
def forward(\n    self, outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None\n) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the MaskedLMHead.\n\n    Args:\n        outputs: The outputs of the model.\n        labels: The labels for the head.\n        output_name: The name of the output to use.\n            Defaults to `self.output_name`.\n    \"\"\"\n    if isinstance(outputs, (Mapping, ModelOutput)):\n        output = outputs[output_name or self.output_name]\n    elif isinstance(outputs, tuple):\n        output = outputs[0]\n    else:\n        raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n    output = self.dropout(output)\n    output = self.transform(output)\n    output = self.decoder(output)\n    if self.activation is not None:\n        output = self.activation(output)\n    if labels is not None:\n        if isinstance(labels, NestedTensor):\n            if isinstance(output, Tensor):\n                output = labels.nested_like(output, strict=False)\n            return HeadOutput(output, F.cross_entropy(output.concat, labels.concat))\n        return HeadOutput(output, F.cross_entropy(output.view(-1, self.num_labels), labels.view(-1)))\n    return HeadOutput(output)\n
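For dense labels, `forward` flattens the logits to shape `(batch * length, vocab)` and the labels to `(batch * length,)` before computing cross-entropy. The sketch below reproduces that flattening and the mean negative log-likelihood with plain Python lists. The `cross_entropy` helper is a simplified stand-in written for this illustration; it does not handle ignored label indices or class weights.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-likelihood for `logits` (N x V) and integer `labels` (N)."""
    losses = []
    for row, label in zip(logits, labels):
        log_norm = math.log(sum(math.exp(v) for v in row))  # log of the softmax denominator
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)


# One sequence, two positions, vocabulary of two tokens, uniform logits.
logits = [[[0.0, 0.0], [0.0, 0.0]]]
labels = [[1, 0]]

# Flatten (B, L, V) -> (B * L, V) and (B, L) -> (B * L,), as `view` does in `forward`.
flat_logits = [row for seq in logits for row in seq]
flat_labels = [label for seq in labels for label in seq]
loss = cross_entropy(flat_logits, flat_labels)
```

With uniform logits over two tokens, every position contributes `log(2)`, so the mean loss is `log(2)` as well.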
"},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward(outputs)","title":"outputs","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward(output_name)","title":"output_name","text":""},{"location":"module/heads/#multimolecule.module.heads.generic","title":"multimolecule.module.heads.generic","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead","title":"PredictionHead","text":"

Bases: Module

Head for tasks at all levels.

Parameters:

config (PreTrainedConfig, required): The configuration object for the model.

head_config (HeadConfig | None, default None): The configuration object for the head. If None, will use configuration from the config.

Source code in multimolecule/module/heads/generic.py Python
class PredictionHead(nn.Module):\n    r\"\"\"\n    Head for all-level of tasks.\n\n    Args:\n        config: The configuration object for the model.\n        head_config: The configuration object for the head.\n            If None, will use configuration from the `config`.\n    \"\"\"\n\n    num_labels: int\n    requires_attention: bool = False\n\n    def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n        super().__init__()\n        if head_config is None:\n            head_config = config.head or HeadConfig(num_labels=config.num_labels)\n        elif head_config.num_labels is None:\n            head_config.num_labels = config.num_labels\n        self.config = head_config\n        if self.config.hidden_size is None:\n            self.config.hidden_size = config.hidden_size\n        if self.config.problem_type is None:\n            self.config.problem_type = config.problem_type\n        self.bos_token_id = config.bos_token_id\n        self.eos_token_id = config.eos_token_id\n        self.pad_token_id = config.pad_token_id\n        self.num_labels = self.config.num_labels  # type: ignore[assignment]\n        self.dropout = nn.Dropout(self.config.dropout)\n        self.transform = HeadTransformRegistryHF.build(self.config)\n        self.decoder = nn.Linear(self.config.hidden_size, self.num_labels, bias=self.config.bias)\n        self.activation = ACT2FN[self.config.act] if self.config.act is not None else None\n        self.criterion = CriterionRegistry.build(self.config)\n\n    def forward(self, embeddings: Tensor, labels: Tensor | None, **kwargs) -> HeadOutput:\n        r\"\"\"\n        Forward pass of the PredictionHead.\n\n        Args:\n            embeddings: The embeddings to be passed through the head.\n            labels: The labels for the head.\n        \"\"\"\n        if kwargs:\n            warn(\n                f\"The following arguments are not applicable to {self.__class__.__name__}\"\n                f\"and 
will be ignored: {kwargs.keys()}\"\n            )\n        output = self.dropout(embeddings)\n        output = self.transform(output)\n        output = self.decoder(output)\n        if self.activation is not None:\n            output = self.activation(output)\n        if labels is not None:\n            if isinstance(labels, NestedTensor):\n                if isinstance(output, Tensor):\n                    output = labels.nested_like(output, strict=False)\n                return HeadOutput(output, self.criterion(output.concat, labels.concat))\n            return HeadOutput(output, self.criterion(output, labels))\n        return HeadOutput(output)\n\n    def _get_attention_mask(self, input_ids: NestedTensor | Tensor) -> Tensor:\n        if isinstance(input_ids, NestedTensor):\n            return input_ids.mask\n        if input_ids is None:\n            raise ValueError(\n                f\"Either attention_mask or input_ids must be provided for {self.__class__.__name__} to work.\"\n            )\n        if self.pad_token_id is None:\n            raise ValueError(\n                f\"pad_token_id must be provided when attention_mask is not passed to {self.__class__.__name__}.\"\n            )\n        return input_ids.ne(self.pad_token_id)\n\n    def _remove_special_tokens(\n        self, output: Tensor, attention_mask: Tensor, input_ids: Tensor | None\n    ) -> Tuple[Tensor, Tensor, Tensor]:\n        # remove cls token embeddings\n        if self.bos_token_id is not None:\n            output = output[..., 1:, :]\n            attention_mask = attention_mask[..., 1:]\n            if input_ids is not None:\n                input_ids = input_ids[..., 1:]\n        # remove eos token embeddings\n        if self.eos_token_id is not None:\n            if input_ids is not None:\n                eos_mask = input_ids.ne(self.eos_token_id).to(output)\n                input_ids = input_ids[..., :-1]\n            else:\n                last_valid_indices = 
attention_mask.sum(dim=-1)\n                seq_length = attention_mask.size(-1)\n                eos_mask = torch.arange(seq_length, device=output.device) == last_valid_indices.unsqueeze(1)\n            output = output * eos_mask[:, :, None]\n            output = output[..., :-1, :]\n            attention_mask = attention_mask[..., 1:]\n        return output, attention_mask, input_ids\n
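The mask handling in `_get_attention_mask` and `_remove_special_tokens` is easier to follow on a single right-padded sequence. The sketch below is a simplified analogue using plain lists, with one number per token standing in for an embedding vector. The function `strip_special_tokens` is hypothetical, and the default ids (`<pad>` = 0, `<cls>` = 1, `<eos>` = 2) are an assumption taken from the tokenizer doctests in this documentation.

```python
def strip_special_tokens(values, input_ids, pad_id=0, bos_id=1, eos_id=2):
    """Single-sequence analogue of `_remove_special_tokens` for right-padded input.

    `bos_id` is listed for completeness; like the original, the sketch drops the
    first position unconditionally once a BOS id is configured.
    """
    # Attention mask from padding: True for real tokens, False for padding.
    mask = [tok != pad_id for tok in input_ids]
    # Drop the leading BOS/CLS position.
    values, input_ids, mask = values[1:], input_ids[1:], mask[1:]
    # Zero the EOS position, then drop the last position to shorten the sequence.
    values = [v if tok != eos_id else 0.0 for v, tok in zip(values, input_ids)]
    values = values[:-1]
    # Dropping the *first* mask entry removes one leading True, so the number of
    # True entries matches the number of remaining real tokens.
    mask = mask[1:]
    return values, mask


ids = [1, 6, 7, 8, 2, 0, 0]                      # <cls> A C G <eos> <pad> <pad>
vals = [0.1, 0.2, 0.3, 0.4, 0.5, 0.0, 0.0]       # padded positions already zeroed
stripped, mask = strip_special_tokens(vals, ids)
```

Here `stripped` is `[0.2, 0.3, 0.4, 0.0, 0.0]` and `mask` is `[True, True, True, False, False]`: three real tokens remain, and the former EOS slot is zeroed like padding. The trick of slicing the mask with `[1:]` a second time relies on the padding being on the right.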
"},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead(config)","title":"config","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead(head_config)","title":"head_config","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead.forward","title":"forward","text":"Python
forward(embeddings: Tensor, labels: Tensor | None, **kwargs) -> HeadOutput\n

Forward pass of the PredictionHead.

Parameters:

embeddings (Tensor, required): The embeddings to be passed through the head.

labels (Tensor | None, required): The labels for the head.

Source code in multimolecule/module/heads/generic.py Python
def forward(self, embeddings: Tensor, labels: Tensor | None, **kwargs) -> HeadOutput:\n    r\"\"\"\n    Forward pass of the PredictionHead.\n\n    Args:\n        embeddings: The embeddings to be passed through the head.\n        labels: The labels for the head.\n    \"\"\"\n    if kwargs:\n        warn(\n            f\"The following arguments are not applicable to {self.__class__.__name__} \"\n            f\"and will be ignored: {kwargs.keys()}\"\n        )\n    output = self.dropout(embeddings)\n    output = self.transform(output)\n    output = self.decoder(output)\n    if self.activation is not None:\n        output = self.activation(output)\n    if labels is not None:\n        if isinstance(labels, NestedTensor):\n            if isinstance(output, Tensor):\n                output = labels.nested_like(output, strict=False)\n            return HeadOutput(output, self.criterion(output.concat, labels.concat))\n        return HeadOutput(output, self.criterion(output, labels))\n    return HeadOutput(output)\n
"},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead.forward(embeddings)","title":"embeddings","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead.forward(labels)","title":"labels","text":""},{"location":"module/heads/#multimolecule.module.heads.output","title":"multimolecule.module.heads.output","text":""},{"location":"module/heads/#multimolecule.module.heads.output.HeadOutput","title":"HeadOutput dataclass","text":"

Bases: ModelOutput

Output of a prediction head.

Parameters:

logits (FloatTensor, required): The prediction logits from the head.

loss (FloatTensor | None, default None): The loss from the head. Defaults to None.

Source code in multimolecule/module/heads/output.py Python
@dataclass\nclass HeadOutput(ModelOutput):\n    r\"\"\"\n    Output of a prediction head.\n\n    Args:\n        logits: The prediction logits from the head.\n        loss: The loss from the head.\n            Defaults to None.\n    \"\"\"\n\n    logits: FloatTensor\n    loss: FloatTensor | None = None\n
"},{"location":"module/heads/#multimolecule.module.heads.output.HeadOutput(logits)","title":"logits","text":""},{"location":"module/heads/#multimolecule.module.heads.output.HeadOutput(loss)","title":"loss","text":""},{"location":"tokenisers/","title":"tokenisers","text":"

tokenisers provides a collection of predefined tokenizers.

A tokenizer is a class that converts a sequence of nucleotides or amino acids into a sequence of indices. It is used to pre-process the input sequence before feeding it into a model.

Please refer to Tokenizer for more details.

"},{"location":"tokenisers/#available-tokenizers","title":"Available Tokenizers","text":""},{"location":"tokenisers/dna/","title":"DnaTokenizer","text":"

DnaTokenizer tokenizes raw DNA nucleotides regardless of whether the input is uppercase or lowercase, uses T (Thymine) or U (Uracil), or includes special tokens. It also supports tokenization into nmers and codons, so you don\u2019t have to write complex code to preprocess your data.

By default, DnaTokenizer uses the standard alphabet. If nmers is greater than 1, or codon is set to True, it will instead use the streamline alphabet.

MultiMolecule provides a set of predefined alphabets for tokenization.
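The splitting behaviour described above can be sketched in plain Python. The function below follows the steps of `DnaTokenizer._tokenize` (uppercasing, replacing U with T, then codon, nmer, or single-character splitting) but returns string tokens only. `tokenize_dna` is a hypothetical stand-alone helper, and the mapping from tokens to ids, which depends on the chosen alphabet, is not reproduced here.

```python
def tokenize_dna(text, nmers=1, codon=False, replace_u_with_t=True, do_upper_case=True):
    """Split a DNA string into tokens, mirroring the steps of `DnaTokenizer._tokenize`."""
    if do_upper_case:
        text = text.upper()
    if replace_u_with_t:
        text = text.replace("U", "T")
    if codon:
        if len(text) % 3 != 0:
            raise ValueError(
                f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
            )
        # Codons are non-overlapping windows of three characters.
        return [text[i:i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # nmers are overlapping windows that slide by one character.
        return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


single = tokenize_dna("acgu")                    # ['A', 'C', 'G', 'T']
kmers = tokenize_dna("tataaagta", nmers=3)       # 7 overlapping 3-mers
codons = tokenize_dna("tataaagta", codon=True)   # ['TAT', 'AAA', 'GTA']
```

The token counts agree with the doctests: the 9-character input yields 7 overlapping 3-mers but only 3 codons, which is why the id lists there have 7 and 3 entries between the `<cls>` and `<eos>` ids.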

"},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer","title":"multimolecule.tokenisers.DnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for DNA sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default None): alphabet to use for tokenization.

nmers (int, default 1): Size of kmer to tokenize.

codon (bool, default False): Whether to tokenize into codons.

replace_U_with_T (bool, default True): Whether to replace U with T.

do_upper_case (bool, default True): Whether to convert input to uppercase.

Examples:

Python Console Session
>>> from multimolecule import DnaTokenizer\n>>> tokenizer = DnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHV.X*-')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = DnaTokenizer(replace_U_with_T=False)\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = DnaTokenizer(nmers=3)\n>>> tokenizer('tataaagta')[\"input_ids\"]\n[1, 84, 21, 81, 6, 8, 19, 71, 2]\n>>> tokenizer = DnaTokenizer(codon=True)\n>>> tokenizer('tataaagta')[\"input_ids\"]\n[1, 84, 6, 71, 2]\n>>> tokenizer('tataaagtaa')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/dna/tokenization_dna.py Python
class DnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for DNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `iupac`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_U_with_T: Whether to replace U with T.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import DnaTokenizer\n        >>> tokenizer = DnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHV.X*-')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = DnaTokenizer(replace_U_with_T=False)\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = DnaTokenizer(nmers=3)\n        >>> tokenizer('tataaagta')[\"input_ids\"]\n        [1, 84, 21, 81, 6, 8, 19, 71, 2]\n        >>> tokenizer = DnaTokenizer(codon=True)\n        >>> tokenizer('tataaagta')[\"input_ids\"]\n        [1, 84, 6, 71, 2]\n        >>> tokenizer('tataaagtaa')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n        
nmers: int = 1,\n        codon: bool = False,\n        replace_U_with_T: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_U_with_T=replace_U_with_T,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_U_with_T = replace_U_with_T\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_U_with_T:\n            text = text.replace(\"U\", \"T\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(nmers)","title":"nmers","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(codon)","title":"codon","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(replace_U_with_T)","title":"replace_U_with_T","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"tokenisers/dna/#standard-alphabet","title":"Standard Alphabet","text":"

The standard alphabet is an extended version of the IUPAC alphabet. It adds two symbols to the IUPAC alphabet, X and *.

gap

Note that we use . to represent a gap in the sequence.

While - exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.

Code Represents A Adenine C Cytosine G Guanine T Thymine N Unknown R A or G Y C or T S C or G W A or T K G or T M A or C B C, G, or T D A, G, or T H A, C, or T V A, C, or G . Gap X Any * Not Used - Not Used"},{"location":"tokenisers/dna/#iupac-alphabet","title":"IUPAC Alphabet","text":"

IUPAC nucleotide code is a standard nucleotide code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent DNA sequences.

In addition to the symbols of the streamline alphabet, it includes 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap.

Code Represents A Adenine C Cytosine G Guanine T Thymine R A or G Y C or T S C or G W A or T K G or T M A or C B C, G, or T D A, G, or T H A, C, or T V A, C, or G N A, C, G, or T . Gap

Note that we use . to represent a gap in the sequence.
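The ambiguity codes in the table can be expressed as a lookup from each code to the bases it stands for. The sketch below encodes that table and expands an ambiguous sequence into every concrete sequence it could represent. `IUPAC_DNA` and `expand` are names chosen for this illustration and are not MultiMolecule APIs; the gap symbol `.` is deliberately left out because it does not stand for a base.

```python
from itertools import product

# Each IUPAC code maps to the set of bases it can represent (from the table above).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def expand(sequence):
    """List every concrete A/C/G/T sequence consistent with an IUPAC-coded sequence."""
    choices = [IUPAC_DNA[code] for code in sequence.upper()]
    return ["".join(bases) for bases in product(*choices)]


candidates = expand("RY")  # ['AC', 'AT', 'GC', 'GT']
```

The number of expansions grows multiplicatively with each ambiguous position, so this kind of enumeration is only practical for short sequences.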

"},{"location":"tokenisers/dna/#streamline-alphabet","title":"Streamline Alphabet","text":"

The streamline alphabet adds one symbol to the nucleobase alphabet: N, which represents an unknown nucleobase.

Code Nucleotide A Adenine C Cytosine G Guanine T Thymine N Unknown"},{"location":"tokenisers/dna/#nucleobase-alphabet","title":"Nucleobase Alphabet","text":"

The nucleobase alphabet is a minimal version of the DNA alphabet that includes only the four canonical nucleotides A, C, G, and T.

Code Nucleotide A Adenine C Cytosine G Guanine T Thymine"},{"location":"tokenisers/dot_bracket/","title":"DotBracketTokenizer","text":"

DotBracketTokenizer provides a simple way to tokenize secondary structure in dot-bracket notation. It also supports tokenization into nmers and codons, so you don\u2019t have to write complex code to preprocess your data.

By default, DotBracketTokenizer uses the standard alphabet. If nmers is greater than 1, or codon is set to True, it will instead use the streamline alphabet.

MultiMolecule provides a set of predefined alphabets for tokenization.

"},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer","title":"multimolecule.tokenisers.DotBracketTokenizer","text":"

Bases: Tokenizer

Tokenizer for Secondary Structure sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default None): alphabet to use for tokenization.

nmers (int, default 1): Size of kmer to tokenize.

codon (bool, default False): Whether to tokenize into codons.

Examples:

Python Console Session
>>> from multimolecule import DotBracketTokenizer\n>>> tokenizer = DotBracketTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]\n>>> tokenizer('(.)')[\"input_ids\"]\n[1, 7, 6, 8, 2]\n>>> tokenizer('+(.)')[\"input_ids\"]\n[1, 9, 7, 6, 8, 2]\n>>> tokenizer = DotBracketTokenizer(nmers=3)\n>>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n[1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]\n>>> tokenizer = DotBracketTokenizer(codon=True)\n>>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n[1, 27, 29, 6, 6, 6, 16, 48, 2]\n>>> tokenizer('(((((+...........)))))')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22\n
Source code in multimolecule/tokenisers/dot_bracket/tokenization_db.py Python
class DotBracketTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for Secondary Structure sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard Secondary Structure alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `iupac`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n\n    Examples:\n        >>> from multimolecule import DotBracketTokenizer\n        >>> tokenizer = DotBracketTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]\n        >>> tokenizer('(.)')[\"input_ids\"]\n        [1, 7, 6, 8, 2]\n        >>> tokenizer('+(.)')[\"input_ids\"]\n        [1, 9, 7, 6, 8, 2]\n        >>> tokenizer = DotBracketTokenizer(nmers=3)\n        >>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n        [1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]\n        >>> tokenizer = DotBracketTokenizer(codon=True)\n        >>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n        [1, 27, 29, 6, 6, 6, 16, 48, 2]\n        >>> tokenizer('(((((+...........)))))')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n        nmers: int = 1,\n        codon: bool = False,\n        additional_special_tokens: List | Tuple | 
None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer(alphabet)","title":"alphabet","text":""},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer(nmers)","title":"nmers","text":""},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer(codon)","title":"codon","text":""},{"location":"tokenisers/dot_bracket/#standard-alphabet","title":"Standard Alphabet","text":"

The standard alphabet is an extended version of the Extended Dot-Bracket Notation. This extension includes most symbols from the WUSS notation for better compatibility with existing tools.

Code Represents . unpaired ( internal helices of all terminal stems ) internal helices of all terminal stems + nick between strand , unpaired in multibranch loops [ internal helices that includes at least one annotated () stem ] internal helices that includes at least one annotated () stem { all internal helices of deeper multifurcations } all internal helices of deeper multifurcations | mostly paired < simple terminal stems > simple terminal stems - bulges and interior loops _ unpaired : single stranded in the exterior loop ~ local structural alignment left regions of target and query unaligned $ Not Used @ Not Used ^ Not Used % Not Used * Not Used"},{"location":"tokenisers/dot_bracket/#extended-alphabet","title":"Extended Alphabet","text":"

Extended Dot-Bracket Notation is a generalized version of the original Dot-Bracket notation. It may use additional pairs of brackets to annotate pseudo-knots, since different pairs of brackets are not required to be nested.
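
The nesting rule above can be illustrated with a small sketch. It is not part of MultiMolecule: the helper name `dot_bracket_pairs` is hypothetical, and it assumes that `()`, `[]`, `{}`, and `<>` each mark base pairs while every other symbol is unpaired. Keeping one stack per bracket type is what lets different bracket types cross each other.

```python
def dot_bracket_pairs(structure):
    # Hypothetical helper, not a MultiMolecule API.
    # Maps every closing symbol to its opening symbol.
    closers = {')': '(', ']': '[', '}': '{', '>': '<'}
    # One stack per bracket type, so different types may cross (pseudo-knots).
    stacks = {opener: [] for opener in closers.values()}
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(index)
        elif symbol in closers:
            stack = stacks[closers[symbol]]
            if not stack:
                raise ValueError(f'unmatched {symbol!r} at position {index}')
            pairs.append((stack.pop(), index))
        # Any other symbol (for example . or +) forms no pair.
    leftover = sorted(i for stack in stacks.values() for i in stack)
    if leftover:
        raise ValueError(f'unmatched opening brackets at positions {leftover}')
    return sorted(pairs)


# The [ ] helix crosses the ( ) helix, which plain Dot-Bracket cannot express.
print(dot_bracket_pairs('((..[[..))..]]'))
# [(0, 9), (1, 8), (4, 13), (5, 12)]
```

Within a single bracket type the pairs must still nest, which is why each type only needs a plain stack.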

Code Represents . unpaired ( internal helices of all terminal stems ) internal helices of all terminal stems + nick between strand , unpaired in multibranch loops [ internal helices that includes at least one annotated () stem ] internal helices that includes at least one annotated () stem { all internal helices of deeper multifurcations } all internal helices of deeper multifurcations | mostly paired < simple terminal stems > simple terminal stems

Note that we use . to represent an unpaired position in the structure.

"},{"location":"tokenisers/dot_bracket/#streamline-alphabet","title":"Streamline Alphabet","text":"

The streamline alphabet is a minimal subset of the extended alphabet that keeps only four symbols: ., (, ), and +.

Code Represents . unpaired ( internal helices of all terminal stems ) internal helices of all terminal stems + nick between strand"},{"location":"tokenisers/protein/","title":"ProteinTokenizer","text":"

ProteinTokenizer tokenizes raw amino acid sequences into tokens, whether the input is uppercase or lowercase and with or without special tokens.

By default, ProteinTokenizer uses the standard alphabet.

MultiMolecule provides a set of predefined alphabets for tokenization.
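
As a rough illustration of what this means for the resulting ids, the sketch below rebuilds the mapping that the documented examples imply: the six special tokens take ids 0 to 5 and the standard alphabet follows in order. The names `SPECIAL_TOKENS`, `PROTEIN_ALPHABET`, `VOCAB`, and `encode` are assumptions made for this sketch, not MultiMolecule APIs, and unlike the real tokenizer it does not parse special tokens embedded in the input.

```python
# Assumed ordering, inferred from the documented input_ids examples.
SPECIAL_TOKENS = ['<pad>', '<cls>', '<eos>', '<unk>', '<mask>', '<null>']
PROTEIN_ALPHABET = list('ACDEFGHIKLMNPQRSTVWYXZBJUO.*-')
VOCAB = {token: index for index, token in enumerate(SPECIAL_TOKENS + PROTEIN_ALPHABET)}


def encode(sequence, do_upper_case=True):
    # Hypothetical helper: upper-case, split into residues, look up ids.
    if do_upper_case:
        sequence = sequence.upper()
    ids = [VOCAB.get(residue, VOCAB['<unk>']) for residue in sequence]
    # Wrap the sequence in <cls> ... <eos>, as the documented outputs do.
    return [VOCAB['<cls>']] + ids + [VOCAB['<eos>']]


print(encode('manlgcwmlv'))
# [1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]
```

With `do_upper_case=False`, lowercase residues are not in the vocabulary and fall back to the `<unk>` id in this sketch.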

"},{"location":"tokenisers/protein/#multimolecule.tokenisers.ProteinTokenizer","title":"multimolecule.tokenisers.ProteinTokenizer","text":"

Bases: Tokenizer

Tokenizer for Protein sequences.

Parameters:

Name Type Description Default Alphabet | str | List[str] | None

alphabet to use for tokenization.

None bool

Whether to convert input to uppercase.

True

Examples:

Python Console Session
>>> from multimolecule import ProteinTokenizer\n>>> tokenizer = ProteinTokenizer()\n>>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')[\"input_ids\"]\n[1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]\n>>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]\n>>> tokenizer('manlgcwmlv')[\"input_ids\"]\n[1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]\n
Source code in multimolecule/tokenisers/protein/tokenization_protein.py Python
class ProteinTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for Protein sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `iupac`\n                + `streamline`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import ProteinTokenizer\n        >>> tokenizer = ProteinTokenizer()\n        >>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]\n        >>> tokenizer('manlgcwmlv')[\"input_ids\"]\n        [1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet)\n        super().__init__(\n            alphabet=alphabet,\n            additional_special_tokens=additional_special_tokens,\n            do_upper_case=do_upper_case,\n            **kwargs,\n        )\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        return list(text)\n
"},{"location":"tokenisers/protein/#multimolecule.tokenisers.ProteinTokenizer(alphabet)","title":"alphabet","text":""},{"location":"tokenisers/protein/#multimolecule.tokenisers.ProteinTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"tokenisers/protein/#standard-alphabet","title":"Standard Alphabet","text":"

The standard alphabet is an extended version of the IUPAC alphabet. It adds six symbols to the IUPAC alphabet: J, U, O, ., -, and *.

Amino Acid Code Three letter Code Amino Acid A Ala Alanine C Cys Cysteine D Asp Aspartic Acid E Glu Glutamic Acid F Phe Phenylalanine G Gly Glycine H His Histidine I Ile Isoleucine K Lys Lysine L Leu Leucine M Met Methionine N Asn Asparagine P Pro Proline Q Gln Glutamine R Arg Arginine S Ser Serine T Thr Threonine V Val Valine W Trp Tryptophan Y Tyr Tyrosine X Xaa Any amino acid Z Glx Glutamine (Q) or Glutamic acid (E) B Asx Aspartic acid (D) or Asparagine (N) J Xle Leucine (L) or Isoleucine (I) U Sec Selenocysteine O Pyl Pyrrolysine . \u2026 Not Used * *** Not Used - \u2014 Not Used"},{"location":"tokenisers/protein/#iupac-alphabet","title":"IUPAC Alphabet","text":"

IUPAC amino acid code is a standard amino acid code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent Protein sequences.

The IUPAC amino acid code adds three symbols to the streamline alphabet: B, Z, and X.

Amino Acid Code Three letter Code Amino Acid A Ala Alanine B Asx Aspartic acid (D) or Asparagine (N) C Cys Cysteine D Asp Aspartic Acid E Glu Glutamic Acid F Phe Phenylalanine G Gly Glycine H His Histidine I Ile Isoleucine K Lys Lysine L Leu Leucine M Met Methionine N Asn Asparagine P Pro Proline Q Gln Glutamine R Arg Arginine S Ser Serine T Thr Threonine V Val Valine W Trp Tryptophan Y Tyr Tyrosine X Xaa Any amino acid Z Glx Glutamine (Q) or Glutamic acid (E)"},{"location":"tokenisers/protein/#streamline-alphabet","title":"Streamline Alphabet","text":"

The streamline alphabet is a simplified version of the standard alphabet.

Amino Acid Code Three letter Code Amino Acid A Ala Alanine C Cys Cysteine D Asp Aspartic Acid E Glu Glutamic Acid F Phe Phenylalanine G Gly Glycine H His Histidine I Ile Isoleucine K Lys Lysine L Leu Leucine M Met Methionine N Asn Asparagine P Pro Proline Q Gln Glutamine R Arg Arginine S Ser Serine T Thr Threonine V Val Valine W Trp Tryptophan Y Tyr Tyrosine X Xaa Any amino acid"},{"location":"tokenisers/rna/","title":"RnaTokenizer","text":"

RnaTokenizer tokenizes raw RNA nucleotides into tokens, whether the input is uppercase or lowercase, uses U (Uracil) or T (Thymine), and comes with or without special tokens. It also supports tokenization into nmers and codons, so you don\u2019t have to write complex code to preprocess your data.

By default, RnaTokenizer uses the standard alphabet. If nmers is greater than 1, or codon is set to True, it will instead use the streamline alphabet.

MultiMolecule provides a set of predefined alphabets for tokenization.
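
The difference between nmers and codon tokenization is the stride of the window. The sketch below mirrors the slicing in the tokenizer source shown on this page, using standalone helper names (`normalise`, `kmer_tokens`, `codon_tokens`) that are assumptions for illustration only.

```python
def normalise(sequence):
    # Upper-case the input and replace T with U, as the defaults do.
    return sequence.upper().replace('T', 'U')


def kmer_tokens(sequence, k):
    # Overlapping windows with stride 1 (the nmers behaviour).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def codon_tokens(sequence):
    # Non-overlapping windows with stride 3 (the codon behaviour).
    if len(sequence) % 3 != 0:
        raise ValueError(f'length must be a multiple of 3, but got {len(sequence)}')
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


sequence = normalise('uagcuuauc')
print(kmer_tokens(sequence, 3))
# ['UAG', 'AGC', 'GCU', 'CUU', 'UUA', 'UAU', 'AUC']
print(codon_tokens(sequence))
# ['UAG', 'CUU', 'AUC']
```

A sequence of length n therefore yields n - k + 1 nmer tokens but only n / 3 codon tokens, which matches the seven and three ids in the examples on this page.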

"},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer","title":"multimolecule.tokenisers.RnaTokenizer","text":"

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

Name Type Description Default Alphabet | str | List[str] | None

alphabet to use for tokenization.

None int

Size of kmer to tokenize.

1 bool

Whether to tokenize into codons.

False bool

Whether to replace T with U.

True bool

Whether to convert input to uppercase.

True

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py Python
class RnaTokenizer(Tokenizer):\n    \"\"\"\n    Tokenizer for RNA sequences.\n\n    Args:\n        alphabet: alphabet to use for tokenization.\n\n            - If is `None`, the standard RNA alphabet will be used.\n            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n                + `standard`\n                + `extended`\n                + `streamline`\n                + `nucleobase`\n            - If is an alphabet or a list of characters, that specific alphabet will be used.\n        nmers: Size of kmer to tokenize.\n        codon: Whether to tokenize into codons.\n        replace_T_with_U: Whether to replace T with U.\n        do_upper_case: Whether to convert input to uppercase.\n\n    Examples:\n        >>> from multimolecule import RnaTokenizer\n        >>> tokenizer = RnaTokenizer()\n        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n        >>> tokenizer('acgu')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 9, 2]\n        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n        >>> tokenizer('acgt')[\"input_ids\"]\n        [1, 6, 7, 8, 3, 2]\n        >>> tokenizer = RnaTokenizer(nmers=3)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 17, 64, 49, 96, 84, 22, 2]\n        >>> tokenizer = RnaTokenizer(codon=True)\n        >>> tokenizer('uagcuuauc')[\"input_ids\"]\n        [1, 83, 49, 22, 2]\n        >>> tokenizer('uagcuuauca')[\"input_ids\"]\n        Traceback (most recent call last):\n        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n    \"\"\"\n\n    model_input_names = [\"input_ids\", \"attention_mask\"]\n\n    def __init__(\n        self,\n        alphabet: Alphabet | str | List[str] | None = None,\n   
     nmers: int = 1,\n        codon: bool = False,\n        replace_T_with_U: bool = True,\n        do_upper_case: bool = True,\n        additional_special_tokens: List | Tuple | None = None,\n        **kwargs,\n    ):\n        if codon and (nmers > 1 and nmers != 3):\n            raise ValueError(\"Codon and nmers cannot be used together.\")\n        if codon:\n            nmers = 3  # set to 3 to get correct vocab\n        if not isinstance(alphabet, Alphabet):\n            alphabet = get_alphabet(alphabet, nmers=nmers)\n        super().__init__(\n            alphabet=alphabet,\n            nmers=nmers,\n            codon=codon,\n            replace_T_with_U=replace_T_with_U,\n            do_upper_case=do_upper_case,\n            additional_special_tokens=additional_special_tokens,\n            **kwargs,\n        )\n        self.replace_T_with_U = replace_T_with_U\n        self.nmers = nmers\n        self.codon = codon\n\n    def _tokenize(self, text: str, **kwargs):\n        if self.do_upper_case:\n            text = text.upper()\n        if self.replace_T_with_U:\n            text = text.replace(\"T\", \"U\")\n        if self.codon:\n            if len(text) % 3 != 0:\n                raise ValueError(\n                    f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n                )\n            return [text[i : i + 3] for i in range(0, len(text), 3)]\n        if self.nmers > 1:\n            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203\n        return list(text)\n
"},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(alphabet)","title":"alphabet","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(nmers)","title":"nmers","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(codon)","title":"codon","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(do_upper_case)","title":"do_upper_case","text":""},{"location":"tokenisers/rna/#standard-alphabet","title":"Standard Alphabet","text":"

The standard alphabet is an extended version of the IUPAC alphabet. It adds three symbols to the IUPAC alphabet: I, X, and *.

gap

Note that we use . to represent a gap in the sequence.

While - exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.

Code Represents A Adenine C Cytosine G Guanine U Uracil N Unknown R A or G Y C or U S C or G W A or U K G or U M A or C B C, G, or U D A, G, or U H A, C, or U V A, C, or G . Gap X Any * Not Used - Not Used I Inosine"},{"location":"tokenisers/rna/#iupac-alphabet","title":"IUPAC Alphabet","text":"

IUPAC nucleotide code is a standard nucleotide code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent RNA sequences.

It consists of 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap in addition to the streamline alphabet.

Code Represents A Adenine C Cytosine G Guanine U Uracil R A or G Y C or U S G or C W A or U K G or U M A or C B C, G, or U D A, G, or U H A, C, or U V A, C, or G N A, C, G, or U . Gap

Note that we use . to represent a gap in the sequence.
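
To make the ambiguity codes concrete, the following sketch expands an ambiguous sequence into every concrete sequence it can stand for. The mapping is copied from the table above; the names `IUPAC_RNA` and `expand` are hypothetical, and the gap symbol is deliberately left out, so a sequence containing . raises a `KeyError` here.

```python
from itertools import product

# Ambiguity codes from the IUPAC table above (gap symbol omitted).
IUPAC_RNA = {
    'A': 'A', 'C': 'C', 'G': 'G', 'U': 'U',
    'R': 'AG', 'Y': 'CU', 'S': 'CG', 'W': 'AU', 'K': 'GU', 'M': 'AC',
    'B': 'CGU', 'D': 'AGU', 'H': 'ACU', 'V': 'ACG', 'N': 'ACGU',
}


def expand(sequence):
    # Hypothetical helper: enumerate all concrete sequences for the input.
    choices = [IUPAC_RNA[code] for code in sequence.upper()]
    return [''.join(combo) for combo in product(*choices)]


print(expand('ARY'))
# ['AAC', 'AAU', 'AGC', 'AGU']
```

The number of expansions grows multiplicatively with each ambiguous position, so this is only practical for short sequences.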

"},{"location":"tokenisers/rna/#streamline-alphabet","title":"Streamline Alphabet","text":"

The streamline alphabet adds one symbol to the nucleobase alphabet: N, which represents an unknown nucleobase.

Code Nucleotide A Adenine C Cytosine G Guanine U Uracil N Unknown"},{"location":"tokenisers/rna/#nucleobase-alphabet","title":"Nucleobase Alphabet","text":"

The nucleobase alphabet is a minimal version of the RNA alphabet that includes only the four canonical nucleotides A, C, G, and U.

Code Nucleotide A Adenine C Cytosine G Guanine U Uracil"},{"location":"zh/","title":"MultiMolecule","text":"

Accelerate Molecular Biology Research with Machine Learning

"},{"location":"zh/#_1","title":"Introduction","text":"

Welcome to MultiMolecule (\u6d66\u539f), a foundational library designed to accelerate scientific research in molecular biology through machine learning. MultiMolecule provides a comprehensive yet flexible set of tools that help researchers make use of AI with ease, focusing primarily on biomolecular data (RNA, DNA, and protein).

"},{"location":"zh/#_2","title":"Overview","text":"

MultiMolecule is designed with flexibility and ease of use at its core. Its modular design lets you use only the components you need and integrates seamlessly into existing workflows without adding unnecessary complexity.

"},{"location":"zh/#_3","title":"Installation","text":"

Install the most recent stable version from PyPI:

Bash
pip install multimolecule\n

Install the latest version from source:

Bash
pip install git+https://github.com/DLS5-Omics/MultiMolecule\n
"},{"location":"zh/#_4","title":"Citation","text":"

If you use MultiMolecule in your research, please cite us as follows:

BibTeX
@software{chen_2024_12638419,\n  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},\n  title     = {MultiMolecule},\n  doi       = {10.5281/zenodo.12638419},\n  publisher = {Zenodo},\n  url       = {https://doi.org/10.5281/zenodo.12638419},\n  year      = 2024,\n  month     = may,\n  day       = 4\n}\n
"},{"location":"zh/#_5","title":"License","text":"

We believe openness is the foundation of research.

MultiMolecule is licensed under the GNU Affero General Public License.

Please join us in building an open research community.

SPDX-License-Identifier: AGPL-3.0-or-later

"},{"location":"zh/about/","title":"About","text":"

Developed by DanLing on Earth

We are a community of developers, designers, and others committed to making deep learning technology more open.

We are a community of individuals committed to pushing the boundaries of what is possible with deep learning.

We are passionate about deep learning and the people who use it.

We are DanLing.

"},{"location":"zh/about/license-faq/","title":"License FAQ","text":"

Translation

This document is a translation provided for users' convenience. We have made every effort to ensure its accuracy. Please note, however, that the translation may contain errors and is for reference only. The English original prevails.

For compliance and enforcement purposes, any inaccuracies or ambiguities in the translated document are not binding and have no legal effect.

"},{"location":"zh/about/license-faq/#_1","title":"License FAQ","text":"

This License FAQ explains the terms and conditions under which you may use the data, models, code, configuration, documentation, and weights provided by the DanLing Team (also known as DanLing) (\u201cwe\u201d or \u201cour\u201d). It serves as an addendum to our License.

"},{"location":"zh/about/license-faq/#0","title":"0. Summary of Key Points","text":"

This summary covers the key points of the FAQ. For more detail on any topic, follow the link after each key point or use the table of contents to find the section you are looking for.

What constitutes the \u201csource code\u201d in MultiMolecule?

We consider everything in our repositories to be source code, including data, models, code, configuration, and documentation.

What constitutes the \u201csource code\u201d in MultiMolecule?

Can I publish research papers using MultiMolecule?

It depends.

You may publish research papers in fully open access journals and conferences or on preprint servers, under the terms of the License.

To publish research papers in closed access journals and conferences, you must obtain a separate license from us.

Can I publish research papers using MultiMolecule?

Can I use MultiMolecule for commercial purposes?

Yes, you may use MultiMolecule for commercial purposes under the terms of the License.

Can I use MultiMolecule for commercial purposes?

Do people affiliated with certain organizations have specific license terms?

Yes, people affiliated with certain organizations have specific license terms.

Do people affiliated with certain organizations have specific license terms?

"},{"location":"zh/about/license-faq/#1-multimolecule","title":"1. What constitutes the \u201csource code\u201d in MultiMolecule?","text":"

We consider everything in our repositories to be source code.

The training of a machine learning model is regarded as analogous to the compilation of traditional software. Accordingly, the model, code, configuration, documentation, and the data used for training are all considered part of the source code, while the trained model weights are considered part of the object code.

We also treat research papers and manuscripts as a special form of documentation, which is likewise part of the source code.

"},{"location":"zh/about/license-faq/#2-multimolecule","title":"2. Can I publish research papers using MultiMolecule?","text":"

Because research papers are considered a form of source code, a publisher that publishes a paper using MultiMolecule would have to open-source all material on its servers to comply with the License. For most publishers, this is impractical.

As a special exemption under Section 7 of the License, we permit research papers that use MultiMolecule to be published in fully open access journals, conferences, or preprint servers that charge authors no fees, provided that all published manuscripts are made available under the GNU Free Documentation License (GFDL), a Creative Commons license, or an OSI-approved license that permits sharing of manuscripts.

As a special exemption under Section 7 of the License, we permit research papers that use MultiMolecule to be published in certain non-profit journals, conferences, or preprint servers. Currently, the permitted non-profit journals, conferences, and preprint servers include:

To publish a paper in a closed access journal or conference, you must obtain a separate license from us. This typically involves co-authorship, a fee to support the project, or both. Contact us at multimolecule@zyc.ai for more information.

Although it is not mandatory, we recommend citing the MultiMolecule project in your research papers.

"},{"location":"zh/about/license-faq/#3-multimolecule","title":"3. Can I use MultiMolecule for commercial purposes?","text":"

Yes, you may use MultiMolecule for commercial purposes under the License. However, you must open-source any modifications to the source code and make them available under the License.

\u200b\u5982\u679c\u200b\u60a8\u200b\u5e0c\u671b\u200b\u5728\u200b\u4e0d\u200b\u5f00\u6e90\u200b\u4fee\u6539\u200b\u5185\u5bb9\u200b\u7684\u200b\u60c5\u51b5\u200b\u4e0b\u200b\u5c06\u200b MultiMolecule \u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\uff0c\u200b\u5219\u200b\u5fc5\u987b\u200b\u4ece\u200b\u6211\u4eec\u200b\u8fd9\u91cc\u200b\u83b7\u5f97\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u3002\u200b\u8fd9\u200b\u901a\u5e38\u200b\u6d89\u53ca\u200b\u652f\u6301\u200b\u9879\u76ee\u200b\u7684\u200b\u8d39\u7528\u200b\u3002\u200b\u8bf7\u200b\u901a\u8fc7\u200b multimolecule@zyc.ai \u200b\u4e0e\u200b\u6211\u4eec\u200b\u8054\u7cfb\u200b\u4ee5\u200b\u83b7\u53d6\u200b\u66f4\u200b\u591a\u200b\u8be6\u7ec6\u4fe1\u606f\u200b\u3002

"},{"location":"zh/about/license-faq/#4","title":"4. \u200b\u4e0e\u200b\u67d0\u4e9b\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\u7684\u200b\u4eba\u200b\u662f\u5426\u200b\u6709\u200b\u7279\u5b9a\u200b\u7684\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\uff1f","text":"

Yes!

If you are affiliated with an organization that has a separate license agreement with us, you may be subject to different license terms. Please consult your organization's legal department to determine whether you are subject to a separate license agreement.

Members of the following organizations automatically receive a non-transferable, non-sublicensable, and non-distributable MIT License to use MultiMolecule:

This special license is considered an additional term under Section 7 of the License. It is not redistributable, and you are prohibited from creating any independent derivative works. Any modification or derivative work based on this license is automatically considered a derivative work of MultiMolecule and must comply with all terms of the License. This ensures that third parties cannot bypass the license terms or create separate licenses from derivative works.

"},{"location":"zh/about/license-faq/#5-agpl-multimolecule","title":"5. How can I use MultiMolecule if my organization forbids the use of code under the AGPL License?","text":"

Some organizations (such as Google) have policies that forbid the use of code under the AGPL License.

If you are affiliated with an organization that forbids the use of AGPL-licensed code, you must obtain a separate license from us. Contact us at multimolecule@zyc.ai for more details.

"},{"location":"zh/about/license-faq/#6-multimolecule","title":"6. Can I use MultiMolecule if I am a federal employee of the United States Government?","text":"

No.

Under 17 U.S. Code \u00a7 105, code written by employees of the United States federal government is not protected by copyright.

Therefore, employees of the United States federal government cannot comply with the terms of the License.

"},{"location":"zh/about/license-faq/#7","title":"7. Will we update this FAQ?","text":"

In short

Yes, we will update this FAQ as needed to stay consistent with relevant laws.

We may update this license FAQ from time to time. The updated version will be indicated by an updated \u201cLast Revised Time\u201d at the bottom of this page. If we make any material changes, we will notify you by posting the new license FAQ on this page. Because we do not collect any contact information from you, we cannot notify you directly. We encourage you to review this license FAQ frequently to stay informed of how you can use our data, models, code, configuration, documentation, and weights.

"},{"location":"zh/data/","title":"data","text":"

data provides a collection of utilities for handling data.

Although datasets is a powerful library for managing datasets, it is a general-purpose tool and may not cover all the specific functionality that scientific applications need.

The data package is designed to complement datasets by offering data processing utilities that are commonly needed in scientific tasks.

"},{"location":"zh/data/#_1","title":"Usage","text":""},{"location":"zh/data/#_2","title":"Load from a local data file","text":"Python
from multimolecule.data import Dataset\n\ndata = Dataset(\"data/rna/5utr.csv\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"zh/data/#datasets","title":"Load from datasets","text":"Python
from multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"zh/datasets/","title":"datasets","text":"

datasets provides a collection of widely used datasets.

"},{"location":"zh/datasets/#_1","title":"Available datasets","text":""},{"location":"zh/datasets/#dna","title":"Deoxyribonucleic acid (DNA)","text":""},{"location":"zh/datasets/#rna","title":"Ribonucleic acid (RNA)","text":""},{"location":"zh/datasets/#_2","title":"Usage","text":""},{"location":"zh/datasets/#multimolecule","title":"Load with MultiMolecule","text":"Python
from multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"zh/models/","title":"models","text":"

models provides a collection of pre-trained models.

"},{"location":"zh/models/#_1","title":"Model classes","text":"

In the transformers library, the names of model classes can sometimes be misleading. Although these classes support both regression and classification tasks, their names usually contain xxxForSequenceClassification, which may suggest that they can only be used for classification.

To avoid this ambiguity, MultiMolecule provides a set of model classes whose names are clear and intuitive and reflect their intended use:

Each model supports both regression and classification tasks, offering flexibility and precision for a wide range of applications.

"},{"location":"zh/models/#_2","title":"Contact prediction","text":"

Contact prediction assigns a label to every pair of tokens in a sequence. One of the most common contact prediction tasks is protein distance map prediction, which tries to find the distance between every possible pair of amino acid residues in a three-dimensional protein structure.
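
As a rough illustration of what a distance map is (not of how any MultiMolecule model predicts one), the sketch below builds the pairwise distance matrix from known 3D coordinates and thresholds it into per-pair labels. The function names, the toy coordinates, and the 8.0 cutoff are assumptions made for this example only.

```python
import math


def distance_map(coords):
    # Symmetric matrix of Euclidean distances between every pair of residues.
    n = len(coords)
    dmap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            dmap[i][j] = dmap[j][i] = d
    return dmap


def contact_map(coords, threshold=8.0):
    # One binary label per pair: 1 if the two residues are closer than the threshold.
    # The threshold value is an illustrative assumption.
    return [[int(d < threshold) for d in row] for row in distance_map(coords)]


# Toy coordinates for three residues (hypothetical data).
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
dmap = distance_map(coords)
cmap = contact_map(coords)
```

A contact prediction model would be trained to output a matrix with this shape, one value per token pair, directly from the sequence.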

"},{"location":"zh/models/#_3","title":"Nucleotide prediction","text":"

Similar to Token Classification, except that the <bos> and <eos> tokens are removed if they are defined in the model configuration.
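
The removal of special-token positions can be pictured with the minimal sketch below. It is a plain-Python illustration, not the library's implementation: strip_special_positions and its has_bos / has_eos flags are hypothetical names, and padding is ignored.

```python
def strip_special_positions(token_outputs, has_bos=True, has_eos=True):
    # Drop the first position if a <bos> token was prepended and the last
    # position if an <eos> token was appended, keeping one value per nucleotide.
    start = 1 if has_bos else 0
    end = len(token_outputs) - 1 if has_eos else len(token_outputs)
    return token_outputs[start:end]


# Per-token values for the tokens <cls> U A G <eos> (toy numbers).
per_token = [0.0, 0.1, 0.2, 0.3, 0.0]
per_nucleotide = strip_special_positions(per_token)
```

After stripping, the output has exactly one entry per nucleotide in the input sequence, which is what distinguishes this task from ordinary token classification.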

<bos> and <eos> tokens

In the tokenizers provided by MultiMolecule, the <bos> token points to the <cls> token, and the <sep> token points to the <eos> token.

"},{"location":"zh/models/#_4","title":"Usage","text":""},{"location":"zh/models/#multimoleculeautomodel","title":"Build with multimolecule.AutoModel","text":"Python
from transformers import AutoTokenizer\n\nfrom multimolecule import AutoModelForSequencePrediction\n\nmodel = AutoModelForSequencePrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"zh/models/#_5","title":"Direct access","text":"

All models can be loaded directly with the from_pretrained method.

Python
from multimolecule.models import RnaFmForTokenPrediction, RnaTokenizer\n\nmodel = RnaFmForTokenPrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"zh/models/#transformersautomodel","title":"Build with transformers.AutoModel","text":"

Although we use a different naming convention for model classes, the models are still registered with the corresponding transformers.AutoModel.

Python
from transformers import AutoModelForSequenceClassification, AutoTokenizer\n\nimport multimolecule  # noqa: F401\n\nmodel = AutoModelForSequenceClassification.from_pretrained(\"multimolecule/mrnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/mrnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n

import multimolecule before use

Note that you must import multimolecule before building a model with transformers.AutoModel. Model registration happens in the multimolecule package; the models are not available in the transformers package itself.

If you use transformers.AutoModel without first importing multimolecule, the following error is raised:

Python
ValueError: The checkpoint you are trying to load has model type `rnafm` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.\n
"},{"location":"zh/models/#_6","title":"Initialize a vanilla model","text":"

You can also initialize a base model directly from a model class.

Python
from multimolecule.models import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n\nconfig = RnaFmConfig()\nmodel = RnaFmForTokenPrediction(config)\ntokenizer = RnaTokenizer()\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"zh/models/#_7","title":"Available models","text":""},{"location":"zh/models/#dna","title":"Deoxyribonucleic acid (DNA)","text":""},{"location":"zh/models/#rna","title":"Ribonucleic acid (RNA)","text":""},{"location":"zh/module/","title":"module","text":"

module provides a collection of predefined modules for users to implement their own architectures.

MultiMolecule is built on top of the ecosystem and embraces a similar design philosophy: Don't Repeat Yourself. We follow a single-model-file policy, in which each model in the models package contains one and only one modeling.py file that describes the network design.

The module package is intended to provide simple, reusable modules that stay consistent across multiple models. This approach minimizes code duplication and promotes clean, maintainable code.

"},{"location":"zh/module/#_1","title":"Key features","text":""},{"location":"zh/module/#modules","title":"Modules","text":""},{"location":"zh/module/embeddings/","title":"embeddings","text":"

embeddings provides a collection of predefined positional encodings.

"},{"location":"zh/module/heads/","title":"heads","text":"

heads provides a collection of model prediction heads for handling different tasks.

heads accept a ModelOutput, dict, or tuple as input. Each head automatically finds the model output required for its prediction and processes it accordingly.

Some prediction heads, such as ContactPredictionHead, may require additional information such as attention_mask or input_ids. These extra inputs can be passed as positional or keyword arguments.

Note that heads use the same ModelOutput convention as Transformers. If the model output is a tuple, the first element is treated as pooler_output, the second as last_hidden_state, and the last as attention_map. Users are responsible for making sure the model output is formatted correctly.

If the model output is a ModelOutput or a dict, heads look up HeadConfig.output_name in the model output. You can set output_name in HeadConfig so that heads can locate the required tensor.
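
The lookup rules described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: select_head_input is a hypothetical helper, strings stand in for tensors, and only the tuple and dict cases are shown.

```python
# Tuple positions as described in the text: first, second, and last element.
TUPLE_POSITIONS = {'pooler_output': 0, 'last_hidden_state': 1, 'attention_map': -1}


def select_head_input(outputs, output_name):
    # Tuples are resolved by position; mappings are resolved by key,
    # where output_name plays the role of HeadConfig.output_name.
    if isinstance(outputs, tuple):
        return outputs[TUPLE_POSITIONS[output_name]]
    return outputs[output_name]


# Strings stand in for tensors in this toy example.
as_tuple = ('pooled', 'hidden', 'attention')
as_dict = {'pooler_output': 'pooled', 'last_hidden_state': 'hidden'}

from_tuple = select_head_input(as_tuple, 'last_hidden_state')
from_dict = select_head_input(as_dict, 'pooler_output')
```

The positional rule is why the tuple format has to match the documented order exactly: a head has no other way to tell the elements apart.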

"},{"location":"zh/tokenisers/","title":"tokenisers","text":"

tokenisers provides a collection of predefined tokenizers.

A tokenizer is a class that converts a nucleotide or amino acid sequence into a sequence of indices. It is used to preprocess input sequences before they are fed to the model.
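
To make the sequence-to-indices idea concrete, here is a toy sketch. The vocabulary, its index values, and the encode function are invented for this example and do not match the vocabulary of RnaTokenizer or any other MultiMolecule tokenizer.

```python
# Hypothetical vocabulary; real tokenizers define their own tokens and indices.
VOCAB = {'<cls>': 0, '<eos>': 1, '<unk>': 2, 'A': 3, 'C': 4, 'G': 5, 'U': 6}


def encode(sequence, vocab=VOCAB):
    # Map each nucleotide to its index, wrap the result in <cls> ... <eos>,
    # and fall back to <unk> for characters outside the vocabulary.
    ids = [vocab['<cls>']]
    ids += [vocab.get(ch, vocab['<unk>']) for ch in sequence.upper()]
    ids.append(vocab['<eos>'])
    return ids


input_ids = encode('UAGC')
```

The resulting list of integers is what a model consumes as input_ids, usually after conversion to a tensor.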

See Tokenizer for more details.

"},{"location":"zh/tokenisers/#_1","title":"Available tokenizers","text":""}]} \ No newline at end of file diff --git a/zh/index.html b/zh/index.html index 825f7379..2b5c5e0c 100644 --- a/zh/index.html +++ b/zh/index.html @@ -1006,19 +1006,6 @@

Citation day = 4 } -
-

Caution

-

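The MultiMolecule project is licensed under the GNU Affero General Public License. -Research papers are considered derivative works and must therefore be licensed under the same terms.

-

You may publish research papers that use MultiMolecule only in fully open-access journals, conferences, or preprint servers that are free to publish in and free to read. -You must obtain an exemption from the authors to publish research papers that use MultiMolecule in closed-access or author-fee journals, conferences, or preprint servers.

-

You may receive an automatic exemption if you submit to one of the following non-profit journals:

- -

See the License FAQ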

-

License

We believe openness is the foundation of research.

MultiMolecule is licensed under the GNU Affero General Public License.