diff --git a/feed_rss_created.xml b/feed_rss_created.xml
index e37e7579..1621df43 100644
--- a/feed_rss_created.xml
+++ b/feed_rss_created.xml
@@ -1 +1 @@
-
We believe openness is the Foundation of Research.
-MultiMolecule is licensed under the GNU Affero General Public License.
+MultiMolecule is licensed under the GNU Affero General Public License.
Please join us in building an open research community.
SPDX-License-Identifier: AGPL-3.0-or-later
Accelerate Molecular Biology Research with Machine Learning
"},{"location":"#introduction","title":"Introduction","text":"
Welcome to MultiMolecule (浦原), a foundational library designed to accelerate scientific research in molecular biology through machine learning. MultiMolecule provides a comprehensive yet flexible set of tools for researchers who want to apply AI to biomolecular data (RNA, DNA, and protein) with ease.
"},{"location":"#overview","title":"Overview","text":"MultiMolecule is built with flexibility and ease of use in mind. Its modular design allows you to utilize only the components you need, integrating seamlessly into your existing workflows without adding unnecessary complexity.
- data: Smart Dataset that automatically infers tasks, including their level (sequence, token, contact) and type (classification, regression), and provides multi-task datasets and samplers to facilitate multitask learning without additional configuration (see the sketch after the installation commands below).
- datasets: A collection of widely used biomolecular datasets.
- module: Modular neural network building blocks, including embeddings, heads, and criterions, for constructing custom models.
- models: Implementations of state-of-the-art pre-trained models in molecular biology.
- tokenisers: Tokenizers to convert DNA, RNA, protein, and other sequences to one-hot encodings.

Install the most recent stable version from PyPI:
Bash
pip install multimolecule
Install the latest version from source:
Bash
pip install git+https://github.com/DLS5-Omics/MultiMolecule
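The Smart Dataset described in the overview can be exercised in a few lines. The following is a minimal sketch, not part of the official quick start: it reuses the multimolecule/rna tokenizer tag and the bpRNA-spot dataset tag that appear later in these docs and assumes both are available on the Hugging Face Hub.
Python
# Load a dataset from the Hugging Face Hub; task level and type are inferred
# automatically and exposed through the `tasks` attribute of the Dataset.
from multimolecule.data import Dataset

data = Dataset("multimolecule/bprna-spot", split="train", pretrained="multimolecule/rna")
print(data.tasks)  # per-label-column task level (sequence/token/contact) and type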
"},{"location":"#citation","title":"Citation","text":"If you use MultiMolecule in your research, please cite us as follows:
BibTeX
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
"},{"location":"#license","title":"License","text":"We believe openness is the Foundation of Research.
MultiMolecule is licensed under the GNU Affero General Public License.
Please join us in building an open research community.
SPDX-License-Identifier: AGPL-3.0-or-later
Developed by DanLing on Earth
We are a community of developers, designers, and others from around the world who are working together to make deep learning more accessible.
We are a community of individuals who seek to push the boundaries of what is possible with deep learning.
We are passionate about Deep Learning and the people who use it.
We are DanLing.
"},{"location":"about/license-faq/","title":"License FAQ","text":"This License FAQ explains the terms and conditions under which you may use the data, models, code, configuration, documentation, and weights provided by the DanLing Team (also known as DanLing) (\u2018we\u2019, \u2018us\u2019, or \u2018our\u2019). It serves as an addendum to our License.
"},{"location":"about/license-faq/#0-summary-of-key-points","title":"0. Summary of Key Points","text":"This summary provides key points from our license, but you can find out more details about any of these topics by clicking the link following each key point and by reading the full license.
What constitutes the \u2018source code\u2019 in MultiMolecule?
We consider everything in our repositories to be source code, including data, models, code, configuration, and documentation.
Can I publish research papers using MultiMolecule?
It depends.
You can publish research papers in fully open access journals and conferences, or on preprint servers, following the terms of the License.
You must obtain a separate license from us to publish research papers in closed access journals and conferences.
Can I use MultiMolecule for commercial purposes?
Yes, you can use MultiMolecule for commercial purposes under the terms of the License.
Do people affiliated with certain organizations have specific license terms?
Yes, people affiliated with certain organizations have specific license terms.
"},{"location":"about/license-faq/#1-what-constitutes-the-source-code-in-multimolecule","title":"1. What constitutes the \u201csource code\u201d in MultiMolecule?","text":"We consider everything in our repositories to be source code.
The training process of machine learning models is viewed similarly to the compilation process of traditional software. As such, the model, code, configuration, documentation, and data used for training are all part of the source code, while the trained model weights are part of the object code.
We also consider research papers and manuscripts to be a special form of documentation, and therefore part of the source code.
"},{"location":"about/license-faq/#2-can-i-publish-research-papers-using-multimolecule","title":"2. Can I publish research papers using MultiMolecule?","text":"Since research papers are considered a form of source code, publishers are legally required to open-source all materials on their server to comply with the License if they publish papers using MultiMolecule. This is generally impractical for most publishers.
As a special exemption under section 7 of the License, we grant permission to publish research papers using MultiMolecule in fully open access journals, conferences, or preprint servers that do not charge any fee from authors, provided all published manuscripts are made available under the GNU Free Documentation License (GFDL), or a Creative Commons license, or an OSI-approved license that permits the sharing of manuscripts.
As a special exemption under section 7 of the License, we grant permission to publish research papers using MultiMolecule in certain non-profit journals, conferences, or preprint servers. Currently, the non-profit journals, conferences, or preprint servers we allow include:
For publishing in closed access journals or conferences, you must obtain a separate license from us. This typically involves co-authorship, a fee to support the project, or both. Contact us at multimolecule@zyc.ai for more information.
While not mandatory, we recommend citing the MultiMolecule project in your research papers.
"},{"location":"about/license-faq/#3-can-i-use-multimolecule-for-commercial-purposes","title":"3. Can I use MultiMolecule for commercial purposes?","text":"Yes, MultiMolecule can be used for commercial purposes under the License. However, you must open-source any modifications to the source code and make them available under the License.
If you prefer to use MultiMolecule for commercial purposes without open-sourcing your modifications, you must obtain a separate license from us. This typically involves a fee to support the project. Contact us at multimolecule@zyc.ai for further details.
"},{"location":"about/license-faq/#4-do-people-affiliated-with-certain-organizations-have-specific-license-terms","title":"4. Do people affiliated with certain organizations have specific license terms?","text":"YES!
If you are affiliated with an organization that has a separate license agreement with us, you may be subject to different license terms. Please consult your organization\u2019s legal department to determine if you are subject to a separate license agreement.
Members of the following organizations automatically receive a non-transferable, non-sublicensable, and non-distributable MIT License to use MultiMolecule:
This special license is considered an additional term under section 7 of the License. It is not redistributable, and you are prohibited from creating any independent derivative works. Any modifications or derivative works based on this license are automatically considered derivative works of MultiMolecule and must comply with all the terms of the License. This ensures that third parties cannot bypass the license terms or create separate licenses from derivative works.
"},{"location":"about/license-faq/#5-how-can-i-use-multimolecule-if-my-organization-forbids-the-use-of-code-under-the-agpl-license","title":"5. How can I use MultiMolecule if my organization forbids the use of code under the AGPL License?","text":"Some organizations, such as Google, have policies that prohibit the use of code under the AGPL License.
If you are affiliated with an organization that forbids the use of AGPL-licensed code, you must obtain a separate license from us. Contact us at multimolecule@zyc.ai for more information.
"},{"location":"about/license-faq/#6-can-i-use-multimolecule-if-i-am-a-federal-employee-of-the-united-states-government","title":"6. Can I use MultiMolecule if I am a federal employee of the United States Government?","text":"No.
Code written by federal employees of the United States Government is not protected by copyright under 17 U.S. Code \u00a7 105.
As a result, federal employees of the United States Government cannot comply with the terms of the License.
"},{"location":"about/license-faq/#7-do-we-make-updates-to-this-faq","title":"7. Do we make updates to this FAQ?","text":"In Short
Yes, we will update this FAQ as necessary to stay compliant with relevant laws.
We may update this license FAQ from time to time. The updated version will be indicated by an updated \u2018Last Revised Time\u2019 at the bottom of this license FAQ. If we make any material changes, we will notify you by posting the new license FAQ on this page. We are unable to notify you directly as we do not collect any contact information from you. We encourage you to review this license FAQ frequently to stay informed of how you can use our data, models, code, configuration, documentation, and weights.
"},{"location":"about/license/","title":"GNU AFFERO GENERAL PUBLIC LICENSE","text":"Version 3, 19 November 2007
Copyright (C) 2007 Free Software Foundation, Inc. https://fsf.org/
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
"},{"location":"about/license/#preamble","title":"Preamble","text":"The GNU Affero General Public License is a free, copyleft license for software and other kinds of works, specifically designed to ensure cooperation with the community in the case of network server software.
The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, our General Public Licenses are intended to guarantee your freedom to share and change all versions of a program\u2013to make sure it remains free software for all its users.
When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.
Developers that use our General Public Licenses protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License which gives you legal permission to copy, distribute and/or modify the software.
A secondary benefit of defending all users\u2019 freedom is that improvements made in alternate versions of the program, if they receive widespread use, become available for other developers to incorporate. Many developers of free software are heartened and encouraged by the resulting cooperation. However, in the case of software used on network servers, this result may fail to come about. The GNU General Public License permits making a modified version and letting the public access it on a server without ever releasing its source code to the public.
The GNU Affero General Public License is designed specifically to ensure that, in such cases, the modified source code becomes available to the community. It requires the operator of a network server to provide the source code of the modified version running there to the users of that server. Therefore, public use of a modified version, on a publicly accessible server, gives the public access to the source code of the modified version.
An older license, called the Affero General Public License and published by Affero, was designed to accomplish similar goals. This is a different license, not a version of the Affero GPL, but Affero has released a new version of the Affero GPL which permits relicensing under this license.
The precise terms and conditions for copying, distribution and modification follow.
"},{"location":"about/license/#terms-and-conditions","title":"TERMS AND CONDITIONS","text":""},{"location":"about/license/#0-definitions","title":"0. Definitions.","text":"\u201cThis License\u201d refers to version 3 of the GNU Affero General Public License.
\u201cCopyright\u201d also means copyright-like laws that apply to other kinds of works, such as semiconductor masks.
\u201cThe Program\u201d refers to any copyrightable work licensed under this License. Each licensee is addressed as \u201cyou\u201d. \u201cLicensees\u201d and \u201crecipients\u201d may be individuals or organizations.
To \u201cmodify\u201d a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a \u201cmodified version\u201d of the earlier work or a work \u201cbased on\u201d the earlier work.
A \u201ccovered work\u201d means either the unmodified Program or a work based on the Program.
To \u201cpropagate\u201d a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well.
To \u201cconvey\u201d a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays \u201cAppropriate Legal Notices\u201d to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion.
"},{"location":"about/license/#1-source-code","title":"1. Source Code.","text":"The \u201csource code\u201d for a work means the preferred form of the work for making modifications to it. \u201cObject code\u201d means any non-source form of a work.
A \u201cStandard Interface\u201d means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language.
The \u201cSystem Libraries\u201d of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A \u201cMajor Component\u201d, in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it.
The \u201cCorresponding Source\u201d for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work\u2019s System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work.
The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source.
The Corresponding Source for a work in source code form is that same work.
"},{"location":"about/license/#2-basic-permissions","title":"2. Basic Permissions.","text":"All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary.
"},{"location":"about/license/#3-protecting-users-legal-rights-from-anti-circumvention-law","title":"3. Protecting Users\u2019 Legal Rights From Anti-Circumvention Law.","text":"No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures.
When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work\u2019s users, your or third parties\u2019 legal rights to forbid circumvention of technological measures.
"},{"location":"about/license/#4-conveying-verbatim-copies","title":"4. Conveying Verbatim Copies.","text":"You may convey verbatim copies of the Program\u2019s source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee.
"},{"location":"about/license/#5-conveying-modified-source-versions","title":"5. Conveying Modified Source Versions.","text":"You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:
A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an \u201caggregate\u201d if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation\u2019s users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate.
"},{"location":"about/license/#6-conveying-non-source-forms","title":"6. Conveying Non-Source Forms.","text":"You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways:
A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work.
A \u201cUser Product\u201d is either (1) a \u201cconsumer product\u201d, which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, \u201cnormally used\u201d refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product.
\u201cInstallation Information\u201d for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made.
If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM).
The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying.
"},{"location":"about/license/#7-additional-terms","title":"7. Additional Terms.","text":"\u201cAdditional permissions\u201d are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms:
All other non-permissive additional terms are considered \u201cfurther restrictions\u201d within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way.
"},{"location":"about/license/#8-termination","title":"8. Termination.","text":"You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11).
However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice.
Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10.
"},{"location":"about/license/#9-acceptance-not-required-for-having-copies","title":"9. Acceptance Not Required for Having Copies.","text":"You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so.
"},{"location":"about/license/#10-automatic-licensing-of-downstream-recipients","title":"10. Automatic Licensing of Downstream Recipients.","text":"Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License.
An \u201centity transaction\u201d is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party\u2019s predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it.
"},{"location":"about/license/#11-patents","title":"11. Patents.","text":"A \u201ccontributor\u201d is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor\u2019s \u201ccontributor version\u201d.
A contributor\u2019s \u201cessential patent claims\u201d are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, \u201ccontrol\u201d includes the right to grant patent sublicenses in a manner consistent with the requirements of this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor\u2019s essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version.
In the following three paragraphs, a \u201cpatent license\u201d is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To \u201cgrant\u201d such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party.
If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. \u201cKnowingly relying\u201d means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient\u2019s use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it.
A patent license is \u201cdiscriminatory\u201d if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law.
"},{"location":"about/license/#12-no-surrender-of-others-freedom","title":"12. No Surrender of Others\u2019 Freedom.","text":"If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program.
"},{"location":"about/license/#13-remote-network-interaction-use-with-the-gnu-general-public-license","title":"13. Remote Network Interaction; Use with the GNU General Public License.","text":"Notwithstanding any other provision of this License, if you modify the Program, your modified version must prominently offer all users interacting with it remotely through a computer network (if your version supports such interaction) an opportunity to receive the Corresponding Source of your version by providing access to the Corresponding Source from a network server at no charge, through some standard or customary means of facilitating copying of software. This Corresponding Source shall include the Corresponding Source for any work covered by version 3 of the GNU General Public License that is incorporated pursuant to the following paragraph.
Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the work with which it is combined will remain governed by version 3 of the GNU General Public License.
"},{"location":"about/license/#14-revised-versions-of-this-license","title":"14. Revised Versions of this License.","text":"The Free Software Foundation may publish revised and/or new versions of the GNU Affero General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.
Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU Affero General Public License \u201cor any later version\u201d applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU Affero General Public License, you may choose any version ever published by the Free Software Foundation.
If the Program specifies that a proxy can decide which future versions of the GNU Affero General Public License can be used, that proxy\u2019s public statement of acceptance of a version permanently authorizes you to choose that version for the Program.
Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version.
"},{"location":"about/license/#15-disclaimer-of-warranty","title":"15. Disclaimer of Warranty.","text":"THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM \u201cAS IS\u201d WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
"},{"location":"about/license/#16-limitation-of-liability","title":"16. Limitation of Liability.","text":"IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
"},{"location":"about/license/#17-interpretation-of-sections-15-and-16","title":"17. Interpretation of Sections 15 and 16.","text":"If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
"},{"location":"about/license/#how-to-apply-these-terms-to-your-new-programs","title":"How to Apply These Terms to Your New Programs","text":"If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the \u201ccopyright\u201d line and a pointer to where the full notice is found.
Text Only
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper mail.
If your software can interact with users remotely through a computer network, you should also make sure that it provides a way for users to get its source. For example, if your program is a web application, its interface could display a \u201cSource\u201d link that leads users to an archive of the code. There are many ways you could offer source, and different solutions will be better for different programs; see section 13 for the specific requirements.
You should also get your employer (if you work as a programmer) or school, if any, to sign a \u201ccopyright disclaimer\u201d for the program, if necessary. For more information on this, and how to apply and follow the GNU AGPL, see https://www.gnu.org/licenses/.
"},{"location":"data/","title":"data","text":"data
provides a collection of data processing utilities for biomolecular data.
While datasets is a powerful library for managing datasets, it is a general-purpose tool that may not cover all the specific functionalities of scientific applications. The data package is designed to complement datasets by offering additional data processing utilities that are commonly used in scientific tasks.
Python
from multimolecule.data import Dataset

data = Dataset("data/rna/5utr.csv", split="train", pretrained="multimolecule/rna")
"},{"location":"data/#load-from-datasets","title":"Load from datasets
","text":"Pythonfrom multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"data/dataset/","title":"Dataset","text":""},{"location":"data/dataset/#multimolecule.data.Dataset","title":"multimolecule.data.Dataset","text":" Bases: Dataset
The base class for all datasets.
Dataset is a subclass of datasets.Dataset that provides additional functionality for handling structured data. It has three main features:
- column identification: identify the special columns (sequence and structure columns) in the dataset.
- tokenization: tokenize the sequence columns in the dataset using a pretrained tokenizer.
- task inference: infer the task type and level of each label column in the dataset.
Attributes:
- tasks (NestedDict): A nested dictionary of the inferred tasks for each label column in the dataset.
- tokenizer (PreTrainedTokenizerBase): The pretrained tokenizer to use for tokenization.
- truncation (bool): Whether to truncate sequences that exceed the maximum length of the tokenizer.
- max_seq_length (int): The maximum length of the input sequences.
- data_cols (List): The names of all columns in the dataset.
- feature_cols (List): The names of the feature columns in the dataset.
- label_cols (List): The names of the label columns in the dataset.
- sequence_cols (List): The names of the sequence columns in the dataset.
- column_names_map (Mapping[str, str] | None): A mapping of column names to new column names.
- preprocess (bool): Whether to preprocess the dataset.
Parameters:
- data (Table | DataFrame | dict | list | str, required): The dataset. This can be a path to a file, a tag on the Hugging Face Hub, a pyarrow.Table, a dict, a list, or a pandas.DataFrame.
- split (NamedSplit, required): The split of the dataset.
- tokenizer (PreTrainedTokenizerBase | None, default None): A pretrained tokenizer to use for tokenization. Either tokenizer or pretrained must be specified.
- pretrained (str | None, default None): The name of a pretrained tokenizer to use for tokenization. Either tokenizer or pretrained must be specified.
- feature_cols (List | None, default None): The names of the feature columns in the dataset. Will be inferred automatically if not specified.
- label_cols (List | None, default None): The names of the label columns in the dataset. Will be inferred automatically if not specified.
- id_cols (List | None, default None): The names of the ID columns in the dataset. Will be inferred automatically if not specified.
- preprocess (bool | None, default None): Whether to preprocess the dataset. Preprocessing involves pre-tokenizing the sequences using the tokenizer. Defaults to True.
- auto_rename_sequence_col (bool | None, default None): Whether to automatically rename sequence columns to the standard name. Only works when there is exactly one sequence column. You can control the naming through multimolecule.defaults.SEQUENCE_COL_NAME. For more refined control, use column_names_map.
- auto_rename_label_col (bool | None, default None): Whether to automatically rename the label column to the standard name. Only works when there is exactly one label column. You can control the naming through multimolecule.defaults.LABEL_COL_NAME. For more refined control, use column_names_map.
- column_names_map (Mapping[str, str] | None, default None): A mapping of column names to new column names. This is useful for renaming columns to inputs that are expected by a model. Defaults to None.
- truncation (bool | None, default None): Whether to truncate sequences that exceed the maximum length of the tokenizer. Defaults to False.
- max_seq_length (int | None, default None): The maximum length of the input sequences. Defaults to the model_max_length of the tokenizer.
- tasks (Mapping[str, Task] | None, default None): A mapping of column names to tasks. Will be inferred automatically if not specified.
- discrete_map (Mapping[str, int] | None, default None): A mapping of column names to discrete mappings. This is useful for mapping the raw value to nominal value in classification tasks. Will be inferred automatically if not specified.
- nan_process (str, default "ignore"): How to handle NaN and inf values in the dataset. Can be "ignore", "error", "drop", or "fill".
- fill_value (str | int | float, default 0): The value to fill NaN and inf values with.
- info (DatasetInfo | None, default None): The dataset info.
- indices_table (Table | None, default None): The indices table.
- fingerprint (str | None, default None): The fingerprint of the dataset.
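As a rough illustration of how several of these parameters combine, the sketch below reuses the 5utr.csv path from the data page; the column name "seq" passed to column_names_map is hypothetical and only demonstrates the renaming mechanism, so adjust it to match your own file.
Python
# A sketch exercising a few constructor options documented above; "seq" is a
# hypothetical column name used only to illustrate column_names_map.
from multimolecule.data import Dataset

data = Dataset(
    "data/rna/5utr.csv",                   # path, Hub tag, dict, list, or DataFrame
    split="train",
    pretrained="multimolecule/rna",        # tokenizer is loaded from this tag
    column_names_map={"seq": "sequence"},  # hypothetical rename of a raw column
    truncation=True,                       # truncate sequences beyond max_seq_length
    max_seq_length=512,
    nan_process="drop",                    # drop rows containing NaN / inf values
)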
Source code in multimolecule/data/dataset.py
Pythonclass Dataset(datasets.Dataset):\n r\"\"\"\n The base class for all datasets.\n\n Dataset is a subclass of [`datasets.Dataset`][] that provides additional functionality for handling structured data.\n It has three main features:\n\n - column identification: identify the special columns (sequence and structure columns) in the dataset.\n - tokenization: tokenize the sequence columns in the dataset using a pretrained tokenizer.\n - task inference: infer the task type and level of each label column in the dataset.\n\n Attributes:\n tasks: A nested dictionary of the inferred tasks for each label column in the dataset.\n tokenizer: The pretrained tokenizer to use for tokenization.\n truncation: Whether to truncate sequences that exceed the maximum length of the tokenizer.\n max_seq_length: The maximum length of the input sequences.\n data_cols: The names of all columns in the dataset.\n feature_cols: The names of the feature columns in the dataset.\n label_cols: The names of the label columns in the dataset.\n sequence_cols: The names of the sequence columns in the dataset.\n column_names_map: A mapping of column names to new column names.\n preprocess: Whether to preprocess the dataset.\n\n Args:\n data: The dataset. This can be a path to a file, a tag on the Hugging Face Hub, a pyarrow.Table,\n a [dict][], a [list][], or a [pandas.DataFrame][].\n split: The split of the dataset.\n tokenizer: A pretrained tokenizer to use for tokenization.\n Either `tokenizer` or `pretrained` must be specified.\n pretrained: The name of a pretrained tokenizer to use for tokenization.\n Either `tokenizer` or `pretrained` must be specified.\n feature_cols: The names of the feature columns in the dataset.\n Will be inferred automatically if not specified.\n label_cols: The names of the label columns in the dataset.\n Will be inferred automatically if not specified.\n id_cols: The names of the ID columns in the dataset.\n Will be inferred automatically if not specified.\n preprocess: Whether to preprocess the dataset.\n Preprocessing involves pre-tokenizing the sequences using the tokenizer.\n Defaults to `True`.\n auto_rename_sequence_col: Whether to automatically rename sequence columns to standard name.\n Only works when there is exactly one sequence column\n You can control the naming through `multimolecule.defaults.SEQUENCE_COL_NAME`.\n For more refined control, use `column_names_map`.\n auto_rename_label_cols: Whether to automatically rename label column to standard name.\n Only works when there is exactly one label column.\n You can control the naming through `multimolecule.defaults.LABEL_COL_NAME`.\n For more refined control, use `column_names_map`.\n column_names_map: A mapping of column names to new column names.\n This is useful for renaming columns to inputs that are expected by a model.\n Defaults to `None`.\n truncation: Whether to truncate sequences that exceed the maximum length of the tokenizer.\n Defaults to `False`.\n max_seq_length: The maximum length of the input sequences.\n Defaults to the `model_max_length` of the tokenizer.\n tasks: A mapping of column names to tasks.\n Will be inferred automatically if not specified.\n discrete_map: A mapping of column names to discrete mappings.\n This is useful for mapping the raw value to nominal value in classification tasks.\n Will be inferred automatically if not specified.\n nan_process: How to handle NaN and inf values in the dataset.\n Can be \"ignore\", \"error\", \"drop\", or \"fill\". 
Defaults to \"ignore\".\n fill_value: The value to fill NaN and inf values with.\n Defaults to 0.\n info: The dataset info.\n indices_table: The indices table.\n fingerprint: The fingerprint of the dataset.\n \"\"\"\n\n tokenizer: PreTrainedTokenizerBase\n truncation: bool = False\n max_seq_length: int\n seq_length_offset: int = 0\n\n _id_cols: List\n _feature_cols: List\n _label_cols: List\n\n _sequence_cols: List\n _secondary_structure_cols: List\n\n _tasks: NestedDict[str, Task]\n _discrete_map: Mapping\n\n preprocess: bool = True\n auto_rename_sequence_col: bool = True\n auto_rename_label_col: bool = False\n column_names_map: Mapping[str, str] | None = None\n ignored_cols: List[str] = []\n\n def __init__(\n self,\n data: Table | DataFrame | dict | list | str,\n split: datasets.NamedSplit,\n tokenizer: PreTrainedTokenizerBase | None = None,\n pretrained: str | None = None,\n feature_cols: List | None = None,\n label_cols: List | None = None,\n id_cols: List | None = None,\n preprocess: bool | None = None,\n auto_rename_sequence_col: bool | None = None,\n auto_rename_label_col: bool | None = None,\n column_names_map: Mapping[str, str] | None = None,\n truncation: bool | None = None,\n max_seq_length: int | None = None,\n tasks: Mapping[str, Task] | None = None,\n discrete_map: Mapping[str, int] | None = None,\n nan_process: str = \"ignore\",\n fill_value: str | int | float = 0,\n info: datasets.DatasetInfo | None = None,\n indices_table: Table | None = None,\n fingerprint: str | None = None,\n ignored_cols: List[str] | None = None,\n ):\n self._tasks = NestedDict()\n if tasks is not None:\n self.tasks = tasks\n if discrete_map is not None:\n self._discrete_map = discrete_map\n arrow_table = self.build_table(\n data, split, feature_cols, label_cols, nan_process=nan_process, fill_value=fill_value\n )\n super().__init__(\n arrow_table=arrow_table, split=split, info=info, indices_table=indices_table, fingerprint=fingerprint\n )\n self.identify_special_cols(feature_cols=feature_cols, label_cols=label_cols, id_cols=id_cols)\n self.post(\n tokenizer=tokenizer,\n pretrained=pretrained,\n preprocess=preprocess,\n truncation=truncation,\n max_seq_length=max_seq_length,\n auto_rename_sequence_col=auto_rename_sequence_col,\n auto_rename_label_col=auto_rename_label_col,\n column_names_map=column_names_map,\n )\n self.ignored_cols = ignored_cols or self.id_cols\n self.train = split == datasets.Split.TRAIN\n\n def build_table(\n self,\n data: Table | DataFrame | dict | str,\n split: datasets.NamedSplit,\n feature_cols: List | None = None,\n label_cols: List | None = None,\n nan_process: str | None = \"ignore\",\n fill_value: str | int | float = 0,\n ) -> datasets.table.Table:\n if isinstance(data, str):\n try:\n data = datasets.load_dataset(data, split=split).data\n except FileNotFoundError:\n data = dl.load_pandas(data)\n if isinstance(data, DataFrame):\n data = data.loc[:, ~data.columns.str.contains(\"^Unnamed\")]\n data = pa.Table.from_pandas(data, preserve_index=False)\n elif isinstance(data, dict):\n data = pa.Table.from_pydict(data)\n elif isinstance(data, list):\n data = pa.Table.from_pylist(data)\n elif isinstance(data, DataFrame):\n data = pa.Table.from_pandas(data, preserve_index=False)\n if feature_cols is not None and label_cols is not None:\n data = data.select(feature_cols + label_cols)\n data = self.process_nan(data, nan_process=nan_process, fill_value=fill_value)\n return data\n\n def post(\n self,\n tokenizer: PreTrainedTokenizerBase | None = None,\n pretrained: str | None = None,\n 
max_seq_length: int | None = None,\n truncation: bool | None = None,\n preprocess: bool | None = None,\n auto_rename_sequence_col: bool | None = None,\n auto_rename_label_col: bool | None = None,\n column_names_map: Mapping[str, str] | None = None,\n ) -> None:\n r\"\"\"\n Perform pre-processing steps after initialization.\n\n It first identifies the special columns (sequence and structure columns) in the dataset.\n Then it sets the feature and label columns based on the input arguments.\n If `auto_rename_sequence_col` is `True`, it will automatically rename the sequence column.\n If `auto_rename_label_col` is `True`, it will automatically rename the label column.\n Finally, it sets the [`transform`][datasets.Dataset.set_transform] function based on the `preprocess` flag.\n \"\"\"\n if tokenizer is None:\n if pretrained is None:\n raise ValueError(\"tokenizer and pretrained can not be both None.\")\n tokenizer = AutoTokenizer.from_pretrained(pretrained)\n if max_seq_length is None:\n max_seq_length = tokenizer.model_max_length\n else:\n tokenizer.model_max_length = max_seq_length\n self.tokenizer = tokenizer\n self.max_seq_length = max_seq_length\n if truncation is not None:\n self.truncation = truncation\n if self.tokenizer.cls_token is not None:\n self.seq_length_offset += 1\n if self.tokenizer.sep_token is not None and self.tokenizer.sep_token != self.tokenizer.eos_token:\n self.seq_length_offset += 1\n if self.tokenizer.eos_token is not None:\n self.seq_length_offset += 1\n if preprocess is not None:\n self.preprocess = preprocess\n if auto_rename_sequence_col is not None:\n self.auto_rename_sequence_col = auto_rename_sequence_col\n if auto_rename_label_col is not None:\n self.auto_rename_label_col = auto_rename_label_col\n if column_names_map is None:\n column_names_map = {}\n if self.auto_rename_sequence_col:\n if len(self.sequence_cols) != 1:\n raise ValueError(\"auto_rename_sequence_col can only be used when there is exactly one sequence column.\")\n column_names_map[self.sequence_cols[0]] = defaults.SEQUENCE_COL_NAME # type: ignore[index]\n if self.auto_rename_label_col:\n if len(self.label_cols) != 1:\n raise ValueError(\"auto_rename_label_col can only be used when there is exactly one label column.\")\n column_names_map[self.label_cols[0]] = defaults.LABEL_COL_NAME # type: ignore[index]\n self.column_names_map = column_names_map\n if self.column_names_map:\n self.rename_columns(self.column_names_map)\n self.infer_tasks()\n\n if self.preprocess:\n self.update(self.map(self.tokenization))\n if self.secondary_structure_cols:\n self.update(self.map(self.convert_secondary_structure))\n if self.discrete_map:\n self.update(self.map(self.map_discrete))\n fn_kwargs = {\n \"columns\": [name for name, task in self.tasks.items() if task.level in [\"token\", \"contact\"]],\n \"max_seq_length\": self.max_seq_length - self.seq_length_offset,\n }\n if self.truncation and 0 < self.max_seq_length < 2**32:\n self.update(self.map(self.truncate, fn_kwargs=fn_kwargs))\n self.set_transform(self.transform)\n\n def transform(self, batch: Mapping) -> Mapping:\n r\"\"\"\n Default [`transform`][datasets.Dataset.set_transform].\n\n See Also:\n [`collate`][multimolecule.Dataset.collate]\n \"\"\"\n return {k: self.collate(k, v) for k, v in batch.items() if k not in self.ignored_cols}\n\n def collate(self, col: str, data: Any) -> Tensor | NestedTensor | None:\n r\"\"\"\n Collate the data for a column.\n\n If the column is a sequence column, it will tokenize the data if `tokenize` is `True`.\n Otherwise, it 
will return a tensor or nested tensor.\n \"\"\"\n if col in self.sequence_cols:\n if isinstance(data[0], str):\n data = self.tokenize(data)\n return NestedTensor(data)\n if not self.preprocess:\n if col in self.discrete_map:\n data = map_value(data, self.discrete_map[col])\n if col in self.tasks:\n data = truncate_value(data, self.max_seq_length - self.seq_length_offset, self.tasks[col].level)\n if isinstance(data[0], str):\n return data\n try:\n return torch.tensor(data)\n except ValueError:\n return NestedTensor(data)\n\n def infer_tasks(self, sequence_col: str | None = None) -> NestedDict:\n for col in self.label_cols:\n if col in self.tasks:\n continue\n if col in self.secondary_structure_cols:\n task = Task(TaskType.Binary, level=TaskLevel.Contact, num_labels=1)\n self.tasks[col] = task # type: ignore[index]\n warn(\n f\"Secondary structure columns are assumed to be {task}. \"\n \"Please explicitly specify the task if this is not the case.\"\n )\n else:\n try:\n self.tasks[col] = self.infer_task(col, sequence_col) # type: ignore[index]\n except ValueError:\n raise ValueError(f\"Unable to infer task for column {col}.\")\n return self.tasks\n\n def infer_task(self, label_col: str, sequence_col: str | None = None) -> Task:\n if sequence_col is None:\n if len(self.sequence_cols) != 1:\n raise ValueError(\"sequence_col must be specified if there are multiple sequence columns.\")\n sequence_col = self.sequence_cols[0]\n sequence = self._data.column(sequence_col)\n column = self._data.column(label_col)\n return infer_task(\n sequence,\n column,\n truncation=self.truncation,\n max_seq_length=self.max_seq_length,\n seq_length_offset=self.seq_length_offset,\n )\n\n def infer_discrete_map(self, discrete_map: Mapping | None = None):\n self._discrete_map = discrete_map or NestedDict()\n ignored_cols = set(self.discrete_map.keys()) | set(self.sequence_cols) | set(self.secondary_structure_cols)\n data_cols = [i for i in self.data_cols if i not in ignored_cols]\n for col in data_cols:\n discrete_map = infer_discrete_map(self._data.column(col))\n if discrete_map:\n self._discrete_map[col] = discrete_map # type: ignore[index]\n return self._discrete_map\n\n def __getitems__(self, keys: int | slice | Iterable[int]) -> Any:\n return self.__getitem__(keys)\n\n def identify_special_cols(\n self, feature_cols: List | None = None, label_cols: List | None = None, id_cols: List | None = None\n ) -> Sequence:\n all_cols = self.data.column_names\n self._id_cols = id_cols or [i for i in all_cols if i in defaults.ID_COL_NAMES]\n\n string_cols: list[str] = [k for k, v in self.features.items() if k not in self.id_cols and v.dtype == \"string\"]\n self._sequence_cols = [i for i in string_cols if i.lower() in defaults.SEQUENCE_COL_NAMES]\n self._secondary_structure_cols = [i for i in string_cols if i in defaults.SECONDARY_STRUCTURE_COL_NAMES]\n\n data_cols = [i for i in all_cols if i not in self.id_cols]\n if label_cols is None:\n if feature_cols is None:\n feature_cols = [i for i in data_cols if i in defaults.SEQUENCE_COL_NAMES]\n label_cols = [i for i in data_cols if i not in feature_cols]\n self._label_cols = label_cols\n if feature_cols is None:\n feature_cols = [i for i in data_cols if i not in self.label_cols]\n self._feature_cols = feature_cols\n missing_feature_cols = set(self.feature_cols).difference(data_cols)\n if missing_feature_cols:\n raise ValueError(f\"{missing_feature_cols} are specified in feature_cols, but not found in dataset.\")\n missing_label_cols = set(self.label_cols).difference(data_cols)\n 
if missing_label_cols:\n raise ValueError(f\"{missing_label_cols} are specified in label_cols, but not found in dataset.\")\n return string_cols\n\n def tokenize(self, string: str) -> Tensor:\n return self.tokenizer(string, return_attention_mask=False, truncation=self.truncation)[\"input_ids\"]\n\n def tokenization(self, data: Mapping[str, str]) -> Mapping[str, Tensor]:\n return {col: self.tokenize(data[col]) for col in self.sequence_cols}\n\n def convert_secondary_structure(self, data: Mapping) -> Mapping:\n return {col: dot_bracket_to_contact_map(data[col]) for col in self.secondary_structure_cols}\n\n def map_discrete(self, data: Mapping) -> Mapping:\n return {name: map_value(data[name], mapping) for name, mapping in self.discrete_map.items()}\n\n def truncate(self, data: Mapping, columns: List[str], max_seq_length: int) -> Mapping:\n return {name: truncate_value(data[name], max_seq_length, self.tasks[name].level) for name in columns}\n\n def update(self, dataset: datasets.Dataset):\n r\"\"\"\n Perform an in-place update of the dataset.\n\n This method is used to update the dataset after changes have been made to the underlying data.\n It updates the format columns, data, info, and fingerprint of the dataset.\n \"\"\"\n # pylint: disable=W0212\n # Why datasets won't support in-place changes?\n # It's just impossible to extend.\n self._format_columns = dataset._format_columns\n self._data = dataset._data\n self._info = dataset._info\n self._fingerprint = dataset._fingerprint\n\n def rename_columns(self, column_mapping: Mapping[str, str], new_fingerprint: str | None = None) -> datasets.Dataset:\n self.update(super().rename_columns(column_mapping, new_fingerprint=new_fingerprint))\n self._id_cols = [column_mapping.get(i, i) for i in self.id_cols]\n self._feature_cols = [column_mapping.get(i, i) for i in self.feature_cols]\n self._label_cols = [column_mapping.get(i, i) for i in self.label_cols]\n self._sequence_cols = [column_mapping.get(i, i) for i in self.sequence_cols]\n self._secondary_structure_cols = [column_mapping.get(i, i) for i in self.secondary_structure_cols]\n self.tasks = {column_mapping.get(k, k): v for k, v in self.tasks.items()}\n return self\n\n def rename_column(\n self, original_column_name: str, new_column_name: str, new_fingerprint: str | None = None\n ) -> datasets.Dataset:\n self.update(super().rename_column(original_column_name, new_column_name, new_fingerprint))\n self._id_cols = [new_column_name if i == original_column_name else i for i in self.id_cols]\n self._feature_cols = [new_column_name if i == original_column_name else i for i in self.feature_cols]\n self._label_cols = [new_column_name if i == original_column_name else i for i in self.label_cols]\n self._sequence_cols = [new_column_name if i == original_column_name else i for i in self.sequence_cols]\n self._secondary_structure_cols = [\n new_column_name if i == original_column_name else i for i in self.secondary_structure_cols\n ]\n self.tasks = {new_column_name if k == original_column_name else k: v for k, v in self.tasks.items()}\n return self\n\n def process_nan(self, data: Table, nan_process: str | None, fill_value: str | int | float = 0) -> Table:\n if nan_process == \"ignore\":\n return data\n data = data.to_pandas()\n data = data.replace([float(\"inf\"), -float(\"inf\")], float(\"nan\"))\n if data.isnull().values.any():\n if nan_process is None or nan_process == \"error\":\n raise ValueError(\"NaN / inf values have been found in the dataset.\")\n warn(\n \"NaN / inf values have been found in the 
dataset.\\n\"\n \"While we can handle them, the data type of the corresponding column may be set to float, \"\n \"which can and very likely will disrupt the auto task recognition.\\n\"\n \"It is recommended to address these values before loading the dataset.\"\n )\n if nan_process == \"drop\":\n data = data.dropna()\n elif nan_process == \"fill\":\n data = data.fillna(fill_value)\n else:\n raise ValueError(f\"Invalid nan_process: {nan_process}\")\n return pa.Table.from_pandas(data, preserve_index=False)\n\n @property\n def id_cols(self) -> List:\n return self._id_cols\n\n @property\n def data_cols(self) -> List:\n return self.feature_cols + self.label_cols\n\n @property\n def feature_cols(self) -> List:\n return self._feature_cols\n\n @property\n def label_cols(self) -> List:\n return self._label_cols\n\n @property\n def sequence_cols(self) -> List:\n return self._sequence_cols\n\n @property\n def secondary_structure_cols(self) -> List:\n return self._secondary_structure_cols\n\n @property\n def tasks(self) -> NestedDict:\n if not hasattr(self, \"_tasks\"):\n self._tasks = NestedDict()\n return self.infer_tasks()\n return self._tasks\n\n @tasks.setter\n def tasks(self, tasks: Mapping):\n self._tasks = NestedDict()\n for name, task in tasks.items():\n if not isinstance(task, Task):\n task = Task(**task)\n self._tasks[name] = task\n\n @property\n def discrete_map(self) -> Mapping:\n if not hasattr(self, \"_discrete_map\"):\n return self.infer_discrete_map()\n return self._discrete_map\n
"},{"location":"data/dataset/#multimolecule.data.Dataset(data)","title":"data
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(split)","title":"split
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(tokenizer)","title":"tokenizer
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(pretrained)","title":"pretrained
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(feature_cols)","title":"feature_cols
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(label_cols)","title":"label_cols
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(id_cols)","title":"id_cols
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(preprocess)","title":"preprocess
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(auto_rename_sequence_col)","title":"auto_rename_sequence_col
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(auto_rename_label_cols)","title":"auto_rename_label_cols
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(column_names_map)","title":"column_names_map
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(truncation)","title":"truncation
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(max_seq_length)","title":"max_seq_length
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(tasks)","title":"tasks
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(discrete_map)","title":"discrete_map
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(nan_process)","title":"nan_process
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(fill_value)","title":"fill_value
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(info)","title":"info
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(indices_table)","title":"indices_table
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(fingerprint)","title":"fingerprint
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset.post","title":"post","text":"Pythonpost(tokenizer: PreTrainedTokenizerBase | None = None, pretrained: str | None = None, max_seq_length: int | None = None, truncation: bool | None = None, preprocess: bool | None = None, auto_rename_sequence_col: bool | None = None, auto_rename_label_col: bool | None = None, column_names_map: Mapping[str, str] | None = None) -> None\n
Perform pre-processing steps after initialization.
It first identifies the special columns (sequence and structure columns) in the dataset. Then it sets the feature and label columns based on the input arguments. If auto_rename_sequence_col
is True
, it will automatically rename the sequence column. If auto_rename_label_col
is True
, it will automatically rename the label column. Finally, it sets the transform
function based on the preprocess
flag.
multimolecule/data/dataset.py
Pythondef post(\n self,\n tokenizer: PreTrainedTokenizerBase | None = None,\n pretrained: str | None = None,\n max_seq_length: int | None = None,\n truncation: bool | None = None,\n preprocess: bool | None = None,\n auto_rename_sequence_col: bool | None = None,\n auto_rename_label_col: bool | None = None,\n column_names_map: Mapping[str, str] | None = None,\n) -> None:\n r\"\"\"\n Perform pre-processing steps after initialization.\n\n It first identifies the special columns (sequence and structure columns) in the dataset.\n Then it sets the feature and label columns based on the input arguments.\n If `auto_rename_sequence_col` is `True`, it will automatically rename the sequence column.\n If `auto_rename_label_col` is `True`, it will automatically rename the label column.\n Finally, it sets the [`transform`][datasets.Dataset.set_transform] function based on the `preprocess` flag.\n \"\"\"\n if tokenizer is None:\n if pretrained is None:\n raise ValueError(\"tokenizer and pretrained can not be both None.\")\n tokenizer = AutoTokenizer.from_pretrained(pretrained)\n if max_seq_length is None:\n max_seq_length = tokenizer.model_max_length\n else:\n tokenizer.model_max_length = max_seq_length\n self.tokenizer = tokenizer\n self.max_seq_length = max_seq_length\n if truncation is not None:\n self.truncation = truncation\n if self.tokenizer.cls_token is not None:\n self.seq_length_offset += 1\n if self.tokenizer.sep_token is not None and self.tokenizer.sep_token != self.tokenizer.eos_token:\n self.seq_length_offset += 1\n if self.tokenizer.eos_token is not None:\n self.seq_length_offset += 1\n if preprocess is not None:\n self.preprocess = preprocess\n if auto_rename_sequence_col is not None:\n self.auto_rename_sequence_col = auto_rename_sequence_col\n if auto_rename_label_col is not None:\n self.auto_rename_label_col = auto_rename_label_col\n if column_names_map is None:\n column_names_map = {}\n if self.auto_rename_sequence_col:\n if len(self.sequence_cols) != 1:\n raise ValueError(\"auto_rename_sequence_col can only be used when there is exactly one sequence column.\")\n column_names_map[self.sequence_cols[0]] = defaults.SEQUENCE_COL_NAME # type: ignore[index]\n if self.auto_rename_label_col:\n if len(self.label_cols) != 1:\n raise ValueError(\"auto_rename_label_col can only be used when there is exactly one label column.\")\n column_names_map[self.label_cols[0]] = defaults.LABEL_COL_NAME # type: ignore[index]\n self.column_names_map = column_names_map\n if self.column_names_map:\n self.rename_columns(self.column_names_map)\n self.infer_tasks()\n\n if self.preprocess:\n self.update(self.map(self.tokenization))\n if self.secondary_structure_cols:\n self.update(self.map(self.convert_secondary_structure))\n if self.discrete_map:\n self.update(self.map(self.map_discrete))\n fn_kwargs = {\n \"columns\": [name for name, task in self.tasks.items() if task.level in [\"token\", \"contact\"]],\n \"max_seq_length\": self.max_seq_length - self.seq_length_offset,\n }\n if self.truncation and 0 < self.max_seq_length < 2**32:\n self.update(self.map(self.truncate, fn_kwargs=fn_kwargs))\n self.set_transform(self.transform)\n
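As a usage sketch of the options handled by post (reusing the bpRNA-SPOT example shown in the datasets section; the specific max_seq_length value is an arbitrary assumption):
Pythonfrom multimolecule.data import Dataset\n\n# post() loads the tokenizer from the pretrained checkpoint, applies max_seq_length and\n# truncation, renames columns, infers tasks, and finally registers the transform.\ndataset = Dataset(\n    \"multimolecule/bprna-spot\",\n    split=\"train\",\n    pretrained=\"multimolecule/rna\",\n    max_seq_length=512,  # overrides tokenizer.model_max_length\n    truncation=True,     # token- and contact-level labels are truncated to fit\n)\n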
"},{"location":"data/dataset/#multimolecule.data.Dataset.transform","title":"transform","text":"Pythontransform(batch: Mapping) -> Mapping\n
Default transform
.
collate
multimolecule/data/dataset.py
Pythondef transform(self, batch: Mapping) -> Mapping:\n r\"\"\"\n Default [`transform`][datasets.Dataset.set_transform].\n\n See Also:\n [`collate`][multimolecule.Dataset.collate]\n \"\"\"\n return {k: self.collate(k, v) for k, v in batch.items() if k not in self.ignored_cols}\n
"},{"location":"data/dataset/#multimolecule.data.Dataset.collate","title":"collate","text":"Pythoncollate(col: str, data: Any) -> Tensor | NestedTensor | None\n
Collate the data for a column.
If the column is a sequence column, it will tokenize the data if tokenize
is True
. Otherwise, it will return a tensor or nested tensor.
multimolecule/data/dataset.py
Pythondef collate(self, col: str, data: Any) -> Tensor | NestedTensor | None:\n r\"\"\"\n Collate the data for a column.\n\n If the column is a sequence column, it will tokenize the data if `tokenize` is `True`.\n Otherwise, it will return a tensor or nested tensor.\n \"\"\"\n if col in self.sequence_cols:\n if isinstance(data[0], str):\n data = self.tokenize(data)\n return NestedTensor(data)\n if not self.preprocess:\n if col in self.discrete_map:\n data = map_value(data, self.discrete_map[col])\n if col in self.tasks:\n data = truncate_value(data, self.max_seq_length - self.seq_length_offset, self.tasks[col].level)\n if isinstance(data[0], str):\n return data\n try:\n return torch.tensor(data)\n except ValueError:\n return NestedTensor(data)\n
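A rough sketch of the effect (assuming the dataset object constructed in the sketch above and the default column names of bpRNA-SPOT):
Python# Indexing goes through the registered transform, which calls collate() per column.\nbatch = dataset[:4]\nprint(batch.keys())                        # columns listed in ignored_cols are dropped\nprint(type(batch[\"sequence\"]))             # sequence columns come back as a NestedTensor of input ids\nprint(type(batch[\"secondary_structure\"]))  # label columns become Tensor or NestedTensor\n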
"},{"location":"data/dataset/#multimolecule.data.Dataset.update","title":"update","text":"Pythonupdate(dataset: Dataset)\n
Perform an in-place update of the dataset.
This method is used to update the dataset after changes have been made to the underlying data. It updates the format columns, data, info, and fingerprint of the dataset.
Source code in multimolecule/data/dataset.py
Pythondef update(self, dataset: datasets.Dataset):\n r\"\"\"\n Perform an in-place update of the dataset.\n\n This method is used to update the dataset after changes have been made to the underlying data.\n It updates the format columns, data, info, and fingerprint of the dataset.\n \"\"\"\n # pylint: disable=W0212\n # Why datasets won't support in-place changes?\n # It's just impossible to extend.\n self._format_columns = dataset._format_columns\n self._data = dataset._data\n self._info = dataset._info\n self._fingerprint = dataset._fingerprint\n
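For instance, the renaming helpers of the class rely on update to mutate the dataset in place and hand back the same object (a small sketch assuming the dataset object from the earlier example; the column mapping is illustrative):
Python# rename_columns() calls update() internally, so no new Dataset object is created\nsame = dataset.rename_columns({\"secondary_structure\": \"label\"})\nassert same is dataset\n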
"},{"location":"datasets/","title":"datasets","text":"datasets
provides a collection of widely used datasets.
from multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"datasets/archiveii/","title":"ArchiveII","text":"ArchiveII is a dataset of RNA sequences and their secondary structures, widely used in RNA secondary structure prediction benchmarks.
ArchiveII contains 2975 RNA samples across 10 RNA families, with sequence lengths ranging from 28 to 2968 nucleotides. This dataset is frequently used to evaluate RNA secondary structure prediction methods, including those that handle both pseudoknotted and non-pseudoknotted structures.
It is considered complementary to the RNAStrAlign dataset.
"},{"location":"datasets/archiveii/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the ArchiveII by Mehdi Saman Booy, et al.
The team releasing ArchiveII did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/archiveii/#dataset-description","title":"Dataset Description","text":"id: A unique identifier for each RNA entry. This ID is derived from the family and the original .sta
file name, and serves as a reference to the specific RNA structure within the dataset.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA\u2019s standard:
.
): Represent unpaired nucleotides.(
and )
): Represent base pairs in standard stems (page 1).family: The RNA family to which the sequence belongs, such as 16S rRNA, 5S rRNA, etc.
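The dot-bracket strings described above can be converted into explicit base-pair indices with a simple stack; the helper below is an illustrative sketch (not part of MultiMolecule) covering the dot and parenthesis symbols used here:
Pythondef dot_bracket_pairs(structure: str) -> list[tuple[int, int]]:\n    # Pair each ')' with the most recent unmatched '('; '.' positions stay unpaired.\n    stack, pairs = [], []\n    for i, symbol in enumerate(structure):\n        if symbol == \"(\":\n            stack.append(i)\n        elif symbol == \")\":\n            pairs.append((stack.pop(), i))\n    return pairs\n\nprint(dot_bracket_pairs(\"((..)).\"))  # [(1, 4), (0, 5)]\n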
This dataset is available in two additional variants:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/archiveii/#citation","title":"Citation","text":"BibTeX@article{samanbooy2022rna,\n author = {Saman Booy, Mehdi and Ilin, Alexander and Orponen, Pekka},\n journal = {BMC Bioinformatics},\n keywords = {Deep learning; Pseudoknotted structures; RNA structure prediction},\n month = feb,\n number = 1,\n pages = {58},\n publisher = {Springer Science and Business Media LLC},\n title = {{RNA} secondary structure prediction with convolutional neural networks},\n volume = 23,\n year = 2022\n}\n
"},{"location":"datasets/bprna-new/","title":"bpRNA-1m","text":"bpRNA-new is a database of single molecule secondary structures annotated using bpRNA.
bpRNA-new is a dataset of RNA families from Rfam 14.2, designed for cross-family validation to assess generalization capability. It focuses on families distinct from those in bpRNA-1m, providing a robust benchmark for evaluating model performance on unseen RNA families.
"},{"location":"datasets/bprna-new/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the bpRNA-new by Kengo Sato, et al.
The team releasing bpRNA-new did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/bprna-new/#dataset-description","title":"Dataset Description","text":"This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/bprna-new/#citation","title":"Citation","text":"BibTeX@article{sato2021rna,\n author = {Sato, Kengo and Akiyama, Manato and Sakakibara, Yasubumi},\n journal = {Nature Communications},\n month = feb,\n number = 1,\n pages = {941},\n publisher = {Springer Science and Business Media LLC},\n title = {{RNA} secondary structure prediction using deep learning with thermodynamic integration},\n volume = 12,\n year = 2021\n}\n
"},{"location":"datasets/bprna-spot/","title":"bpRNA-1m","text":"bpRNA-spot is a database of single molecule secondary structures annotated using bpRNA.
bpRNA-spot is a subset of bpRNA-1m. It applies CD-HIT (CD-HIT-EST) to remove sequences with more than 80% sequence similarity from bpRNA-1m. It further randomly splits the remaining sequences into training, validation, and test sets with a ratio of approximately 8:1:1.
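A loading sketch for the three splits with the Dataset class (the split names train, validation, and test are assumed here; verify them against the dataset card):
Pythonfrom multimolecule.data import Dataset\n\n# split names are assumed\nsplits = {\n    name: Dataset(\"multimolecule/bprna-spot\", split=name, pretrained=\"multimolecule/rna\")\n    for name in (\"train\", \"validation\", \"test\")\n}\n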
"},{"location":"datasets/bprna-spot/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the bpRNA-spot by Jaswinder Singh, et al.
The team releasing bpRNA-spot did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/bprna-spot/#dataset-description","title":"Dataset Description","text":"This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/bprna-spot/#citation","title":"Citation","text":"BibTeX@article{singh2019rna,\n author = {Singh, Jaswinder and Hanson, Jack and Paliwal, Kuldip and Zhou, Yaoqi},\n journal = {Nature Communications},\n month = nov,\n number = 1,\n pages = {5407},\n publisher = {Springer Science and Business Media LLC},\n title = {{RNA} secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning},\n volume = 10,\n year = 2019\n}\n\n@article{darty2009varna,\n author = {Darty, K{\\'e}vin and Denise, Alain and Ponty, Yann},\n journal = {Bioinformatics},\n month = aug,\n number = 15,\n pages = {1974--1975},\n publisher = {Oxford University Press (OUP)},\n title = {{VARNA}: Interactive drawing and editing of the {RNA} secondary structure},\n volume = 25,\n year = 2009\n}\n\n@article{berman2000protein,\n author = {Berman, H M and Westbrook, J and Feng, Z and Gilliland, G and Bhat, T N and Weissig, H and Shindyalov, I N and Bourne, P E},\n journal = {Nucleic Acids Research},\n month = jan,\n number = 1,\n pages = {235--242},\n publisher = {Oxford University Press (OUP)},\n title = {The Protein Data Bank},\n volume = 28,\n year = 2000\n}\n
"},{"location":"datasets/bprna/","title":"bpRNA-1m","text":"bpRNA-1m is a database of single molecule secondary structures annotated using bpRNA.
"},{"location":"datasets/bprna/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the bpRNA-1m by Center for Quantitative Life Sciences of the Oregon State University.
The team releasing bpRNA-1m did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/bprna/#dataset-description","title":"Dataset Description","text":"The converted dataset consists of the following columns, each providing specific information about the RNA secondary structures, consistent with the bpRNA standard:
id: A unique identifier for each RNA entry. This ID is derived from the original .sta
file name and serves as a reference to the specific RNA structure within the dataset.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA\u2019s standard:
.
): Represent unpaired nucleotides.(
and )
): Represent base pairs in standard stems (page 1).[
and ]
): Represent base pairs in pseudoknots (page 2).{
and }
): Represent base pairs in additional pseudoknots (page 3).structural_annotation: Structural annotations categorizing different regions of the RNA based on their roles within the secondary structure, consistent with bpRNA standards:
functional_annotation: Functional annotations indicating specific functional elements or regions within the RNA sequence, as defined by bpRNA:
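The bracket pages described above can be parsed with one stack per bracket type; the sketch below is illustrative only and not part of MultiMolecule:
PythonPAIRS = {\")\": \"(\", \"]\": \"[\", \"}\": \"{\"}\n\ndef dot_bracket_pairs(structure: str) -> list[tuple[int, int]]:\n    # One stack per bracket page; '.' positions remain unpaired.\n    stacks = {opener: [] for opener in PAIRS.values()}\n    pairs = []\n    for i, symbol in enumerate(structure):\n        if symbol in stacks:\n            stacks[symbol].append(i)\n        elif symbol in PAIRS:\n            pairs.append((stacks[PAIRS[symbol]].pop(), i))\n    return sorted(pairs)\n\nprint(dot_bracket_pairs(\"((..[[..))..]]\"))  # [(0, 9), (1, 8), (4, 13), (5, 12)]\n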
This dataset is available in two variants:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/bprna/#citation","title":"Citation","text":"BibTeX@article{danaee2018bprna,\n author = {Danaee, Padideh and Rouches, Mason and Wiley, Michelle and Deng, Dezhong and Huang, Liang and Hendrix, David},\n journal = {Nucleic Acids Research},\n month = jun,\n number = 11,\n pages = {5381--5394},\n title = {{bpRNA}: large-scale automated annotation and analysis of {RNA} secondary structure},\n volume = 46,\n year = 2018\n}\n\n@article{cannone2002comparative,\n author = {Cannone, Jamie J and Subramanian, Sankar and Schnare, Murray N and Collett, James R and D'Souza, Lisa M and Du, Yushi and Feng, Brian and Lin, Nan and Madabusi, Lakshmi V and M{\\\"u}ller, Kirsten M and Pande, Nupur and Shang, Zhidi and Yu, Nan and Gutell, Robin R},\n copyright = {https://www.springernature.com/gp/researchers/text-and-data-mining},\n journal = {BMC Bioinformatics},\n month = jan,\n number = 1,\n pages = {2},\n publisher = {Springer Science and Business Media LLC},\n title = {The comparative {RNA} web ({CRW}) site: an online database of comparative sequence and structure information for ribosomal, intron, and other {RNAs}},\n volume = 3,\n year = 2002\n}\n\n@article{zwieb2003tmrdb,\n author = {Zwieb, Christian and Gorodkin, Jan and Knudsen, Bjarne and Burks, Jody and Wower, Jacek},\n journal = {Nucleic Acids Research},\n month = jan,\n number = 1,\n pages = {446--447},\n publisher = {Oxford University Press (OUP)},\n title = {{tmRDB} ({tmRNA} database)},\n volume = 31,\n year = 2003\n}\n\n@article{rosenblad2003srpdb,\n author = {Rosenblad, Magnus Alm and Gorodkin, Jan and Knudsen, Bjarne and Zwieb, Christian and Samuelsson, Tore},\n journal = {Nucleic Acids Research},\n month = jan,\n number = 1,\n pages = {363--364},\n publisher = {Oxford University Press (OUP)},\n title = {{SRPDB}: Signal Recognition Particle Database},\n volume = 31,\n year = 2003\n}\n\n@article{sprinzl2005compilation,\n author = {Sprinzl, Mathias and Vassilenko, Konstantin S},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {Database issue},\n pages = {D139--40},\n publisher = {Oxford University Press (OUP)},\n title = {Compilation of {tRNA} sequences and sequences of {tRNA} genes},\n volume = 33,\n year = 2005\n}\n\n@article{brown1994ribonuclease,\n author = {Brown, J W and Haas, E S and Gilbert, D G and Pace, N R},\n journal = {Nucleic Acids Research},\n month = sep,\n number = 17,\n pages = {3660--3662},\n publisher = {Oxford University Press (OUP)},\n title = {The Ribonuclease {P} database},\n volume = 22,\n year = 1994\n}\n\n@article{griffiths2003rfam,\n author = {Griffiths-Jones, Sam and Bateman, Alex and Marshall, Mhairi and Khanna, Ajay and Eddy, Sean R},\n journal = {Nucleic Acids Research},\n month = jan,\n number = 1,\n pages = {439--441},\n publisher = {Oxford University Press (OUP)},\n title = {Rfam: an {RNA} family database},\n volume = 31,\n year = 2003\n}\n\n@article{berman2000protein,\n author = {Berman, H M and Westbrook, J and Feng, Z and Gilliland, G and Bhat, T N and Weissig, H and Shindyalov, I N and Bourne, P E},\n journal = {Nucleic Acids Research},\n month = jan,\n number = 1,\n pages = {235--242},\n publisher = {Oxford University Press (OUP)},\n title = {The Protein Data Bank},\n volume = 28,\n year = 2000\n}\n
"},{"location":"datasets/eternabench-cm/","title":"EternaBench-CM","text":"EternaBench-CM is a synthetic RNA dataset comprising 12,711 RNA constructs that have been chemically mapped using SHAPE and MAP-seq methods. These RNA sequences are probed to obtain experimental data on their nucleotide reactivity, which indicates whether specific regions of the RNA are flexible or structured. The dataset provides high-resolution, large-scale data that can be used for studying RNA folding and stability.
"},{"location":"datasets/eternabench-cm/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the EternaBench-CM by Hannah K. Wayment-Steele, et al.
The team releasing EternaBench-CM did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/eternabench-cm/#dataset-description","title":"Dataset Description","text":"The dataset includes a large set of synthetic RNA sequences with experimental chemical mapping data, which provides a quantitative readout of RNA nucleotide reactivity. These data are ensemble-averaged and serve as a critical benchmark for evaluating secondary structure prediction algorithms in their ability to model RNA folding dynamics.
"},{"location":"datasets/eternabench-cm/#example-entry","title":"Example Entry","text":"index design sequence secondary_structure reactivity errors signal_to_noise 769337-1 d+m plots weaker again GGAAAAAAAAAAA\u2026 ................ [0.642,1.4853,0.1629, \u2026] [0.3181,0.4221,0.1823, \u2026] 3.227"},{"location":"datasets/eternabench-cm/#column-description","title":"Column Description","text":"id: A unique identifier for each RNA sequence entry.
design: The name given to each RNA design by contributors, used for easy reference.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA\u2019s standard:
.
): Represent unpaired nucleotides.(
and )
): Represent base pairs in standard stems (page 1).[
and ]
): Represent base pairs in pseudoknots (page 2).{
and }
): Represent base pairs in additional pseudoknots (page 3).reactivity: A list of normalized reactivity values for each nucleotide, representing the likelihood that a nucleotide is unpaired. High reactivity indicates high flexibility (unpaired regions), and low reactivity corresponds to paired or structured regions.
errors: Arrays of floating-point numbers indicating the experimental errors corresponding to the measurements in the reactivity. These values help quantify the uncertainty in the degradation rates and reactivity measurements.
signal_to_noise: The signal-to-noise ratio calculated from the reactivity and error values, providing a measure of data quality.
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/eternabench-cm/#citation","title":"Citation","text":"BibTeX@article{waymentsteele2022rna,\n author = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},\n journal = {Nature Methods},\n month = oct,\n number = 10,\n pages = {1234--1242},\n publisher = {Springer Science and Business Media LLC},\n title = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},\n volume = 19,\n year = 2022\n}\n
"},{"location":"datasets/eternabench-external/","title":"EternaBench-External","text":"EternaBench-External consists of 31 independent RNA datasets from various biological sources, including viral genomes, mRNAs, and synthetic RNAs. These sequences were probed using techniques such as SHAPE-CE, SHAPE-MaP, and DMS-MaP-seq to understand RNA secondary structures under different experimental and biological conditions. This dataset serves as a benchmark for evaluating RNA structure prediction models, with a particular focus on generalization to natural RNA molecules.
"},{"location":"datasets/eternabench-external/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the EternaBench-External by Hannah K. Wayment-Steele, et al.
The team releasing EternaBench-External did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/eternabench-external/#dataset-description","title":"Dataset Description","text":"This dataset includes RNA sequences from various biological origins, including viral genomes and mRNAs, and covers a wide range of probing methods like SHAPE-CE and icSHAPE. Each dataset entry provides sequence information, reactivity profiles, and RNA secondary structure data. This dataset can be used to examine how RNA structures vary under different conditions and to validate structural predictions for diverse RNA types.
"},{"location":"datasets/eternabench-external/#example-entry","title":"Example Entry","text":"name sequence reactivity seqpos class dataset Dadonaite,2019 Influenza genome SHAPE(1M7) SSII-Mn(2+) Mut. TTTACCCACAGCTGTGAATT\u2026 [0.639309,0.813297,0.622869,\u2026] [7425,7426,7427,\u2026] viral_gRNA Dadonaite,2019"},{"location":"datasets/eternabench-external/#column-description","title":"Column Description","text":"name: The name of the dataset entry, typically including the experimental setup and biological source.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
reactivity: A list of normalized reactivity values for each nucleotide, representing the likelihood that a nucleotide is unpaired. High reactivity indicates high flexibility (unpaired regions), and low reactivity corresponds to paired or structured regions.
seqpos: A list of sequence positions corresponding to each nucleotide in the sequence.
class: The type of RNA sequence, can be one of the following:
dataset: The source or reference for the dataset entry, indicating its origin.
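Because reactivity values are reported for the positions listed in seqpos, the two columns can be zipped into a per-position profile, as in the sketch below (illustrative only):
Python# truncated values from the example entry above\nreactivity = [0.639309, 0.813297, 0.622869]\nseqpos = [7425, 7426, 7427]\n\n# map each sequence position to its measured reactivity\nprofile = dict(zip(seqpos, reactivity))\nprint(profile[7426])  # 0.813297\n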
This dataset is available in four variants:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/eternabench-external/#citation","title":"Citation","text":"BibTeX@article{waymentsteele2022rna,\n author = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},\n journal = {Nature Methods},\n month = oct,\n number = 10,\n pages = {1234--1242},\n publisher = {Springer Science and Business Media LLC},\n title = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},\n volume = 19,\n year = 2022\n}\n
"},{"location":"datasets/eternabench-switch/","title":"EternaBench-Switch","text":"EternaBench-Switch is a synthetic RNA dataset consisting of 7,228 riboswitch constructs, designed to explore the structural behavior of RNA molecules that change conformation upon binding to ligands such as FMN, theophylline, or tryptophan. These riboswitches exhibit different structural states in the presence or absence of their ligands, and the dataset includes detailed measurements of binding affinities (dissociation constants), activation ratios, and RNA folding properties.
"},{"location":"datasets/eternabench-switch/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the EternaBench-Switch by Hannah K. Wayment-Steele, et al.
The team releasing EternaBench-Switch did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/eternabench-switch/#dataset-description","title":"Dataset Description","text":"The dataset includes synthetic RNA sequences designed to act as riboswitches. These molecules can adopt different structural states in response to ligand binding, and the dataset provides detailed information on the binding affinities for various ligands, along with metrics on the RNA\u2019s ability to switch between conformations. With over 7,000 entries, this dataset is highly useful for studying RNA folding, ligand interaction, and RNA structural dynamics.
"},{"location":"datasets/eternabench-switch/#example-entry","title":"Example Entry","text":"id design sequence activation_ratio ligand switch kd_off kd_on kd_fmn kd_no_fmn min_kd_val ms2_aptamer lig_aptamer ms2_lig_aptamer log_kd_nolig log_kd_lig log_kd_nolig_scaled log_kd_lig_scaled log_AR folding_subscore num_clusters 286 null AGGAAACAUGAGGAU\u2026 0.8824621522 FMN OFF 13.3115 15.084 null null 3.0082 .....(((((x((xxxx)))))))..... .................. .....(((((x((xx\u2026 2.7137 2.5886 1.6123 1.4873 -0.125 null null"},{"location":"datasets/eternabench-switch/#column-description","title":"Column Description","text":"id: A unique identifier for each RNA sequence entry.
design: The name given to each RNA design by contributors, used for easy reference.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
activation_ratio: The ratio reflecting the RNA molecule\u2019s structural change between two states (e.g., ON and OFF) upon ligand binding.
ligand: The small molecule ligand (e.g., FMN, theophylline) that the RNA is designed to bind to, inducing the switch.
switch: A binary or categorical value indicating whether the RNA demonstrates switching behavior.
kd_off: The dissociation constant (KD) when the RNA is in the \u201cOFF\u201d state (without ligand), representing its binding affinity.
kd_on: The dissociation constant (KD) when the RNA is in the \u201cON\u201d state (with ligand), indicating its affinity after activation.
kd_fmn: The dissociation constant for the RNA binding to the FMN ligand.
kd_no_fmn: The dissociation constant when no FMN ligand is present, indicating the RNA\u2019s behavior in a ligand-free state.
min_kd_val: The minimum KD value observed across different ligand-binding conditions.
ms2_aptamer: Indicates whether the RNA contains an MS2 aptamer, a motif that binds the MS2 viral coat protein.
lig_aptamer: A flag showing the presence of an aptamer that binds the ligand (e.g., FMN), demonstrating ligand-specific binding capability.
ms2_lig_aptamer: Indicates if the RNA contains both an MS2 aptamer and a ligand-binding aptamer, potentially allowing for multifaceted binding behavior.
log_kd_nolig: The logarithmic value of the dissociation constant without the ligand.
log_kd_lig: The logarithmic value of the dissociation constant with the ligand present.
log_kd_nolig_scaled: A normalized and scaled version of log_kd_nolig for easier comparison across conditions.
log_kd_lig_scaled: A normalized and scaled version of log_kd_lig for consistency in data comparisons.
log_AR: The logarithmic scale of the activation ratio, offering a standardized measure of activation strength.
folding_subscore: A numerical score indicating how well the RNA molecule folds into the predicted structure.
num_clusters: The number of distinct structural clusters or conformations predicted for the RNA, reflecting the complexity of the folding landscape.
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/eternabench-switch/#citation","title":"Citation","text":"BibTeX@article{waymentsteele2022rna,\n author = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},\n journal = {Nature Methods},\n month = oct,\n number = 10,\n pages = {1234--1242},\n publisher = {Springer Science and Business Media LLC},\n title = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},\n volume = 19,\n year = 2022\n}\n
"},{"location":"datasets/gencode/","title":"GENCODE","text":"GENCODE is a comprehensive annotation project that aims to provide high-quality annotations of the human and mouse genomes. The project is part of the ENCODE (ENCyclopedia Of DNA Elements) scale-up project, which seeks to identify all functional elements in the human genome.
"},{"location":"datasets/gencode/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the GENCODE by Paul Flicek, Roderic Guigo, Manolis Kellis, Mark Gerstein, Benedict Paten, Michael Tress, Jyoti Choudhary, et al.
The team releasing GENCODE did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/gencode/#dataset-description","title":"Dataset Description","text":"This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/gencode/#datasets","title":"Datasets","text":"The GENCODE dataset is available in Human and Mouse:
@article{frankish2023gencode,\n author = {Frankish, Adam and Carbonell-Sala, S{\\'\\i}lvia and Diekhans, Mark and Jungreis, Irwin and Loveland, Jane E and Mudge, Jonathan M and Sisu, Cristina and Wright, James C and Arnan, Carme and Barnes, If and Banerjee, Abhimanyu and Bennett, Ruth and Berry, Andrew and Bignell, Alexandra and Boix, Carles and Calvet, Ferriol and Cerd{\\'a}n-V{\\'e}lez, Daniel and Cunningham, Fiona and Davidson, Claire and Donaldson, Sarah and Dursun, Cagatay and Fatima, Reham and Giorgetti, Stefano and Giron, Carlos Garc{\\i}a and Gonzalez, Jose Manuel and Hardy, Matthew and Harrison, Peter W and Hourlier, Thibaut and Hollis, Zoe and Hunt, Toby and James, Benjamin and Jiang, Yunzhe and Johnson, Rory and Kay, Mike and Lagarde, Julien and Martin, Fergal J and G{\\'o}mez, Laura Mart{\\'\\i}nez and Nair, Surag and Ni, Pengyu and Pozo, Fernando and Ramalingam, Vivek and Ruffier, Magali and Schmitt, Bianca M and Schreiber, Jacob M and Steed, Emily and Suner, Marie-Marthe and Sumathipala, Dulika and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and Wass, Elizabeth and Yang, Yucheng T and Yates, Andrew and Zafrulla, Zahoor and Choudhary, Jyoti S and Gerstein, Mark and Guigo, Roderic and Hubbard, Tim J P and Kellis, Manolis and Kundaje, Anshul and Paten, Benedict and Tress, Michael L and Flicek, Paul},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D942--D949},\n publisher = {Oxford University Press (OUP)},\n title = {{GENCODE}: reference annotation for the human and mouse genomes in 2023},\n volume = 51,\n year = 2023\n}\n\n@article{frankish2021gencode,\n author = {Frankish, Adam and Diekhans, Mark and Jungreis, Irwin and Lagarde, Julien and Loveland, Jane E and Mudge, Jonathan M and Sisu, Cristina and Wright, James C and Armstrong, Joel and Barnes, If and Berry, Andrew and Bignell, Alexandra and Boix, Carles and Carbonell Sala, Silvia and Cunningham, Fiona and Di Domenico, Tom{\\'a}s and Donaldson, Sarah and Fiddes, Ian T and Garc{\\'\\i}a Gir{\\'o}n, Carlos and Gonzalez, Jose Manuel and Grego, Tiago and Hardy, Matthew and Hourlier, Thibaut and Howe, Kevin L and Hunt, Toby and Izuogu, Osagie G and Johnson, Rory and Martin, Fergal J and Mart{\\'\\i}nez, Laura and Mohanan, Shamika and Muir, Paul and Navarro, Fabio C P and Parker, Anne and Pei, Baikang and Pozo, Fernando and Riera, Ferriol Calvet and Ruffier, Magali and Schmitt, Bianca M and Stapleton, Eloise and Suner, Marie-Marthe and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and Wolf, Maxim Y and Xu, Jinuri and Yang, Yucheng T and Yates, Andrew and Zerbino, Daniel and Zhang, Yan and Choudhary, Jyoti S and Gerstein, Mark and Guig{\\'o}, Roderic and Hubbard, Tim J P and Kellis, Manolis and Paten, Benedict and Tress, Michael L and Flicek, Paul},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D916--D923},\n publisher = {Oxford University Press (OUP)},\n title = {{GENCODE} 2021},\n volume = 49,\n year = 2021\n}\n\n@article{frankish2019gencode,\n author = {Frankish, Adam and Diekhans, Mark and Ferreira, Anne-Maud and Johnson, Rory and Jungreis, Irwin and Loveland, Jane and Mudge, Jonathan M and Sisu, Cristina and Wright, James and Armstrong, Joel and Barnes, If and Berry, Andrew and Bignell, Alexandra and Carbonell Sala, Silvia and Chrast, Jacqueline and Cunningham, Fiona and Di Domenico, Tom{\\'a}s and Donaldson, Sarah and Fiddes, Ian T and Garc{\\'\\i}a Gir{\\'o}n, Carlos and Gonzalez, Jose Manuel and Grego, Tiago and Hardy, Matthew and Hourlier, Thibaut 
and Hunt, Toby and Izuogu, Osagie G and Lagarde, Julien and Martin, Fergal J and Mart{\\'\\i}nez, Laura and Mohanan, Shamika and Muir, Paul and Navarro, Fabio C P and Parker, Anne and Pei, Baikang and Pozo, Fernando and Ruffier, Magali and Schmitt, Bianca M and Stapleton, Eloise and Suner, Marie-Marthe and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and Xu, Jinuri and Yates, Andrew and Zerbino, Daniel and Zhang, Yan and Aken, Bronwen and Choudhary, Jyoti S and Gerstein, Mark and Guig{\\'o}, Roderic and Hubbard, Tim J P and Kellis, Manolis and Paten, Benedict and Reymond, Alexandre and Tress, Michael L and Flicek, Paul},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D766--D773},\n publisher = {Oxford University Press (OUP)},\n title = {{GENCODE} reference annotation for the human and mouse genomes},\n volume = 47,\n year = 2019\n}\n\n@article{mudge2015creating,\n author = {Mudge, Jonathan M and Harrow, Jennifer},\n copyright = {https://creativecommons.org/licenses/by/4.0},\n journal = {Mamm. Genome},\n language = {en},\n month = oct,\n number = {9-10},\n pages = {366--378},\n publisher = {Springer Science and Business Media LLC},\n title = {Creating reference gene annotation for the mouse {C57BL6/J} genome assembly},\n volume = 26,\n year = 2015\n}\n\n@article{harrow2012gencode,\n author = {Harrow, Jennifer and Frankish, Adam and Gonzalez, Jose M and Tapanari, Electra and Diekhans, Mark and Kokocinski, Felix and Aken, Bronwen L and Barrell, Daniel and Zadissa, Amonida and Searle, Stephen and Barnes, If and Bignell, Alexandra and Boychenko, Veronika and Hunt, Toby and Kay, Mike and Mukherjee, Gaurab and Rajan, Jeena and Despacio-Reyes, Gloria and Saunders, Gary and Steward, Charles and Harte, Rachel and Lin, Michael and Howald, C{\\'e}dric and Tanzer, Andrea and Derrien, Thomas and Chrast, Jacqueline and Walters, Nathalie and Balasubramanian, Suganthi and Pei, Baikang and Tress, Michael and Rodriguez, Jose Manuel and Ezkurdia, Iakes and van Baren, Jeltje and Brent, Michael and Haussler, David and Kellis, Manolis and Valencia, Alfonso and Reymond, Alexandre and Gerstein, Mark and Guig{\\'o}, Roderic and Hubbard, Tim J},\n journal = {Genome Research},\n month = sep,\n number = 9,\n pages = {1760--1774},\n title = {{GENCODE}: the reference human genome annotation for The {ENCODE} Project},\n volume = 22,\n year = 2012\n}\n\n@article{harrow2006gencode,\n author = {Harrow, Jennifer and Denoeud, France and Frankish, Adam and Reymond, Alexandre and Chen, Chao-Kung and Chrast, Jacqueline and Lagarde, Julien and Gilbert, James G R and Storey, Roy and Swarbreck, David and Rossier, Colette and Ucla, Catherine and Hubbard, Tim and Antonarakis, Stylianos E and Guigo, Roderic},\n journal = {Genome Biology},\n month = aug,\n number = {Suppl 1},\n pages = {S4.1--9},\n publisher = {Springer Nature},\n title = {{GENCODE}: producing a reference annotation for {ENCODE}},\n volume = {7 Suppl 1},\n year = 2006\n}\n
"},{"location":"datasets/rfam/","title":"Rfam","text":"Rfam is a database of structure-annotated multiple sequence alignments, covariance models and family annotation for a number of non-coding RNA, cis-regulatory and self-splicing intron families.
The seed alignments are hand curated and aligned using available sequence and structure data, and covariance models are built from these alignments using the INFERNAL v1.1.4 software suite.
The full regions list is created by searching the RFAMSEQ database using the covariance model, and then listing all hits above a family specific threshold to the model.
Rfam is maintained by a consortium of researchers at the European Bioinformatics Institute, Sean Eddy\u2019s laboratory and Eric Nawrocki.
"},{"location":"datasets/rfam/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the Rfam by Ioanna Kalvari, Eric P. Nawrocki, Sarah W. Burge, Paul P Gardner, Sam Griffiths-Jones, et al.
The team releasing Rfam did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/rfam/#dataset-description","title":"Dataset Description","text":"This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
Tip
The original Rfam dataset is licensed under the CC0 1.0 Universal license and is available at Rfam.
"},{"location":"datasets/rfam/#citation","title":"Citation","text":"BibTeX@article{kalvari2021rfam,\n author = {Kalvari, Ioanna and Nawrocki, Eric P and Ontiveros-Palacios, Nancy and Argasinska, Joanna and Lamkiewicz, Kevin and Marz, Manja and Griffiths-Jones, Sam and Toffano-Nioche, Claire and Gautheret, Daniel and Weinberg, Zasha and Rivas, Elena and Eddy, Sean R and Finn, Robert D and Bateman, Alex and Petrov, Anton I},\n copyright = {http://creativecommons.org/licenses/by/4.0/},\n journal = {Nucleic Acids Research},\n language = {en},\n month = jan,\n number = {D1},\n pages = {D192--D200},\n publisher = {Oxford University Press (OUP)},\n title = {Rfam 14: expanded coverage of metagenomic, viral and {microRNA} families},\n volume = 49,\n year = 2021\n}\n\n@article{hufsky2021computational,\n author = {Hufsky, Franziska and Lamkiewicz, Kevin and Almeida, Alexandre and Aouacheria, Abdel and Arighi, Cecilia and Bateman, Alex and Baumbach, Jan and Beerenwinkel, Niko and Brandt, Christian and Cacciabue, Marco and Chuguransky, Sara and Drechsel, Oliver and Finn, Robert D and Fritz, Adrian and Fuchs, Stephan and Hattab, Georges and Hauschild, Anne-Christin and Heider, Dominik and Hoffmann, Marie and H{\\\"o}lzer, Martin and Hoops, Stefan and Kaderali, Lars and Kalvari, Ioanna and von Kleist, Max and Kmiecinski, Ren{\\'o} and K{\\\"u}hnert, Denise and Lasso, Gorka and Libin, Pieter and List, Markus and L{\\\"o}chel, Hannah F and Martin, Maria J and Martin, Roman and Matschinske, Julian and McHardy, Alice C and Mendes, Pedro and Mistry, Jaina and Navratil, Vincent and Nawrocki, Eric P and O'Toole, {\\'A}ine Niamh and Ontiveros-Palacios, Nancy and Petrov, Anton I and Rangel-Pineros, Guillermo and Redaschi, Nicole and Reimering, Susanne and Reinert, Knut and Reyes, Alejandro and Richardson, Lorna and Robertson, David L and Sadegh, Sepideh and Singer, Joshua B and Theys, Kristof and Upton, Chris and Welzel, Marius and Williams, Lowri and Marz, Manja},\n copyright = {http://creativecommons.org/licenses/by/4.0/},\n journal = {Briefings in Bioinformatics},\n month = mar,\n number = 2,\n pages = {642--663},\n publisher = {Oxford University Press (OUP)},\n title = {Computational strategies to combat {COVID-19}: useful tools to accelerate {SARS-CoV-2} and coronavirus research},\n volume = 22,\n year = 2021\n}\n\n@article{kalvari2018noncoding,\n author = {Kalvari, Ioanna and Nawrocki, Eric P and Argasinska, Joanna and Quinones-Olvera, Natalia and Finn, Robert D and Bateman, Alex and Petrov, Anton I},\n journal = {Current Protocols in Bioinformatics},\n month = jun,\n number = 1,\n pages = {e51},\n title = {Non-coding {RNA} analysis using the rfam database},\n volume = 62,\n year = 2018\n}\n\n@article{kalvari2018rfam,\n author = {Kalvari, Ioanna and Argasinska, Joanna and Quinones-Olvera,\n Natalia and Nawrocki, Eric P and Rivas, Elena and Eddy, Sean R\n and Bateman, Alex and Finn, Robert D and Petrov, Anton I},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D335--D342},\n title = {Rfam 13.0: shifting to a genome-centric resource for non-coding {RNA} families},\n volume = 46,\n year = 2018\n}\n\n@article{nawrocki2015rfam,\n author = {Nawrocki, Eric P and Burge, Sarah W and Bateman, Alex and Daub, Jennifer and Eberhardt, Ruth Y and Eddy, Sean R and Floden, Evan W and Gardner, Paul P and Jones, Thomas A and Tate, John and Finn, Robert D},\n copyright = {http://creativecommons.org/licenses/by/4.0/},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {Database 
issue},\n pages = {D130--7},\n publisher = {Oxford University Press (OUP)},\n title = {Rfam 12.0: updates to the {RNA} families database},\n volume = 43,\n year = 2015\n}\n\n@article{burge2013rfam,\n author = {Burge, Sarah W and Daub, Jennifer and Eberhardt, Ruth and Tate, John and Barquist, Lars and Nawrocki, Eric P and Eddy, Sean R and Gardner, Paul P and Bateman, Alex},\n copyright = {http://creativecommons.org/licenses/by-nc/3.0/},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {Database issue},\n pages = {D226--32},\n publisher = {Oxford University Press (OUP)},\n title = {Rfam 11.0: 10 years of {RNA} families},\n volume = 41,\n year = 2013\n}\n\n@article{gardner2011rfam,\n author = {Gardner, Paul P and Daub, Jennifer and Tate, John and Moore, Benjamin L and Osuch, Isabelle H and Griffiths-Jones, Sam and Finn, Robert D and Nawrocki, Eric P and Kolbe, Diana L and Eddy, Sean R and Bateman, Alex},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {Database issue},\n pages = {D141--5},\n title = {Rfam: Wikipedia, clans and the ``decimal'' release},\n volume = 39,\n year = 2011\n}\n\n@article{gardner2009rfam,\n author = {Gardner, Paul P and Daub, Jennifer and Tate, John G and Nawrocki, Eric P and Kolbe, Diana L and Lindgreen, Stinus and Wilkinson, Adam C and Finn, Robert D and Griffiths-Jones, Sam and Eddy, Sean R and Bateman, Alex},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {Database issue},\n pages = {D136--40},\n title = {Rfam: updates to the {RNA} families database},\n volume = 37,\n year = 2009\n}\n\n@article{daub2008rna,\n author = {Daub, Jennifer and Gardner, Paul P and Tate, John and Ramsk{\\\"o}ld, Daniel and Manske, Magnus and Scott, William G and Weinberg, Zasha and Griffiths-Jones, Sam and Bateman, Alex},\n journal = {RNA},\n month = dec,\n number = 12,\n pages = {2462--2464},\n title = {The {RNA} {WikiProject}: community annotation of {RNA} families},\n volume = 14,\n year = 2008\n}\n\n@article{griffiths2005rfam,\n author = {Griffiths-Jones, Sam and Moxon, Simon and Marshall, Mhairi and Khanna, Ajay and Eddy, Sean R. and Bateman, Alex},\n doi = {10.1093/nar/gki081},\n eprint = {https://academic.oup.com/nar/article-pdf/33/suppl\\_1/D121/7622063/gki081.pdf},\n issn = {0305-1048},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {suppl_1},\n pages = {D121-D124},\n title = {{Rfam: annotating non-coding RNAs in complete genomes}},\n url = {https://doi.org/10.1093/nar/gki081},\n volume = {33},\n year = {2005}\n}\n\n@article{griffiths2003rfam,\n author = {Griffiths-Jones, Sam and Bateman, Alex and Marshall, Mhairi and Khanna, Ajay and Eddy, Sean R.},\n doi = {10.1093/nar/gkg006},\n eprint = {https://academic.oup.com/nar/article-pdf/31/1/439/7125749/gkg006.pdf},\n issn = {0305-1048},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {1},\n pages = {439-441},\n title = {{Rfam: an RNA family database}},\n url = {https://doi.org/10.1093/nar/gkg006},\n volume = {31},\n year = {2003}\n}\n
"},{"location":"datasets/rivas/","title":"RIVAS","text":"The RIVAS dataset is a curated collection of RNA sequences and their secondary structures, designed for training and evaluating RNA secondary structure prediction methods. The dataset combines sequences from published studies and databases like Rfam, covering diverse RNA families such as tRNA, SRP RNA, and ribozymes. The secondary structure data is obtained from experimentally verified structures and consensus structures from Rfam alignments, ensuring high-quality annotations for model training and evaluation.
"},{"location":"datasets/rivas/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the RIVAS dataset by Elena Rivas, et al.
The team releasing RIVAS did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/rivas/#dataset-description","title":"Dataset Description","text":"The converted dataset consists of the following columns, each providing specific information about the RNA secondary structures, consistent with the bpRNA standard:
id: A unique identifier for each RNA entry. This ID is derived from the original .sta
file name and serves as a reference to the specific RNA structure within the dataset.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA\u2019s standard:
.: Represents unpaired nucleotides.
( and ): Represent base pairs in standard stems (page 1).
[ and ]: Represent base pairs in pseudoknots (page 2).
{ and }: Represent base pairs in additional pseudoknots (page 3).
This dataset is available in three variants:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/rivas/#citation","title":"Citation","text":"BibTeX@article{rivas2012a,\n author = {Rivas, Elena and Lang, Raymond and Eddy, Sean R},\n journal = {RNA},\n month = feb,\n number = 2,\n pages = {193--212},\n publisher = {Cold Spring Harbor Laboratory},\n title = {A range of complex probabilistic models for {RNA} secondary structure prediction that includes the nearest-neighbor model and more},\n volume = 18,\n year = 2012\n}\n
"},{"location":"datasets/rnacentral/","title":"RNAcentral","text":"RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.
The development of RNAcentral is coordinated by the European Bioinformatics Institute and is supported by Wellcome. Initial funding was provided by BBSRC.
"},{"location":"datasets/rnacentral/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the RNAcentral by the RNAcentral Consortium.
The team releasing RNAcentral did not write this dataset card; it has been written by the MultiMolecule team.
"},{"location":"datasets/rnacentral/#dataset-description","title":"Dataset Description","text":"This dataset is available in five additional variants:
In addition to the main RNAcentral dataset, we also provide the following derived datasets:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
Tip
The original RNAcentral dataset is licensed under the CC0 1.0 Universal license and is available at RNAcentral.
"},{"location":"datasets/rnacentral/#citation","title":"Citation","text":"BibTeX@article{rnacentral2021,\n author = {{RNAcentral Consortium}},\n doi = {https://doi.org/10.1093/nar/gkaa921},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D212--D220},\n publisher = {Oxford University Press (OUP)},\n title = {{RNAcentral} 2021: secondary structure integration, improved sequence search and new member databases},\n url = {https://academic.oup.com/nar/article/49/D1/D212/5940500},\n volume = 49,\n year = 2021\n}\n\n@article{sweeney2020exploring,\n author = {Sweeney, Blake A. and Tagmazian, Arina A. and Ribas, Carlos E. and Finn, Robert D. and Bateman, Alex and Petrov, Anton I.},\n doi = {https://doi.org/10.1002/cpbi.104},\n eprint = {https://currentprotocols.onlinelibrary.wiley.com/doi/pdf/10.1002/cpbi.104},\n journal = {Current Protocols in Bioinformatics},\n keywords = {Galaxy, ncRNA, non-coding RNA, RNAcentral, RNA-seq},\n number = {1},\n pages = {e104},\n title = {Exploring Non-Coding RNAs in RNAcentral},\n url = {https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.104},\n volume = 71,\n year = 2020\n}\n\n@article{rnacentral2019,\n author = {{The RNAcentral Consortium}},\n doi = {https://doi.org/10.1093/nar/gky1034},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D221--D229},\n publisher = {Oxford University Press (OUP)},\n title = {{RNAcentral}: a hub of information for non-coding {RNA} sequences},\n url = {https://academic.oup.com/nar/article/47/D1/D221/5160993},\n volume = 47,\n year = 2019\n}\n\n@article{rnacentral2017,\n author = {{The RNAcentral Consortium} and Petrov, Anton I and Kay, Simon J E and Kalvari, Ioanna and Howe, Kevin L and Gray, Kristian A and Bruford, Elspeth A and Kersey, Paul J and Cochrane, Guy and Finn, Robert D and Bateman, Alex and Kozomara, Ana and Griffiths-Jones, Sam and Frankish, Adam and Zwieb, Christian W and Lau, Britney Y and Williams, Kelly P and Chan, Patricia Pand Lowe, Todd M and Cannone, Jamie J and Gutell, Robin and Machnicka, Magdalena A and Bujnicki, Janusz M and Yoshihama, Maki and Kenmochi, Naoya and Chai, Benli and Cole, James R and Szymanski, Maciej and Karlowski, Wojciech M and Wood, Valerie and Huala, Eva and Berardini, Tanya Z and Zhao, Yi and Chen, Runsheng and Zhu, Weimin and Paraskevopoulou, Maria D and Vlachos, Ioannis S and Hatzigeorgiou, Artemis G and Ma, Lina and Zhang, Zhang and Puetz, Joern and Stadler, Peter F and McDonald, Daniel and Basu, Siddhartha and Fey, Petra and Engel, Stacia R and Cherry, J Michael and Volders, Pieter-Jan and Mestdagh, Pieter and Wower, Jacek and Clark, Michael B and Quek, Xiu Cheng and Dinger, Marcel E},\n doi = {https://doi.org/10.1093/nar/gkw1008},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D128--D134},\n publisher = {Oxford University Press (OUP)},\n title = {{RNAcentral}: a comprehensive database of non-coding {RNA} sequences},\n url = {https://academic.oup.com/nar/article/45/D1/D128/2333921},\n volume = 45,\n year = 2017\n}\n\n@article{rnacentral2015,\n author = {{RNAcentral Consortium} and Petrov, Anton I and Kay, Simon J E and Gibson, Richard and Kulesha, Eugene and Staines, Dan and Bruford, Elspeth A and Wright, Mathew W and Burge, Sarah and Finn, Robert D and Kersey, Paul J and Cochrane, Guy and Bateman, Alex and Griffiths-Jones, Sam and Harrow, Jennifer and Chan, Patricia P and Lowe, Todd M and Zwieb, Christian W and Wower, Jacek and Williams, Kelly P and Hudson, Corey M and 
Gutell, Robin and Clark, Michael B and Dinger, Marcel and Quek, Xiu Cheng and Bujnicki, Janusz M and Chua, Nam-Hai and Liu, Jun and Wang, Huan and Skogerb{\\o}, Geir and Zhao, Yi and Chen, Runsheng and Zhu, Weimin and Cole, James R and Chai, Benli and Huang, Hsien-Da and Huang, His-Yuan and Cherry, J Michael and Hatzigeorgiou, Artemis and Pruitt, Kim D},\n doi = {https://doi.org/10.1093/nar/gku991},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {Database issue},\n pages = {D123--D129},\n title = {{RNAcentral}: an international database of {ncRNA} sequences},\n url = {https://academic.oup.com/nar/article/43/D1/D123/2439941},\n volume = 43,\n year = 2015\n}\n\n@article{bateman2011rnacentral,\n author = {Bateman, Alex and Agrawal, Shipra and Birney, Ewan and Bruford, Elspeth A and Bujnicki, Janusz M and Cochrane, Guy and Cole, James R and Dinger, Marcel E and Enright, Anton J and Gardner, Paul P and Gautheret, Daniel and Griffiths-Jones, Sam and Harrow, Jen and Herrero, Javier and Holmes, Ian H and Huang, Hsien-Da and Kelly, Krystyna A and Kersey, Paul and Kozomara, Ana and Lowe, Todd M and Marz, Manja and Moxon, Simon andPruitt, Kim D and Samuelsson, Tore and Stadler, Peter F and Vilella, Albert J and Vogel, Jan-Hinnerk and Williams, Kelly P and Wright, Mathew W and Zwieb, Christian},\n doi = {https://doi.org/10.1261/rna.2750811},\n journal = {RNA},\n month = nov,\n number = 11,\n pages = {1941--1946},\n publisher = {Cold Spring Harbor Laboratory},\n title = {{RNAcentral}: A vision for an international database of {RNA} sequences},\n url = {https://rnajournal.cshlp.org/content/17/11/1941.long},\n volume = 17,\n year = 2011\n}\n
"},{"location":"datasets/rnastralign/","title":"RNAStrAlign","text":"RNAStrAlign is a comprehensive dataset of RNA sequences and their secondary structures.
RNAStrAlign aggregates data from multiple established RNA structure repositories, covering diverse RNA families such as 5S ribosomal RNA, tRNA, and group I introns.
It is considered complementary to the ArchiveII dataset.
"},{"location":"datasets/rnastralign/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the RNAStrAlign by Zhen Tan, et al.
The team releasing RNAStrAlign did not write this dataset card; it has been written by the MultiMolecule team.
"},{"location":"datasets/rnastralign/#dataset-description","title":"Dataset Description","text":"id: A unique identifier for each RNA entry. This ID is derived from the family and the original .sta
file name, and serves as a reference to the specific RNA structure within the dataset.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA\u2019s standard:
.: Represents unpaired nucleotides.
( and ): Represent base pairs in standard stems (page 1).
family: The RNA family to which the sequence belongs, such as 16S rRNA, 5S rRNA, etc.
subfamily: A more specific subfamily within the family, such as Actinobacteria for 16S rRNA.
Not all families have subfamilies, in which case this field will be None
.
This dataset is available in two additional variants:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/rnastralign/#citation","title":"Citation","text":"BibTeX@article{ran2017turbofold,\n author = {Tan, Zhen and Fu, Yinghan and Sharma, Gaurav and Mathews, David H},\n journal = {Nucleic Acids Research},\n month = nov,\n number = 20,\n pages = {11570--11581},\n title = {{TurboFold} {II}: {RNA} structural alignment and secondary structure prediction informed by multiple homologs},\n volume = 45,\n year = 2017\n}\n
"},{"location":"datasets/ryos/","title":"RYOS","text":"RYOS is a database of RNA backbone stability in aqueous solution.
RYOS focuses on exploring the stability of mRNA molecules for vaccine applications. This dataset is part of a broader effort to address one of the key challenges of mRNA vaccines: degradation during shipping and storage.
"},{"location":"datasets/ryos/#statement","title":"Statement","text":"Deep learning models for predicting RNA degradation via dual crowdsourcing is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.
Machine learning has been at the forefront of the movement for free and open access to research.
We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.
The MultiMolecule team is committed to the principles of open access and open science.
We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.
Please consider signing the Statement on Nature Machine Intelligence.
"},{"location":"datasets/ryos/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the RYOS by Hannah K. Wayment-Steele, et al.
The team releasing RYOS did not write this dataset card; it has been written by the MultiMolecule team.
"},{"location":"datasets/ryos/#dataset-description","title":"Dataset Description","text":"id: A unique identifier for each RNA sequence entry.
design: The name given to each RNA design by contributors, used for easy reference.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA\u2019s standard:
.: Represents unpaired nucleotides.
( and ): Represent base pairs in standard stems (page 1).
[ and ]: Represent base pairs in pseudoknots (page 2).
{ and }: Represent base pairs in additional pseudoknots (page 3).
reactivity: A list of floating-point values that provide an estimate of the likelihood of the RNA backbone being cut at each nucleotide position. These values help determine the stability of the RNA structure under various experimental conditions.
deg_pH10 and deg_Mg_pH10: Arrays of degradation rates observed under two conditions: incubation at pH 10 without and with magnesium, respectively. These values provide insight into how different conditions affect the stability of RNA molecules.
deg_50C and deg_Mg_50C: Arrays of degradation rates after incubation at 50\u00b0C, without and with magnesium. These values capture how RNA sequences respond to elevated temperatures, which is relevant for storage and transportation conditions.
*_error_* Columns: Arrays of floating-point numbers indicating the experimental errors corresponding to the measurements in the reactivity and deg_ columns. These values help quantify the uncertainty in the degradation rates and reactivity measurements.
SN_filter: A filter applied to the dataset based on the signal-to-noise ratio, indicating whether a specific sequence meets the dataset\u2019s quality criteria.
If the SN_filter is True
, the sequence meets the quality criteria; otherwise, it does not.
Note that due to technical limitations, the ground truth measurements are not available for the final bases of each RNA sequence, resulting in a shorter length for the provided labels compared to the full sequence.
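The snippet below is an illustrative sketch (not part of the RYOS release or the MultiMolecule API; both helper names are ours) of how the columns described above can be consumed: a stack-based parser recovers base-pair indices from the dot-bracket secondary_structure string, and the shorter label arrays are padded up to the full sequence length.
Python
def parse_dot_bracket(structure):
    """Return sorted (i, j) index pairs for the '()', '[]' and '{}' brackets."""
    stacks = {"(": [], "[": [], "{": []}
    closers = {")": "(", "]": "[", "}": "{"}
    pairs = []
    for i, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(i)
        elif symbol in closers:
            pairs.append((stacks[closers[symbol]].pop(), i))
    return sorted(pairs)

def pad_labels(labels, sequence):
    """Pad reactivity/degradation labels with None so they align with the full sequence."""
    return list(labels) + [None] * (len(sequence) - len(labels))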
"},{"location":"datasets/ryos/#variations","title":"Variations","text":"This dataset is available in two subsets:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/ryos/#citation","title":"Citation","text":"BibTeX@article{waymentsteele2021deep,\n author = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Watkins, Andrew M and Kim, Do Soon and Tunguz, Bojan and Reade, Walter and Demkin, Maggie and Romano, Jonathan and Wellington-Oguri, Roger and Nicol, John J and Gao, Jiayang and Onodera, Kazuki and Fujikawa, Kazuki and Mao, Hanfei and Vandewiele, Gilles and Tinti, Michele and Steenwinckel, Bram and Ito, Takuya and Noumi, Taiga and He, Shujun and Ishi, Keiichiro and Lee, Youhan and {\\\"O}zt{\\\"u}rk, Fatih and Chiu, Anthony and {\\\"O}zt{\\\"u}rk, Emin and Amer, Karim and Fares, Mohamed and Participants, Eterna and Das, Rhiju},\n journal = {ArXiv},\n month = oct,\n title = {Deep learning models for predicting {RNA} degradation via dual crowdsourcing},\n year = 2021\n}\n
"},{"location":"models/","title":"models","text":"models
provide a collection of pre-trained models.
In the transformers
library, the names of model classes can sometimes be misleading. While these classes support both regression and classification tasks, their names often include xxxForSequenceClassification
, which may imply they are only for classification.
To avoid this ambiguity, MultiMolecule provides a set of model classes with clear, intuitive names that reflect their intended use:
multimolecule.AutoModelForSequencePrediction
: Sequence Predictionmultimolecule.AutoModelForTokenPrediction
: Token Predictionmultimolecule.AutoModelForContactPrediction
: Contact PredictionEach of these models supports both regression and classification tasks, offering flexibility and precision for a wide range of applications.
"},{"location":"models/#contact-prediction","title":"Contact Prediction","text":"Contact prediction assign a label to each pair of token in a sentence. One of the most common contact prediction tasks is protein distance map prediction. Protein distance map prediction attempts to find the distance between all possible amino acid residue pairs of a three-dimensional protein structure
"},{"location":"models/#nucleotide-prediction","title":"Nucleotide Prediction","text":"Similar to Token Classification, but removes the <bos>
token and the <eos>
token if they are defined in the model config.
<bos>
and <eos>
tokens
In tokenizers provided by MultiMolecule, <bos>
token is pointed to <cls>
token, and <sep>
token is pointed to <eos>
token.
multimolecule.AutoModel
s","text":"Pythonfrom transformers import AutoTokenizer\n\nfrom multimolecule import AutoModelForSequencePrediction\n\nmodel = AutoModelForSequencePrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"models/#direct-access","title":"Direct Access","text":"All models can be directly loaded with the from_pretrained
method.
from multimolecule.models import RnaFmForTokenPrediction, RnaTokenizer\n\nmodel = RnaFmForTokenPrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"models/#build-with-transformersautomodels","title":"Build with transformers.AutoModel
s","text":"While we use a different naming convention for model classes, the models are still registered to corresponding transformers.AutoModel
s.
from transformers import AutoModelForSequenceClassification, AutoTokenizer\n\nimport multimolecule # noqa: F401\n\nmodel = AutoModelForSequenceClassification.from_pretrained(\"multimolecule/mrnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/mrnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
import multimolecule
before use
Note that you must import multimolecule
before building the model using transformers.AutoModel
. The registration of models is done in the multimolecule
package, and the models are not available in the transformers
package.
The following error will be raised if you do not import multimolecule
before using transformers.AutoModel
:
ValueError: The checkpoint you are trying to load has model type `rnafm` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.\n
"},{"location":"models/#initialize-a-vanilla-model","title":"Initialize a vanilla model","text":"You can also initialize a vanilla model using the model class.
Pythonfrom multimolecule.models import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n\nconfig = RnaFmConfig()\nmodel = RnaFmForTokenPrediction(config)\ntokenizer = RnaTokenizer()\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"models/#available-models","title":"Available Models","text":""},{"location":"models/#deoxyribonucleic-acid-dna","title":"DeoxyriboNucleic Acid (DNA)","text":"Pre-trained model on protein-coding DNA (cDNA) using a masked language modeling (MLM) objective.
"},{"location":"models/calm/#statement","title":"Statement","text":"Codon language embeddings provide strong signals for use in protein engineering is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.
Machine learning has been at the forefront of the movement for free and open access to research.
We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.
The MultiMolecule team is committed to the principles of open access and open science.
We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.
Please consider signing the Statement on Nature Machine Intelligence.
"},{"location":"models/calm/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the Codon language embeddings provide strong signals for use in protein engineering by Carlos Outeiral and Charlotte M. Deane.
The OFFICIAL repository of CaLM is at oxpig/CaLM.
Warning
The MultiMolecule team is unable to confirm that the provided model and checkpoints are producing the same intermediate representations as the original implementation. This is because
The proposed method is published in a Closed Access / Author-Fee journal.
The team releasing CaLM did not write this model card for this model so this model card has been written by the MultiMolecule team.
"},{"location":"models/calm/#model-details","title":"Model Details","text":"CaLM is a bert-style model pre-trained on a large corpus of protein-coding DNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of DNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/calm/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 12 768 12 3072 85.75 22.36 11.17 1024"},{"location":"models/calm/#links","title":"Links","text":"The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/calm/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/calm\")\n>>> unmasker(\"agc<mask>cattatggcgaaccttggctgctg\")\n\n[{'score': 0.011160684749484062,\n 'token': 100,\n 'token_str': 'UUN',\n 'sequence': 'AGC UUN CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.01067513320595026,\n 'token': 117,\n 'token_str': 'NGC',\n 'sequence': 'AGC NGC CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.010549729689955711,\n 'token': 127,\n 'token_str': 'NNC',\n 'sequence': 'AGC NNC CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.0103579331189394,\n 'token': 51,\n 'token_str': 'CNA',\n 'sequence': 'AGC CNA CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.010322545655071735,\n 'token': 77,\n 'token_str': 'GNC',\n 'sequence': 'AGC GNC CAU UAU GGC GAA CCU UGG CUG CUG'}]\n
"},{"location":"models/calm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/calm/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, CaLmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmModel.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/calm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, CaLmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmForSequencePrediction.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/calm/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, CaLmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmForTokenPrediction.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/calm/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, CaLmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmForContactPrediction.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/calm/#training-details","title":"Training Details","text":"CaLM used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 25% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/calm/#training-data","title":"Training Data","text":"The CaLM model was pre-trained coding sequences of all organisms available on the European Nucleotide Archive (ENA). European Nucleotide Archive provides a comprehensive record of the world\u2019s nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.
CaLM collected coding sequences of all organisms from ENA on April 2022, including 114,214,475 sequences. Only high level assembly information (dataclass CON) were used. Sequences matching the following criteria were filtered out:
N
, Y
, R
)ATG
To reduce redundancy, CaLM grouped the entries by organism, and apply CD-HIT (CD-HIT-EST) with a cut-off at 40% sequence identity to the translated protein sequences.
The final dataset contains 9,858,385 cDNA sequences.
Note that the alphabet in the original implementation is RNA instead of DNA, therefore, we use RnaTokenizer
to tokenize the sequences. RnaTokenizer
of multimolecule
will convert \u201cU\u201ds to \u201cT\u201ds for you, you may disable this behaviour by passing replace_T_with_U=False
.
CaLM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
<mask>
.The model was trained on 4 NVIDIA Quadro RTX4000 GPUs with 8GiB memories.
BibTeX:
BibTeX@article {outeiral2022coodn,\n author = {Outeiral, Carlos and Deane, Charlotte M.},\n title = {Codon language embeddings provide strong signals for protein engineering},\n elocation-id = {2022.12.15.519894},\n year = {2022},\n doi = {10.1101/2022.12.15.519894},\n publisher = {Cold Spring Harbor Laboratory},\n abstract = {Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with recent models{\\textquoteright} capacities surpassing the size of the very datasets they were trained on. Here, we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks. In some tasks, like species recognition, prediction of protein and transcript abundance, or melting point estimation, we show that a language model trained on codons outperforms every other published protein language model, including some that contain over 50 times more parameters. These results suggest that, in addition to commonly studied scale and model complexity, the information content of biological data provides an orthogonal direction to improve the power of machine learning in biology.Competing Interest StatementThe authors have declared no competing interest.},\n URL = {https://www.biorxiv.org/content/early/2022/12/19/2022.12.15.519894},\n eprint = {https://www.biorxiv.org/content/early/2022/12/19/2022.12.15.519894.full.pdf},\n journal = {bioRxiv}\n}\n
"},{"location":"models/calm/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the CaLM paper for questions or comments on the paper/model.
"},{"location":"models/calm/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/calm/#multimolecule.models.calm","title":"multimolecule.models.calm","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig","title":"CaLmConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a CaLmModel
. It is used to instantiate a CaLM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the CaLM oxpig/CaLM architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint
Vocabulary size of the CaLM model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [CaLmModel
].
131
int
Dimensionality of the encoder layers and the pooler layer.
768
int
Number of hidden layers in the Transformer encoder.
12
int
Number of attention heads for each attention layer in the Transformer encoder.
12
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
3072
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
1026
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
str
Type of position embedding. Choose one of \"absolute\"
, \"relative_key\"
, \"relative_key_query\", \"rotary\"
. For positional embeddings use \"absolute\"
. For more information on \"relative_key\"
, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\"
, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
'rotary'
bool
Whether the model is used as a decoder or not. If False
, the model is used as an encoder.
False
bool
Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True
.
True
bool
Whether to apply layer normalization after embeddings but before the main stem of the network.
False
bool
When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
False
Examples:
Python Console Session>>> from multimolecule import CaLmModel, CaLmConfig\n>>> # Initializing a CaLM multimolecule/calm style configuration\n>>> configuration = CaLmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/calm style configuration\n>>> model = CaLmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/calm/configuration_calm.py
Pythonclass CaLmConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`CaLmModel`][multimolecule.models.CaLmModel]. It\n is used to instantiate a CaLM model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the CaLM\n [oxpig/CaLM](https://github.com/oxpig/CaLM) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the CaLM model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`CaLmModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n position_embedding_type:\n Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n is_decoder:\n Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n use_cache:\n Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n relevant if `config.is_decoder=True`.\n emb_layer_norm_before:\n Whether to apply layer normalization after embeddings but before the main stem of the network.\n token_dropout:\n When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n Examples:\n >>> from multimolecule import CaLmModel, CaLmConfig\n >>> # Initializing a CaLM multimolecule/calm style configuration\n >>> configuration = CaLmConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/calm style configuration\n >>> model = CaLmModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"calm\"\n\n def __init__(\n self,\n vocab_size: int = 131,\n codon: bool = True,\n hidden_size: int = 768,\n num_hidden_layers: int = 12,\n num_attention_heads: int = 12,\n intermediate_size: int = 3072,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1026,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"rotary\",\n is_decoder: bool = False,\n use_cache: bool = True,\n emb_layer_norm_before: bool = False,\n token_dropout: bool = False,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.codon = codon\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.emb_layer_norm_before = emb_layer_norm_before\n self.token_dropout = token_dropout\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(position_embedding_type)","title":"position_embedding_type
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(is_decoder)","title":"is_decoder
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(use_cache)","title":"use_cache
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(emb_layer_norm_before)","title":"emb_layer_norm_before
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(token_dropout)","title":"token_dropout
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmForContactPrediction","title":"CaLmForContactPrediction","text":" Bases: CaLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import CaLmConfig, CaLmForContactPrediction, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py
Pythonclass CaLmForContactPrediction(CaLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import CaLmConfig, CaLmForContactPrediction, RnaTokenizer\n >>> config = CaLmConfig()\n >>> model = CaLmForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: CaLmConfig):\n super().__init__(config)\n self.calm = CaLmModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.calm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmForMaskedLM","title":"CaLmForMaskedLM","text":" Bases: CaLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import CaLmConfig, CaLmForMaskedLM, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 131])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py
Pythonclass CaLmForMaskedLM(CaLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import CaLmConfig, CaLmForMaskedLM, RnaTokenizer\n >>> config = CaLmConfig()\n >>> model = CaLmForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 131])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: CaLmConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `CaLmForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.calm = CaLmModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config, self.calm.embeddings.word_embeddings.weight)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.calm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmForSequencePrediction","title":"CaLmForSequencePrediction","text":" Bases: CaLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import CaLmConfig, CaLmForSequencePrediction, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py
Pythonclass CaLmForSequencePrediction(CaLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import CaLmConfig, CaLmForSequencePrediction, RnaTokenizer\n >>> config = CaLmConfig()\n >>> model = CaLmForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: CaLmConfig):\n super().__init__(config)\n self.calm = CaLmModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.calm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmForTokenPrediction","title":"CaLmForTokenPrediction","text":" Bases: CaLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import CaLmConfig, CaLmForTokenPrediction, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py
Pythonclass CaLmForTokenPrediction(CaLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import CaLmConfig, CaLmForTokenPrediction, RnaTokenizer\n >>> config = CaLmConfig()\n >>> model = CaLmForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: CaLmConfig):\n super().__init__(config)\n self.calm = CaLmModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.calm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmModel","title":"CaLmModel","text":" Bases: CaLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import CaLmConfig, CaLmModel, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/calm/modeling_calm.py
Pythonclass CaLmModel(CaLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import CaLmConfig, CaLmModel, RnaTokenizer\n >>> config = CaLmConfig()\n >>> model = CaLmModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 768])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 768])\n \"\"\"\n\n def __init__(self, config: CaLmConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = CaLmEmbeddings(config)\n self.encoder = CaLmEncoder(config)\n self.pooler = CaLmPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
encoder_hidden_states (Tensor | None, default None): Shape: (batch_size, sequence_length, hidden_size). Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
encoder_attention_mask (Tensor | None, default None): Shape: (batch_size, sequence_length). Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default None): Tuple of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head). Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
use_cache (bool | None, default None): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
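The following is a minimal sketch of how these caching arguments interact, not an official recipe: it assumes CaLm accepts the standard is_decoder flag through its configuration (not stated on this page) and that the encoder supports key/value caching.
Pythonfrom multimolecule import CaLmConfig, CaLmModel, RnaTokenizer\n\n# Hypothetical decoder-style configuration; `is_decoder` is assumed to be forwarded to the base config.\nconfig = CaLmConfig(is_decoder=True)\nmodel = CaLmModel(config)\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n\noutput = model(**input, use_cache=True)\npast = output.past_key_values  # one cached (key, value) pair per layer, or None if caching is unsupported\nif past is not None:\n    print(len(past), past[0][0].shape)\n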
Source code in multimolecule/models/calm/modeling_calm.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
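As a usage note for the forward method shown above: it returns a model-output object by default, or a plain tuple of (last_hidden_state, pooler_output, ...) when return_dict=False. A minimal sketch:
Pythonfrom multimolecule import CaLmConfig, CaLmModel, RnaTokenizer\n\nmodel = CaLmModel(CaLmConfig())\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n\noutput = model(**input)  # model-output object with named fields\nsequence_output, pooled_output = model(**input, return_dict=False)[:2]  # plain tuple\nassert sequence_output.shape == output[\"last_hidden_state\"].shape\n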
"},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmPreTrainedModel","title":"CaLmPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/calm/modeling_calm.py
Pythonclass CaLmPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = CaLmConfig\n base_model_prefix = \"calm\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"CaLmLayer\", \"CaLmEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
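A minimal sketch of how this base class is typically used; the checkpoint name multimolecule/calm is an assumption here and is not taken from this page.
Pythonfrom multimolecule import CaLmConfig, CaLmModel\n\nmodel = CaLmModel(CaLmConfig())  # weights initialised by CaLmPreTrainedModel._init_weights\nprint(model.config_class is CaLmConfig, model.base_model_prefix)  # True calm\n# model = CaLmModel.from_pretrained(\"multimolecule/calm\")  # assumed checkpoint name; loads pretrained weights\n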
"},{"location":"models/configuration_utils/","title":"configuration_utils","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils","title":"multimolecule.models.configuration_utils","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig","title":"HeadConfig","text":" Bases: BaseHeadConfig
Configuration class for a prediction head.
Parameters:
num_labels (int | None): Number of labels to use in the last layer added to the model, typically for a classification task. Head should look for Config.num_labels if it is None.
problem_type (str | None): Problem type for XxxForYyyPrediction models. Can be one of \"binary\", \"regression\", \"multiclass\" or \"multilabel\". Head should look for Config.problem_type if it is None.
hidden_size (int | None): Dimensionality of the encoder layers and the pooler layer. Head should look for Config.hidden_size if it is None.
dropout (float): The dropout ratio for the hidden states.
transform (str | None): The transform operation applied to hidden states.
transform_act (str | None): The activation function of transform applied to hidden states.
bias (bool): Whether to apply bias to the final prediction layer.
act (str | None): The activation function of the final prediction output.
layer_norm_eps (float): The epsilon used by the layer normalization layers.
output_name (str | None): The name of the tensor required in model outputs. If it is None, will use the default output name of the corresponding head.
type (str | None): The type of the head in the model. This is used by MultiMoleculeModel to construct heads.
Source code in multimolecule/module/heads/config.py
Pythonclass HeadConfig(BaseHeadConfig):\n r\"\"\"\n Configuration class for a prediction head.\n\n Args:\n num_labels:\n Number of labels to use in the last layer added to the model, typically for a classification task.\n\n Head should look for [`Config.num_labels`][multimolecule.PreTrainedConfig] if is `None`.\n problem_type:\n Problem type for `XxxForYyyPrediction` models. Can be one of `\"binary\"`, `\"regression\"`,\n `\"multiclass\"` or `\"multilabel\"`.\n\n Head should look for [`Config.problem_type`][multimolecule.PreTrainedConfig] if is `None`.\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n\n Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n dropout:\n The dropout ratio for the hidden states.\n transform:\n The transform operation applied to hidden states.\n transform_act:\n The activation function of transform applied to hidden states.\n bias:\n Whether to apply bias to the final prediction layer.\n act:\n The activation function of the final prediction output.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n output_name:\n The name of the tensor required in model outputs.\n\n If is `None`, will use the default output name of the corresponding head.\n type:\n The type of the head in the model.\n\n This is used by [`MultiMoleculeModel`][multimolecule.MultiMoleculeModel] to construct heads.\n \"\"\"\n\n num_labels: Optional[int] = None\n problem_type: Optional[str] = None\n hidden_size: Optional[int] = None\n dropout: float = 0.0\n transform: Optional[str] = None\n transform_act: Optional[str] = \"gelu\"\n bias: bool = True\n act: Optional[str] = None\n layer_norm_eps: float = 1e-12\n output_name: Optional[str] = None\n type: Optional[str] = None\n
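As a minimal sketch of how these fields are typically supplied (not an exhaustive reference): configs such as ErnieRnaConfig, documented later on this page, accept a plain head mapping and wrap it in HeadConfig; whether a given head honours every field depends on the model.
Pythonfrom multimolecule import ErnieRnaConfig, ErnieRnaForTokenPrediction\n\nconfig = ErnieRnaConfig(head={\"num_labels\": 3, \"problem_type\": \"multiclass\", \"dropout\": 0.1})\nmodel = ErnieRnaForTokenPrediction(config)\nprint(model.head_config)  # resolved configuration used by the token prediction head\n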
"},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(num_labels)","title":"num_labels
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(problem_type)","title":"problem_type
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(dropout)","title":"dropout
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(transform)","title":"transform
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(transform_act)","title":"transform_act
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(bias)","title":"bias
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(act)","title":"act
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(output_name)","title":"output_name
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(type)","title":"type
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig","title":"MaskedLMHeadConfig","text":" Bases: BaseHeadConfig
Configuration class for a Masked Language Modeling head.
Parameters:
hidden_size (int | None): Dimensionality of the encoder layers and the pooler layer. Head should look for Config.hidden_size if it is None.
dropout (float): The dropout ratio for the hidden states.
transform (str | None): The transform operation applied to hidden states.
transform_act (str | None): The activation function of transform applied to hidden states.
bias (bool): Whether to apply bias to the final prediction layer.
act (str | None): The activation function of the final prediction output.
layer_norm_eps (float): The epsilon used by the layer normalization layers.
output_name (str | None): The name of the tensor required in model outputs. If it is None, will use the default output name of the corresponding head.
Source code in multimolecule/module/heads/config.py
Pythonclass MaskedLMHeadConfig(BaseHeadConfig):\n r\"\"\"\n Configuration class for a Masked Language Modeling head.\n\n Args:\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n\n Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n dropout:\n The dropout ratio for the hidden states.\n transform:\n The transform operation applied to hidden states.\n transform_act:\n The activation function of transform applied to hidden states.\n bias:\n Whether to apply bias to the final prediction layer.\n act:\n The activation function of the final prediction output.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n output_name:\n The name of the tensor required in model outputs.\n\n If is `None`, will use the default output name of the corresponding head.\n \"\"\"\n\n hidden_size: Optional[int] = None\n dropout: float = 0.0\n transform: Optional[str] = \"nonlinear\"\n transform_act: Optional[str] = \"gelu\"\n bias: bool = True\n act: Optional[str] = None\n layer_norm_eps: float = 1e-12\n output_name: Optional[str] = None\n
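Similarly, as a minimal sketch: ErnieRnaConfig accepts an lm_head mapping and wraps it in MaskedLMHeadConfig, so the masked language modeling head can be customised without constructing the dataclass directly.
Pythonfrom multimolecule import ErnieRnaConfig, ErnieRnaForMaskedLM\n\nconfig = ErnieRnaConfig(lm_head={\"transform\": \"nonlinear\", \"layer_norm_eps\": 1e-5})\nmodel = ErnieRnaForMaskedLM(config)\nprint(config.lm_head)  # MaskedLMHeadConfig built from the mapping above\n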
"},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(dropout)","title":"dropout
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(transform)","title":"transform
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(transform_act)","title":"transform_act
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(bias)","title":"bias
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(act)","title":"act
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(output_name)","title":"output_name
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.PreTrainedConfig","title":"PreTrainedConfig","text":" Bases: PretrainedConfig
Base class for all model configuration classes.
Source code in multimolecule/models/configuration_utils.py
Pythonclass PreTrainedConfig(PretrainedConfig):\n r\"\"\"\n Base class for all model configuration classes.\n \"\"\"\n\n head: HeadConfig | None\n num_labels: int = 1\n\n hidden_size: int\n\n pad_token_id: int = 0\n bos_token_id: int = 1\n eos_token_id: int = 2\n unk_token_id: int = 3\n mask_token_id: int = 4\n null_token_id: int = 5\n\n def __init__(\n self,\n pad_token_id: int = 0,\n bos_token_id: int = 1,\n eos_token_id: int = 2,\n unk_token_id: int = 3,\n mask_token_id: int = 4,\n null_token_id: int = 5,\n num_labels: int = 1,\n **kwargs,\n ):\n super().__init__(\n pad_token_id=pad_token_id,\n bos_token_id=bos_token_id,\n eos_token_id=eos_token_id,\n unk_token_id=unk_token_id,\n mask_token_id=mask_token_id,\n null_token_id=null_token_id,\n num_labels=num_labels,\n **kwargs,\n )\n\n def to_dict(self):\n output = super().to_dict()\n for k, v in output.items():\n if hasattr(v, \"to_dict\"):\n output[k] = v.to_dict()\n if is_dataclass(v):\n output[k] = asdict(v)\n return output\n
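A short sketch of the shared defaults defined above, using ErnieRnaConfig (a PreTrainedConfig subclass documented later on this page):
Pythonfrom multimolecule import ErnieRnaConfig\n\nconfig = ErnieRnaConfig()\nprint(config.pad_token_id, config.bos_token_id, config.eos_token_id, config.unk_token_id, config.mask_token_id)\n# 0 1 2 3 4\nprint(config.num_labels)  # 1 by default\n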
"},{"location":"models/ernierna/","title":"ERNIE-RNA","text":"Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.
"},{"location":"models/ernierna/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations by Weijie Yin, Zhaoyu Zhang, Liang He, et al.
The OFFICIAL repository of ERNIE-RNA is at Bruce-ywj/ERNIE-RNA.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing ERNIE-RNA did not write a model card for this model, so this model card has been written by the MultiMolecule team.
"},{"location":"models/ernierna/#model-details","title":"Model Details","text":"ERNIE-RNA is a bert-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/ernierna/#variations","title":"Variations","text":"multimolecule/ernierna
: The ERNIE-RNA model pre-trained on non-coding RNA sequences.multimolecule/ernierna-ss
: The ERNIE-RNA model fine-tuned on RNA secondary structure prediction.The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/ernierna/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/ernierna\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.32839149236679077,\n 'token': 6,\n 'token_str': 'A',\n 'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.3044775426387787,\n 'token': 9,\n 'token_str': 'U',\n 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.09914574027061462,\n 'token': 7,\n 'token_str': 'C',\n 'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.09502048045396805,\n 'token': 24,\n 'token_str': '-',\n 'sequence': 'G G U C - C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.06993662565946579,\n 'token': 21,\n 'token_str': '.',\n 'sequence': 'G G U C. C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/ernierna/#downstream-use","title":"Downstream Use","text":""},{"location":"models/ernierna/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, ErnieRnaModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaModel.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/ernierna/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, ErnieRnaForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaForSequencePrediction.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/ernierna/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, ErnieRnaForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaForTokenPrediction.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/ernierna/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, ErnieRnaForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaForContactPrediction.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/ernierna/#training-details","title":"Training Details","text":"ERNIE-RNA used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/ernierna/#training-data","title":"Training Data","text":"The ERNIE-RNA model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.
ERNIE-RNA applied CD-HIT (CD-HIT-EST) with a cut-off at 100% sequence identity to remove redundancy from RNAcentral, resulting in 25 million unique sequences. Sequences longer than 1024 nucleotides were subsequently excluded. The final dataset contains 20.4 million non-redundant RNA sequences. ERNIE-RNA preprocessed all tokens by replacing \u201cT\u201ds with \u201cU\u201ds.
Note that RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False
.
ERNIE-RNA used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
15% of the tokens are masked. In 80% of the cases, the masked tokens are replaced by <mask>. In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace. In the 10% remaining cases, the masked tokens are left as is
.The model was trained on 24 NVIDIA V100 GPUs with 32GiB memories.
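The following is a minimal sketch, not MultiMolecule's or ERNIE-RNA's actual training code, of how such BERT-style masked inputs and labels can be built with RnaTokenizer; the 80/10/10 replacement split mirrors the procedure described above.
Pythonimport torch\nfrom multimolecule import RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nencoding = tokenizer(\"UAGCUUAUCAGACUGAUGUUGA\", return_tensors=\"pt\")\ninput_ids = encoding[\"input_ids\"].clone()\nlabels = input_ids.clone()\n\n# select 15% of the non-special positions\nprobability = torch.full(labels.shape, 0.15)\nspecial = torch.tensor(\n    tokenizer.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True), dtype=torch.bool\n).unsqueeze(0)\nprobability.masked_fill_(special, 0.0)\nmasked = torch.bernoulli(probability).bool()\nlabels[~masked] = -100  # only masked positions contribute to the loss\n\n# 80% -> <mask>, 10% -> random token, 10% -> unchanged\nreplaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked\ninput_ids[replaced] = tokenizer.mask_token_id\nrandomised = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replaced\ninput_ids[randomised] = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)[randomised]\n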
BibTeX:
BibTeX@article {Yin2024.03.17.585376,\n author = {Yin, Weijie and Zhang, Zhaoyu and He, Liang and Jiang, Rui and Zhang, Shuo and Liu, Gan and Zhang, Xuegong and Qin, Tao and Xie, Zhen},\n title = {ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations},\n elocation-id = {2024.03.17.585376},\n year = {2024},\n doi = {10.1101/2024.03.17.585376},\n publisher = {Cold Spring Harbor Laboratory},\n abstract = {With large amounts of unlabeled RNA sequences data produced by high-throughput sequencing technologies, pre-trained RNA language models have been developed to estimate semantic space of RNA molecules, which facilities the understanding of grammar of RNA language. However, existing RNA language models overlook the impact of structure when modeling the RNA semantic space, resulting in incomplete feature extraction and suboptimal performance across various downstream tasks. In this study, we developed a RNA pre-trained language model named ERNIE-RNA (Enhanced Representations with base-pairing restriction for RNA modeling) based on a modified BERT (Bidirectional Encoder Representations from Transformers) by incorporating base-pairing restriction with no MSA (Multiple Sequence Alignment) information. We found that the attention maps from ERNIE-RNA with no fine-tuning are able to capture RNA structure in the zero-shot experiment more precisely than conventional methods such as fine-tuned RNAfold and RNAstructure, suggesting that the ERNIE-RNA can provide comprehensive RNA structural representations. Furthermore, ERNIE-RNA achieved SOTA (state-of-the-art) performance after fine-tuning for various downstream tasks, including RNA structural and functional predictions. In summary, our ERNIE-RNA model provides general features which can be widely and effectively applied in various subsequent research tasks. Our results indicate that introducing key knowledge-based prior information in the BERT framework may be a useful strategy to enhance the performance of other language models.Competing Interest StatementOne patent based on the study was submitted by Z.X. and W.Y., which is entitled as \"A Pre-training Approach for RNA Sequences and Its Applications\"(application number, no 202410262527.5). The remaining authors declare no competing interests.},\n URL = {https://www.biorxiv.org/content/early/2024/03/17/2024.03.17.585376},\n eprint = {https://www.biorxiv.org/content/early/2024/03/17/2024.03.17.585376.full.pdf},\n journal = {bioRxiv}\n}\n
"},{"location":"models/ernierna/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the ERNIE-RNA paper for questions or comments on the paper/model.
"},{"location":"models/ernierna/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna","title":"multimolecule.models.ernierna","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig","title":"ErnieRnaConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a ErnieRnaModel
. It is used to instantiate a ErnieRna model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ErnieRna Bruce-ywj/ERNIE-RNA architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint
Vocabulary size of the ErnieRna model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [ErnieRnaModel
].
26
int
Dimensionality of the encoder layers and the pooler layer.
768
int
Number of hidden layers in the Transformer encoder.
12
int
Number of attention heads for each attention layer in the Transformer encoder.
12
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
3072
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
1026
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
Examples:
Python Console Session>>> from multimolecule import ErnieRnaModel, ErnieRnaConfig\n>>> # Initializing a ERNIE-RNA multimolecule/ernierna style configuration\n>>> configuration = ErnieRnaConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/ernierna style configuration\n>>> model = ErnieRnaModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/ernierna/configuration_ernierna.py
Pythonclass ErnieRnaConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a\n [`ErnieRnaModel`][multimolecule.models.ErnieRnaModel]. It is used to instantiate a ErnieRna model according to the\n specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a\n similar configuration to that of the ErnieRna [Bruce-ywj/ERNIE-RNA](https://github.com/Bruce-ywj/ERNIE-RNA)\n architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the ErnieRna model. Defines the number of different tokens that can be represented by\n the `inputs_ids` passed when calling [`ErnieRnaModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n\n Examples:\n >>> from multimolecule import ErnieRnaModel, ErnieRnaConfig\n >>> # Initializing a ERNIE-RNA multimolecule/ernierna style configuration\n >>> configuration = ErnieRnaConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/ernierna style configuration\n >>> model = ErnieRnaModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"ernierna\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n hidden_size: int = 768,\n num_hidden_layers: int = 12,\n num_attention_heads: int = 12,\n intermediate_size: int = 3072,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1026,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"sinusoidal\",\n pairwise_alpha: float = 0.8,\n is_decoder: bool = False,\n use_cache: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.type_vocab_size = 2\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.pairwise_alpha = pairwise_alpha\n 
self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForContactClassification","title":"ErnieRnaForContactClassification","text":" Bases: ErnieRnaForPreTraining
Examples:
Python Console Session>>> from multimolecule.models import ErnieRnaConfig, ErnieRnaForContactClassification, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForContactClassification(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaForContactClassification(ErnieRnaForPreTraining):\n \"\"\"\n Examples:\n >>> from multimolecule.models import ErnieRnaConfig, ErnieRnaForContactClassification, RnaTokenizer\n >>> config = ErnieRnaConfig()\n >>> model = ErnieRnaForContactClassification(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n \"\"\"\n\n def __init__(self, config: ErnieRnaConfig):\n super().__init__(config)\n self.ss_head = ErnieRnaContactClassificationHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward( # type: ignore[override] # pylint: disable=W0221\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels_lm: Tensor | None = None,\n labels_ss: Tensor | None = None,\n output_attentions: bool | None = None,\n output_attention_biases: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ErnieRnaForContactClassificationOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n outputs = self.ernierna(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_attention_biases=output_attention_biases,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output_lm = self.lm_head(outputs, labels_lm)\n output_ss = self.ss_head(outputs[-1][-1], attention_mask, input_ids, labels_ss)\n logits_lm, loss_lm = output_lm.logits, output_lm.loss\n logits_ss, loss_ss = output_ss.logits, output_ss.loss\n\n loss = None\n if loss_lm is not None and loss_ss is not None:\n loss = loss_lm + loss_ss\n elif loss_lm is not None:\n loss = loss_lm\n elif loss_ss is not None:\n loss = loss_ss\n\n if not return_dict:\n output = outputs[2:]\n output = ((logits_ss, loss_ss) + output) if loss_ss is not None else ((logits_ss,) + output)\n output = ((logits_lm, loss_lm) + output) if loss_lm is not None else ((logits_lm,) + output)\n return ((loss,) + output) if loss is not None else output\n\n return ErnieRnaForContactClassificationOutput(\n loss=loss,\n logits_lm=logits_lm,\n loss_lm=loss_lm,\n logits_ss=logits_ss,\n loss_ss=loss_ss,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n attention_biases=outputs.attention_biases,\n )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForContactPrediction","title":"ErnieRnaForContactPrediction","text":" Bases: ErnieRnaPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import ErnieRnaConfig, ErnieRnaForContactPrediction, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaForContactPrediction(ErnieRnaPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import ErnieRnaConfig, ErnieRnaForContactPrediction, RnaTokenizer\n >>> config = ErnieRnaConfig()\n >>> model = ErnieRnaForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: ErnieRnaConfig):\n super().__init__(config)\n self.ernierna = ErnieRnaModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ErnieRnaContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.ernierna(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ErnieRnaContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForMaskedLM","title":"ErnieRnaForMaskedLM","text":" Bases: ErnieRnaPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import ErnieRnaConfig, ErnieRnaForMaskedLM, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaForMaskedLM(ErnieRnaPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import ErnieRnaConfig, ErnieRnaForMaskedLM, RnaTokenizer\n >>> config = ErnieRnaConfig()\n >>> model = ErnieRnaForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: ErnieRnaConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `BertForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.ernierna = ErnieRnaModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_output_embeddings(self):\n return self.lm_head.decoder\n\n def set_output_embeddings(self, new_embeddings):\n self.lm_head.decoder = new_embeddings\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_attention_biases: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ErnieRnaForMaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.ernierna(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_attention_biases=output_attention_biases,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ErnieRnaForMaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForSequencePrediction","title":"ErnieRnaForSequencePrediction","text":" Bases: ErnieRnaPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import ErnieRnaConfig, ErnieRnaForSequencePrediction, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaForSequencePrediction(ErnieRnaPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import ErnieRnaConfig, ErnieRnaForSequencePrediction, RnaTokenizer\n >>> config = ErnieRnaConfig()\n >>> model = ErnieRnaForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n \"\"\"\n\n def __init__(self, config: ErnieRnaConfig):\n super().__init__(config)\n self.ernierna = ErnieRnaModel(config)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_attention_biases: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ErnieRnaSequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.ernierna(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_attention_biases=output_attention_biases,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ErnieRnaSequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForTokenPrediction","title":"ErnieRnaForTokenPrediction","text":" Bases: ErnieRnaPreTrainedModel
Examples:
Python Console Session>>> import torch\n>>> from multimolecule import ErnieRnaConfig, ErnieRnaForTokenPrediction, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaForTokenPrediction(ErnieRnaPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import ErnieRnaConfig, ErnieRnaForTokenPrediction, RnaTokenizer\n >>> config = ErnieRnaConfig()\n >>> model = ErnieRnaForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: ErnieRnaConfig):\n super().__init__(config)\n self.num_labels = config.num_labels\n self.ernierna = ErnieRnaModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_attention_biases: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ErnieRnaTokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.ernierna(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_attention_biases=output_attention_biases,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ErnieRnaTokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel","title":"ErnieRnaModel","text":" Bases: ErnieRnaPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import ErnieRnaConfig, ErnieRnaModel, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaModel(ErnieRnaPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import ErnieRnaConfig, ErnieRnaModel, RnaTokenizer\n >>> config = ErnieRnaConfig()\n >>> model = ErnieRnaModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 768])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 768])\n \"\"\"\n\n pairwise_bias_map: Tensor\n\n def __init__(\n self, config: ErnieRnaConfig, add_pooling_layer: bool = True, tokenizer: PreTrainedTokenizer | None = None\n ):\n super().__init__(config)\n if tokenizer is None:\n tokenizer = AutoTokenizer.from_pretrained(\"multimolecule/rna\")\n self.tokenizer = tokenizer\n self.pad_token_id = tokenizer.pad_token_id\n self.vocab_size = len(self.tokenizer)\n if self.vocab_size != config.vocab_size:\n raise ValueError(\n f\"Vocab size in tokenizer ({self.vocab_size}) does not match the one in config ({config.vocab_size})\"\n )\n token_to_ids = self.tokenizer._token_to_id\n tokens = sorted(token_to_ids, key=token_to_ids.get)\n pairwise_bias_dict = get_pairwise_bias_dict(config.pairwise_alpha)\n self.register_buffer(\n \"pairwise_bias_map\",\n torch.tensor([[pairwise_bias_dict.get(f\"{i}{j}\", 0) for i in tokens] for j in tokens]),\n persistent=False,\n )\n self.pairwise_bias_proj = nn.Sequential(\n nn.Linear(1, config.num_attention_heads // 2),\n nn.GELU(),\n nn.Linear(config.num_attention_heads // 2, config.num_attention_heads),\n )\n self.embeddings = ErnieRnaEmbeddings(config)\n self.encoder = ErnieRnaEncoder(config)\n self.pooler = ErnieRnaPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def get_pairwise_bias(\n self, input_ids: Tensor | NestedTensor, attention_mask: Tensor | NestedTensor | None = None\n ) -> Tensor | NestedTensor:\n batch_size, seq_len = input_ids.shape\n\n # Broadcasting data indices to compute indices\n data_index_x = input_ids.unsqueeze(2).expand(batch_size, seq_len, seq_len)\n data_index_y = input_ids.unsqueeze(1).expand(batch_size, seq_len, seq_len)\n\n # Get bias from pairwise_bias_map\n return self.pairwise_bias_map[data_index_x, data_index_y]\n\n # Zhiyuan: Is it really necessary to mask the bias?\n # The mask position should have been nan, and the implementation is incorrect anyway\n # if attention_mask is not None:\n # attention_mask = attention_mask.unsqueeze(1).expand(batch_size, seq_len, seq_len)\n # bias = bias * attention_mask\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_attention_biases: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ErnieRnaModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n pairwise_bias = self.get_pairwise_bias(input_ids, attention_mask)\n attention_bias = self.pairwise_bias_proj(pairwise_bias.unsqueeze(-1)).transpose(1, 3)\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a 
self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n attention_bias=attention_bias,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_attention_biases=output_attention_biases,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return ErnieRnaModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attention_biases=encoder_outputs.attention_biases,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
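The pairwise attention bias above is built with an advanced-indexing trick: the token ids are broadcast along two axes so that entry (b, i, j) of the result looks up the bias for the tokens at positions i and j. Below is a minimal, self-contained sketch of that indexing step; the 4-token vocabulary and bias values are invented for illustration and are not the map produced by get_pairwise_bias_dict.
Python# Minimal sketch of the indexing used in get_pairwise_bias above.\n# The 4-token vocabulary and bias values are invented for illustration only.\nimport torch\n\npairwise_bias_map = torch.tensor(\n    [\n        [0.0, 2.0, 0.0, 0.0],\n        [2.0, 0.0, 0.0, 0.0],\n        [0.0, 0.0, 0.0, 3.0],\n        [0.0, 0.0, 3.0, 0.0],\n    ]\n)  # (vocab_size, vocab_size)\n\ninput_ids = torch.tensor([[0, 1, 2, 3, 2]])  # (batch_size, seq_len)\nbatch_size, seq_len = input_ids.shape\n\n# Broadcast ids along two axes so that position (b, i, j) holds the id pair (i, j)\nindex_x = input_ids.unsqueeze(2).expand(batch_size, seq_len, seq_len)\nindex_y = input_ids.unsqueeze(1).expand(batch_size, seq_len, seq_len)\n\n# Advanced indexing yields one (seq_len, seq_len) bias matrix per sequence\nbias = pairwise_bias_map[index_x, index_y]\nprint(bias.shape)  # torch.Size([1, 5, 5])\n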
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_attention_biases: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | ErnieRnaModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name Type Description DefaultTensor | None
Shape: (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
None
Tensor | None
Shape: (batch_size, sequence_length)
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]
:
None
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None
Tuple of length config.n_layers
with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1)
instead of all decoder_input_ids
of shape (batch_size, sequence_length)
.
None
bool | None
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see past_key_values
).
None
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_attention_biases: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | ErnieRnaModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n pairwise_bias = self.get_pairwise_bias(input_ids, attention_mask)\n attention_bias = self.pairwise_bias_proj(pairwise_bias.unsqueeze(-1)).transpose(1, 3)\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else 
inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n attention_bias=attention_bias,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_attention_biases=output_attention_biases,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return ErnieRnaModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attention_biases=encoder_outputs.attention_biases,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaPreTrainedModel","title":"ErnieRnaPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code inmultimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = ErnieRnaConfig\n base_model_prefix = \"ernierna\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"ErnieRnaLayer\", \"ErnieRnaEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
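Because every ErnieRna class documented above derives from this base, they all share the standard PreTrainedModel interface (from_pretrained, save_pretrained, configuration handling). A minimal sketch is shown below; the checkpoint id multimolecule/ernierna is an assumption used for illustration, substitute whichever checkpoint you actually use.
Python# Hedged sketch of the PreTrainedModel interface shared by the ErnieRna classes.\n# The checkpoint id \"multimolecule/ernierna\" is assumed here for illustration.\nfrom multimolecule import ErnieRnaModel, RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nmodel = ErnieRnaModel.from_pretrained(\"multimolecule/ernierna\")  # assumed checkpoint id\n\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\noutput = model(**input)\nprint(output[\"last_hidden_state\"].shape)\n\nmodel.save_pretrained(\"./ernierna-local\")  # standard PreTrainedModel method\n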
"},{"location":"models/modeling_outputs/","title":"modeling_outputs","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs","title":"multimolecule.models.modeling_outputs","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput","title":"SequencePredictorOutput dataclass
","text":" Bases: ModelOutput
Base class for outputs of sentence classification & regression models.
Parameters:
Name Type Description DefaultFloatTensor | None
torch.FloatTensor
of shape (1,)
.
Optional, returned when labels
is provided
None
FloatTensor
torch.FloatTensor
of shape (batch_size, config.num_labels)
Prediction outputs.
None
Tuple[FloatTensor, ...] | None
Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
None
Tuple[FloatTensor, ...] | None
Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Optional, returned when output_attentions=True
is passed or when config.output_attentions=True
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
None
Source code in multimolecule/models/modeling_outputs.py
Python@dataclass\nclass SequencePredictorOutput(ModelOutput):\n \"\"\"\n Base class for outputs of sentence classification & regression models.\n\n Args:\n loss:\n `torch.FloatTensor` of shape `(1,)`.\n\n Optional, returned when `labels` is provided\n logits:\n `torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`\n\n Prediction outputs.\n hidden_states:\n Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +\n one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.\n\n Optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True\n\n Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.\n attentions:\n Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,\n sequence_length)`.\n\n Optional, eturned when `output_attentions=True` is passed or when `config.output_attentions=True`\n\n Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n heads.\n \"\"\"\n\n loss: torch.FloatTensor | None = None\n logits: torch.FloatTensor = None\n hidden_states: Tuple[torch.FloatTensor, ...] | None = None\n attentions: Tuple[torch.FloatTensor, ...] | None = None\n
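These output classes behave like any Hugging Face ModelOutput: fields can be read as attributes, as mapping keys (the style used in the model examples throughout this documentation), or converted to a plain tuple. A minimal sketch with dummy tensors:
Python# Minimal sketch of consuming a SequencePredictorOutput, using dummy tensors.\nimport torch\n\nfrom multimolecule.models.modeling_outputs import SequencePredictorOutput\n\noutput = SequencePredictorOutput(loss=torch.tensor(0.7), logits=torch.randn(1, 2))\n\nprint(output.loss)                # attribute access\nprint(output[\"logits\"].shape)     # mapping-style access, as in the model examples\nloss, logits = output.to_tuple()  # positional access, mirroring return_dict=False\n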
"},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(loss)","title":"loss
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(logits)","title":"logits
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(hidden_states)","title":"hidden_states
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(attentions)","title":"attentions
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput","title":"TokenPredictorOutput dataclass
","text":" Bases: ModelOutput
Base class for outputs of token classification & regression models.
Parameters:
Name Type Description DefaultFloatTensor | None
torch.FloatTensor
of shape (1,)
.
Optional, returned when labels
is provided
None
FloatTensor
torch.FloatTensor
of shape (batch_size, sequence_length, config.num_labels)
Prediction outputs.
None
Tuple[FloatTensor, ...] | None
Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
None
Tuple[FloatTensor, ...] | None
Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Optional, returned when output_attentions=True
is passed or when config.output_attentions=True
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
None
Source code in multimolecule/models/modeling_outputs.py
Python@dataclass\nclass TokenPredictorOutput(ModelOutput):\n \"\"\"\n Base class for outputs of token classification & regression models.\n\n Args:\n loss:\n `torch.FloatTensor` of shape `(1,)`.\n\n Optional, returned when `labels` is provided\n logits:\n `torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`\n\n Prediction outputs.\n hidden_states:\n Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +\n one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.\n\n Optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True\n\n Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.\n attentions:\n Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,\n sequence_length)`.\n\n Optional, eturned when `output_attentions=True` is passed or when `config.output_attentions=True`\n\n Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n heads.\n \"\"\"\n\n loss: torch.FloatTensor | None = None\n logits: torch.FloatTensor = None\n hidden_states: Tuple[torch.FloatTensor, ...] | None = None\n attentions: Tuple[torch.FloatTensor, ...] | None = None\n
"},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(loss)","title":"loss
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(logits)","title":"logits
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(hidden_states)","title":"hidden_states
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(attentions)","title":"attentions
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput","title":"ContactPredictorOutput dataclass
","text":" Bases: ModelOutput
Base class for outputs of contact classification & regression models.
Parameters:
Name Type Description DefaultFloatTensor | None
torch.FloatTensor
of shape (1,)
.
Optional, returned when labels
is provided
None
FloatTensor
torch.FloatTensor
of shape (batch_size, sequence_length, sequence_length, config.num_labels)
Prediction outputs.
None
Tuple[FloatTensor, ...] | None
Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
None
Tuple[FloatTensor, ...] | None
Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Optional, returned when output_attentions=True
is passed or when config.output_attentions=True
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
None
Source code in multimolecule/models/modeling_outputs.py
Python@dataclass\nclass ContactPredictorOutput(ModelOutput):\n \"\"\"\n Base class for outputs of contact classification & regression models.\n\n Args:\n loss:\n `torch.FloatTensor` of shape `(1,)`.\n\n Optional, returned when `labels` is provided\n logits:\n `torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`\n\n Prediction outputs.\n hidden_states:\n Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +\n one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.\n\n Optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True\n\n Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.\n attentions:\n Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,\n sequence_length)`.\n\n Optional, eturned when `output_attentions=True` is passed or when `config.output_attentions=True`\n\n Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n heads.\n \"\"\"\n\n loss: torch.FloatTensor | None = None\n logits: torch.FloatTensor = None\n hidden_states: Tuple[torch.FloatTensor, ...] | None = None\n attentions: Tuple[torch.FloatTensor, ...] | None = None\n
"},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(loss)","title":"loss
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(logits)","title":"logits
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(hidden_states)","title":"hidden_states
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(attentions)","title":"attentions
","text":""},{"location":"models/rinalmo/","title":"RiNALMo","text":"Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.
"},{"location":"models/rinalmo/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks by Rafael Josip Peni\u0107, et al.
The OFFICIAL repository of RiNALMo is at lbcb-sci/RiNALMo.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing RiNALMo did not write this model card, so it has been written by the MultiMolecule team.
"},{"location":"models/rinalmo/#model-details","title":"Model Details","text":"RiNALMo is a bert-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/rinalmo/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 33 1280 20 5120 650.88 168.92 84.43 1022"},{"location":"models/rinalmo/#links","title":"Links","text":"multimolecule/rinalmo
The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/rinalmo/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rinalmo\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.3932918310165405,\n 'token': 6,\n 'token_str': 'A',\n 'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.2897723913192749,\n 'token': 9,\n 'token_str': 'U',\n 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.15423105657100677,\n 'token': 22,\n 'token_str': 'X',\n 'sequence': 'G G U C X C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.12160095572471619,\n 'token': 7,\n 'token_str': 'C',\n 'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.0408296100795269,\n 'token': 8,\n 'token_str': 'G',\n 'sequence': 'G G U C G C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rinalmo/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rinalmo/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, RiNALMoModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoModel.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rinalmo/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RiNALMoForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoForSequencePrediction.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rinalmo/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RiNALMoForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoForTokenPrediction.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rinalmo/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RiNALMoForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoForContactPrediction.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rinalmo/#training-details","title":"Training Details","text":"RiNALMo used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/rinalmo/#training-data","title":"Training Data","text":"The RiNALMo model was pre-trained on a cocktail of databases including RNAcentral, Rfam, Ensembl Genome Browser, and Nucleotide. The training data contains 36 million unique ncRNA sequences.
To ensure sequence diversity in each training batch, RiNALMo clustered the sequences with MMSeqs2 into 17 million clusters and then sampled each sequence in the batch from a different cluster.
RiNALMo preprocessed all tokens by replacing \u201cU\u201ds with \u201cT\u201ds.
Note that during model conversions, \u201cT\u201d is replaced with \u201cU\u201d. RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you, you may disable this behaviour by passing replace_T_with_U=False
.
RiNALMo used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
<mask>
.The model was trained on 7 NVIDIA A100 GPUs with 80GiB memories.
BibTeX:
BibTeX@article{penic2024rinalmo,\n title={RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks},\n author={Peni\u0107, Rafael Josip and Vla\u0161i\u0107, Tin and Huber, Roland G. and Wan, Yue and \u0160iki\u0107, Mile},\n journal={arXiv preprint arXiv:2403.00043},\n year={2024}\n}\n
"},{"location":"models/rinalmo/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the RiNALMo paper for questions or comments on the paper/model.
"},{"location":"models/rinalmo/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo","title":"multimolecule.models.rinalmo","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig","title":"RiNALMoConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a RiNALMoModel
. It is used to instantiate a RiNALMo model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RiNALMo lbcb-sci/RiNALMo architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint
Vocabulary size of the RiNALMo model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [RiNALMoModel
].
26
int
Dimensionality of the encoder layers and the pooler layer.
1280
int
Number of hidden layers in the Transformer encoder.
33
int
Number of attention heads for each attention layer in the Transformer encoder.
20
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
5120
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
1024
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
str
Type of position embedding. Choose one of \"absolute\"
, \"relative_key\"
, \"relative_key_query\", \"rotary\"
. For positional embeddings use \"absolute\"
. For more information on \"relative_key\"
, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\"
, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
'rotary'
bool
Whether the model is used as a decoder or not. If False
, the model is used as an encoder.
False
bool
Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True
.
True
bool
Whether to apply layer normalization after embeddings but before the main stem of the network.
True
bool
When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
True
Examples:
Python Console Session>>> from multimolecule import RiNALMoModel, RiNALMoConfig\n>>> # Initializing a RiNALMo multimolecule/rinalmo style configuration\n>>> configuration = RiNALMoConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rinalmo style configuration\n>>> model = RiNALMoModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rinalmo/configuration_rinalmo.py
Pythonclass RiNALMoConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`RiNALMoModel`][multimolecule.models.RiNALMoModel].\n It is used to instantiate a RiNALMo model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the RiNALMo\n [lbcb-sci/RiNALMo](https://github.com/lbcb-sci/RiNALMo) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the RiNALMo model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`RiNALMoModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n position_embedding_type:\n Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n is_decoder:\n Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n use_cache:\n Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n relevant if `config.is_decoder=True`.\n emb_layer_norm_before:\n Whether to apply layer normalization after embeddings but before the main stem of the network.\n token_dropout:\n When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n Examples:\n >>> from multimolecule import RiNALMoModel, RiNALMoConfig\n >>> # Initializing a RiNALMo multimolecule/rinalmo style configuration\n >>> configuration = RiNALMoConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/rinalmo style configuration\n >>> model = RiNALMoModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"rinalmo\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n hidden_size: int = 1280,\n num_hidden_layers: int = 33,\n num_attention_heads: int = 20,\n intermediate_size: int = 5120,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1024,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"rotary\",\n is_decoder: bool = False,\n use_cache: bool = True,\n emb_layer_norm_before: bool = True,\n learnable_beta: bool = True,\n token_dropout: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n self.vocab_size = vocab_size\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.learnable_beta = learnable_beta\n self.token_dropout = token_dropout\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n self.emb_layer_norm_before = emb_layer_norm_before\n
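Beyond the default configuration shown in the example, the same arguments can be used to build a much smaller model for quick smoke tests; the reduced sizes below are arbitrary illustration values and do not correspond to any released checkpoint.
Python# Hedged sketch: a deliberately small RiNALMo configuration for smoke tests.\n# The reduced sizes are arbitrary and do not match any released checkpoint.\nfrom multimolecule import RiNALMoConfig, RiNALMoModel\n\nconfig = RiNALMoConfig(\n    hidden_size=128,\n    num_hidden_layers=2,\n    num_attention_heads=4,\n    intermediate_size=256,\n)\nmodel = RiNALMoModel(config)\nprint(sum(parameter.numel() for parameter in model.parameters()))\n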
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(position_embedding_type)","title":"position_embedding_type
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(is_decoder)","title":"is_decoder
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(use_cache)","title":"use_cache
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(emb_layer_norm_before)","title":"emb_layer_norm_before
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(token_dropout)","title":"token_dropout
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForContactPrediction","title":"RiNALMoForContactPrediction","text":" Bases: RiNALMoPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RiNALMoConfig, RiNALMoForContactPrediction, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythonclass RiNALMoForContactPrediction(RiNALMoPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RiNALMoConfig, RiNALMoForContactPrediction, RnaTokenizer\n >>> config = RiNALMoConfig()\n >>> model = RiNALMoForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RiNALMoConfig):\n super().__init__(config)\n self.rinalmo = RiNALMoModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rinalmo(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForMaskedLM","title":"RiNALMoForMaskedLM","text":" Bases: RiNALMoPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RiNALMoConfig, RiNALMoForMaskedLM, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythonclass RiNALMoForMaskedLM(RiNALMoPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RiNALMoConfig, RiNALMoForMaskedLM, RnaTokenizer\n >>> config = RiNALMoConfig()\n >>> model = RiNALMoForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: RiNALMoConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `RiNALMoForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.rinalmo = RiNALMoModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rinalmo(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForSequencePrediction","title":"RiNALMoForSequencePrediction","text":" Bases: RiNALMoPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RiNALMoConfig, RiNALMoForSequencePrediction, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythonclass RiNALMoForSequencePrediction(RiNALMoPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RiNALMoConfig, RiNALMoForSequencePrediction, RnaTokenizer\n >>> config = RiNALMoConfig()\n >>> model = RiNALMoForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RiNALMoConfig):\n super().__init__(config)\n self.rinalmo = RiNALMoModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rinalmo(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForTokenPrediction","title":"RiNALMoForTokenPrediction","text":" Bases: RiNALMoPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RiNALMoConfig, RiNALMoForTokenPrediction, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythonclass RiNALMoForTokenPrediction(RiNALMoPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RiNALMoConfig, RiNALMoForTokenPrediction, RnaTokenizer\n >>> config = RiNALMoConfig()\n >>> model = RiNALMoForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RiNALMoConfig):\n super().__init__(config)\n self.rinalmo = RiNALMoModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rinalmo(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel","title":"RiNALMoModel","text":" Bases: RiNALMoPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RiNALMoConfig, RiNALMoModel, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 1280])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 1280])\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythonclass RiNALMoModel(RiNALMoPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RiNALMoConfig, RiNALMoModel, RnaTokenizer\n >>> config = RiNALMoConfig()\n >>> model = RiNALMoModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 1280])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 1280])\n \"\"\"\n\n def __init__(self, config: RiNALMoConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = RiNALMoEmbeddings(config)\n self.encoder = RiNALMoEncoder(config)\n self.pooler = RiNALMoPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
encoder_hidden_states (Tensor | None, default None): Shape: (batch_size, sequence_length, hidden_size). Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
encoder_attention_mask (Tensor | None, default None): Shape: (batch_size, sequence_length). Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default None): Tuple of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head). Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
use_cache (bool | None, default None): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
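As a quick illustration of the output_attentions and output_hidden_states flags documented above, here is a minimal sketch using a deliberately small, randomly initialised configuration; the configuration values are arbitrary and the printed lengths follow the usual Transformers convention, so this is for shape inspection only.
Pythonfrom multimolecule import RiNALMoConfig, RiNALMoModel, RnaTokenizer\n\n# Small, randomly initialised model; for shape inspection only\nconfig = RiNALMoConfig(num_hidden_layers=2, hidden_size=64, num_attention_heads=4, intermediate_size=128)\nmodel = RiNALMoModel(config)\ntokenizer = RnaTokenizer.from_pretrained("multimolecule/rna")\ninput = tokenizer("ACGUN", return_tensors="pt")\noutput = model(**input, output_hidden_states=True, output_attentions=True)\n# hidden_states typically holds one tensor per layer plus the embedding output; attentions one per layer\nprint(len(output.hidden_states), len(output.attentions))\n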
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoPreTrainedModel","title":"RiNALMoPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythonclass RiNALMoPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = RiNALMoConfig\n base_model_prefix = \"rinalmo\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"RiNALMoLayer\", \"RiNALMoEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
"},{"location":"models/rnabert/","title":"RNABERT","text":"Pre-trained model on non-coding RNA (ncRNA) using masked language modeling (MLM) and structural alignment learning (SAL) objectives.
"},{"location":"models/rnabert/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the Informative RNA-base embedding for functional RNA clustering and structural alignment by Manato Akiyama and Yasubumi Sakakibara.
The OFFICIAL repository of RNABERT is at mana438/RNABERT.
Caution
The MultiMolecule team is aware of a potential risk in reproducing the results of RNABERT.
The original implementation of RNABERT does not prepend <cls>
and append <eos>
tokens to the input sequence. This should not affect the performance of the model in most cases, but it can lead to unexpected behavior in some cases.
Please set cls_token=None
and eos_token=None
explicitly in the tokenizer if you want the exact behavior of the original implementation.
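For instance, a minimal sketch of loading the tokenizer this way (this assumes, as the note above implies, that from_pretrained forwards these keyword arguments to the tokenizer):
Pythonfrom multimolecule import RnaTokenizer\n\n# Drop <cls>/<eos> so that tokenized inputs match the original RNABERT implementation\ntokenizer = RnaTokenizer.from_pretrained("multimolecule/rnabert", cls_token=None, eos_token=None)\ninput_ids = tokenizer("ACGU")["input_ids"]  # expected: no <cls>/<eos> ids around the sequence\n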
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing RNABERT did not write a model card for this model, so this model card has been written by the MultiMolecule team.
"},{"location":"models/rnabert/#model-details","title":"Model Details","text":"RNABERT is a bert-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/rnabert/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 6 120 12 40 0.48 0.15 0.08 440"},{"location":"models/rnabert/#links","title":"Links","text":"The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/rnabert/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnabert\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.03852083534002304,\n 'token': 24,\n 'token_str': '-',\n 'sequence': 'G G U C - C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.03851056098937988,\n 'token': 10,\n 'token_str': 'N',\n 'sequence': 'G G U C N C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.03849703073501587,\n 'token': 25,\n 'token_str': 'I',\n 'sequence': 'G G U C I C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.03848597779870033,\n 'token': 3,\n 'token_str': '<unk>',\n 'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.038484156131744385,\n 'token': 5,\n 'token_str': '<null>',\n 'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnabert/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnabert/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, RnaBertModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertModel.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
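To actually extract the features from the returned object, a minimal follow-up sketch (this assumes the base model exposes the usual last_hidden_state, and pooler_output when its pooling layer is present, which is the default):
Pythonfrom multimolecule import RnaTokenizer, RnaBertModel\n\ntokenizer = RnaTokenizer.from_pretrained("multimolecule/rnabert")\nmodel = RnaBertModel.from_pretrained("multimolecule/rnabert")\n\noutput = model(**tokenizer("UAGCUUAUCAGACUGAUGUUGA", return_tensors="pt"))\ntoken_embeddings = output.last_hidden_state  # per-nucleotide features: (batch_size, seq_len, hidden_size)\nsequence_embedding = output.pooler_output  # pooled sequence-level feature: (batch_size, hidden_size)\n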
"},{"location":"models/rnabert/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertForSequencePrediction.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnabert/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaBertForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertForTokenPrediction.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnabert/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaBertForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertForContactPrediction.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnabert/#training-details","title":"Training Details","text":"RNABERT has two pre-training objectives: masked language modeling (MLM) and structural alignment learning (SAL).
The RNABERT model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.
RNABERT used a subset of 76,237 human ncRNA sequences from RNAcentral for pre-training. RNABERT preprocessed all tokens by replacing “U”s with “T”s.
Note that during model conversions, \u201cT\u201d is replaced with \u201cU\u201d. RnaTokenizer
will convert “T”s to “U”s for you; you may disable this behaviour by passing replace_T_with_U=False
.
RNABERT preprocesses the dataset by applying 10 different mask patterns to the 72,237 human ncRNA sequences. The final dataset contains 722,370 sequences. The masking procedure is similar to the one used in BERT:
15% of the tokens are masked.
In 80% of the cases, the masked tokens are replaced by <mask>.
In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
In the 10% remaining cases, the masked tokens are left as is.
The model was trained on 1 NVIDIA V100 GPU.
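The list above is standard BERT-style masking; the following is an illustrative PyTorch sketch of that procedure, not RNABERT's actual preprocessing code, and the token ids passed in are placeholders.
Pythonimport torch\n\n\ndef bert_style_mask(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int, mlm_probability: float = 0.15):\n    # Select ~15% of positions for prediction\n    labels = input_ids.clone()\n    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_probability)).bool()\n    labels[~masked] = -100  # loss is only computed on masked positions\n    corrupted = input_ids.clone()\n    # 80% of the selected positions become <mask>\n    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked\n    corrupted[replaced] = mask_token_id\n    # half of the remainder (10% overall) become a random token; the rest are left unchanged\n    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replaced\n    corrupted[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]\n    return corrupted, labels\n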
"},{"location":"models/rnabert/#citation","title":"Citation","text":"BibTeX:
BibTeX@article{akiyama2022informative,\n author = {Akiyama, Manato and Sakakibara, Yasubumi},\n title = \"{Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning}\",\n journal = {NAR Genomics and Bioinformatics},\n volume = {4},\n number = {1},\n pages = {lqac012},\n year = {2022},\n month = {02},\n abstract = \"{Effective embedding is actively conducted by applying deep learning to biomolecular information. Obtaining better embeddings enhances the quality of downstream analyses, such as DNA sequence motif detection and protein function prediction. In this study, we adopt a pre-training algorithm for the effective embedding of RNA bases to acquire semantically rich representations and apply this algorithm to two fundamental RNA sequence problems: structural alignment and clustering. By using the pre-training algorithm to embed the four bases of RNA in a position-dependent manner using a large number of RNA sequences from various RNA families, a context-sensitive embedding representation is obtained. As a result, not only base information but also secondary structure and context information of RNA sequences are embedded for each base. We call this \u2018informative base embedding\u2019 and use it to achieve accuracies superior to those of existing state-of-the-art methods on RNA structural alignment and RNA family clustering tasks. Furthermore, upon performing RNA sequence alignment by combining this informative base embedding with a simple Needleman\u2013Wunsch alignment algorithm, we succeed in calculating structural alignments with a time complexity of O(n2) instead of the O(n6) time complexity of the naive implementation of Sankoff-style algorithm for input RNA sequence of length n.}\",\n issn = {2631-9268},\n doi = {10.1093/nargab/lqac012},\n url = {https://doi.org/10.1093/nargab/lqac012},\n eprint = {https://academic.oup.com/nargab/article-pdf/4/1/lqac012/42577168/lqac012.pdf},\n}\n
"},{"location":"models/rnabert/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the RNABERT paper for questions or comments on the paper/model.
"},{"location":"models/rnabert/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert","title":"multimolecule.models.rnabert","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
alphabet (Alphabet | str | List[str] | None, default None): alphabet to use for tokenization. If None, the standard RNA alphabet will be used. If a string, it should correspond to the name of a predefined alphabet; the options include standard, extended, streamline and nucleobase. If an alphabet or a list of characters, that specific alphabet will be used.
nmers (int, default 1): Size of kmer to tokenize.
codon (bool, default False): Whether to tokenize into codons.
replace_T_with_U (bool, default True): Whether to replace T with U.
do_upper_case (bool, default True): Whether to convert input to uppercase.
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig","title":"RnaBertConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a RnaBertModel
. It is used to instantiate a RnaBert model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaBert mana438/RNABERT architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
vocab_size (int, default 26): Vocabulary size of the RnaBert model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling [RnaBertModel].
hidden_size (int | None, default None): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, default 6): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, default 12): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, default 40): Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_dropout (float, default 0.0): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (float, default 0.0): The dropout ratio for the attention probabilities.
max_position_embeddings (int, default 440): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
initializer_range (float, default 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, default 1e-12): The epsilon used by the layer normalization layers.
Examples:
Python Console Session>>> from multimolecule import RnaBertModel, RnaBertConfig\n>>> # Initializing a RNABERT multimolecule/rnabert style configuration\n>>> configuration = RnaBertConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnabert style configuration\n>>> model = RnaBertModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnabert/configuration_rnabert.py
Pythonclass RnaBertConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`RnaBertModel`][multimolecule.models.RnaBertModel].\n It is used to instantiate a RnaBert model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaBert\n [mana438/RNABERT](https://github.com/mana438/RNABERT) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the RnaBert model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`RnaBertModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n\n Examples:\n >>> from multimolecule import RnaBertModel, RnaBertConfig\n >>> # Initializing a RNABERT multimolecule/rnabert style configuration\n >>> configuration = RnaBertConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/rnabert style configuration\n >>> model = RnaBertModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"rnabert\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n ss_vocab_size: int = 8,\n hidden_size: int | None = None,\n multiple: int | None = None,\n num_hidden_layers: int = 6,\n num_attention_heads: int = 12,\n intermediate_size: int = 40,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.0,\n attention_dropout: float = 0.0,\n max_position_embeddings: int = 440,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"absolute\",\n is_decoder: bool = False,\n use_cache: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n if hidden_size is None:\n hidden_size = num_attention_heads * multiple if multiple is not None else 120\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.ss_vocab_size = ss_vocab_size\n self.type_vocab_size = 2\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n 
self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
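A small sketch of how hidden_size is derived when it is left unset, following the __init__ logic shown above (the multiple value here is arbitrary and only for illustration):
Pythonfrom multimolecule import RnaBertConfig\n\nconfig = RnaBertConfig()  # hidden_size defaults to 120 when neither hidden_size nor multiple is given\nprint(config.hidden_size)  # 120\n\nconfig = RnaBertConfig(multiple=16)  # hidden_size = num_attention_heads * multiple = 12 * 16\nprint(config.hidden_size)  # 192\n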
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForContactPrediction","title":"RnaBertForContactPrediction","text":" Bases: RnaBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaBertConfig, RnaBertForContactPrediction, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertForContactPrediction(RnaBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaBertConfig, RnaBertForContactPrediction, RnaTokenizer\n >>> config = RnaBertConfig()\n >>> model = RnaBertForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaBertConfig):\n super().__init__(config)\n self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnabert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForMaskedLM","title":"RnaBertForMaskedLM","text":" Bases: RnaBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaBertConfig, RnaBertForMaskedLM, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertForMaskedLM(RnaBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaBertConfig, RnaBertForMaskedLM, RnaTokenizer\n >>> config = RnaBertConfig()\n >>> model = RnaBertForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: RnaBertConfig):\n super().__init__(config)\n self.rnabert = RnaBertModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool = False,\n output_hidden_states: bool = False,\n return_dict: bool = True,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnabert(\n input_ids,\n attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForPreTraining","title":"RnaBertForPreTraining","text":" Bases: RnaBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaBertConfig, RnaBertForPreTraining, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits_mlm\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"logits_ss\"].shape\ntorch.Size([1, 7, 8])\n>>> output[\"logits_sal\"].shape\ntorch.Size([1, 2])\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertForPreTraining(RnaBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaBertConfig, RnaBertForPreTraining, RnaTokenizer\n >>> config = RnaBertConfig()\n >>> model = RnaBertForPreTraining(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<AddBackward0>)\n >>> output[\"logits_mlm\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"logits_ss\"].shape\n torch.Size([1, 7, 8])\n >>> output[\"logits_sal\"].shape\n torch.Size([1, 2])\n \"\"\"\n\n _tied_weights_keys = [\n \"lm_head.decoder.weight\",\n \"lm_head.decoder.bias\",\n \"pretrain.predictions.decoder.weight\",\n \"pretrain.predictions.decoder.bias\",\n \"pretrain.predictions_ss.decoder.weight\",\n \"pretrain.predictions_ss.decoder.bias\",\n ]\n\n def __init__(self, config: RnaBertConfig):\n super().__init__(config)\n self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n self.pretrain = RnaBertPreTrainingHeads(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n labels_mlm: Tensor | None = None,\n labels_ss: Tensor | None = None,\n labels_sal: Tensor | None = None,\n output_attentions: bool = False,\n output_hidden_states: bool = False,\n return_dict: bool = True,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaBertForPreTrainingOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnabert(\n input_ids,\n attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n total_loss, logits_mlm, logits_ss, logits_sal = self.pretrain(\n outputs, labels_mlm=labels_mlm, labels_ss=labels_ss, labels_sal=labels_sal\n )\n\n if not return_dict:\n output = (logits_mlm, logits_ss, logits_sal) + outputs[2:]\n return ((total_loss,) + output) if total_loss is not None else output\n\n return RnaBertForPreTrainingOutput(\n loss=total_loss,\n logits_mlm=logits_mlm,\n logits_ss=logits_ss,\n logits_sal=logits_sal,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForSequencePrediction","title":"RnaBertForSequencePrediction","text":" Bases: RnaBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaBertConfig, RnaBertForSequencePrediction, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertForSequencePrediction(RnaBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaBertConfig, RnaBertForSequencePrediction, RnaTokenizer\n >>> config = RnaBertConfig()\n >>> model = RnaBertForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaBertConfig):\n super().__init__(config)\n self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnabert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForTokenPrediction","title":"RnaBertForTokenPrediction","text":" Bases: RnaBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaBertConfig, RnaBertForTokenPrediction, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertForTokenPrediction(RnaBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaBertConfig, RnaBertForTokenPrediction, RnaTokenizer\n >>> config = RnaBertConfig()\n >>> model = RnaBertForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaBertConfig):\n super().__init__(config)\n self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnabert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel","title":"RnaBertModel","text":" Bases: RnaBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaBertConfig, RnaBertModel, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 120])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 120])\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertModel(RnaBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaBertConfig, RnaBertModel, RnaTokenizer\n >>> config = RnaBertConfig()\n >>> model = RnaBertModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 120])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 120])\n \"\"\"\n\n def __init__(self, config: RnaBertConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = RnaBertEmbeddings(config)\n self.encoder = RnaBertEncoder(config)\n self.pooler = RnaBertPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name Type Description DefaultTensor | None
Shape: (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
None
Tensor | None
Shape: (batch_size, sequence_length)
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
None
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None
Tuple of length config.n_layers
with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1)
instead of all decoder_input_ids
of shape (batch_size, sequence_length)
.
None
bool | None
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see past_key_values
).
None
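To make these arguments concrete, here is a small usage sketch (a sketch only, assuming multimolecule is installed; it uses a randomly initialised RnaBertModel as in the example above). The attention_mask produced by padding follows the 1 = not masked, 0 = masked convention described in the table above.
Pythonfrom multimolecule import RnaBertConfig, RnaBertModel, RnaTokenizer

tokenizer = RnaTokenizer.from_pretrained("multimolecule/rna")
model = RnaBertModel(RnaBertConfig())

# Padding the shorter sequence yields an attention_mask with 1 for real tokens
# and 0 for padding, matching the convention in the parameter table above.
batch = tokenizer(["ACGUN", "ACG"], return_tensors="pt", padding=True)
output = model(batch["input_ids"], attention_mask=batch["attention_mask"])
print(output["last_hidden_state"].shape)  # torch.Size([2, 7, 120])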
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertPreTrainedModel","title":"RnaBertPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = RnaBertConfig\n base_model_prefix = \"rnabert\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"RnaBertLayer\", \"RnaBertEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
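As a usage note, downloading and loading pretrained weights goes through the standard from_pretrained interface inherited from PreTrainedModel (a minimal sketch; it assumes the multimolecule/rnabert checkpoint is available on the Hugging Face Hub):
Pythonfrom multimolecule import RnaBertModel

# Downloads (if needed) and loads the pretrained RNABERT weights.
model = RnaBertModel.from_pretrained("multimolecule/rnabert")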
"},{"location":"models/rnaernie/","title":"RNAErnie","text":"Pre-trained model on non-coding RNA (ncRNA) using a multi-stage masked language modeling (MLM) objective.
"},{"location":"models/rnaernie/#statement","title":"Statement","text":"Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.
Machine learning has been at the forefront of the movement for free and open access to research.
We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.
The MultiMolecule team is committed to the principles of open access and open science.
We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.
Please consider signing the Statement on Nature Machine Intelligence.
"},{"location":"models/rnaernie/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the RNAErnie: An RNA Language Model with Structure-enhanced Representations by Ning Wang, Jiang Bian, Haoyi Xiong, et al.
The OFFICIAL repository of RNAErnie is at CatIIIIIIII/RNAErnie.
Warning
The MultiMolecule team is unable to confirm that the provided model and checkpoints produce the same intermediate representations as the original implementation, because the proposed method is published in a Closed Access / Author-Fee journal.
The team releasing RNAErnie did not write a model card for this model, so this model card has been written by the MultiMolecule team.
"},{"location":"models/rnaernie/#model-details","title":"Model Details","text":"RNAErnie is a bert-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
Note that during the conversion process, additional tokens such as [IND]
and ncRNA class symbols are removed.
multimolecule/rnaernie
The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/rnaernie/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnaernie\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.09252794831991196,\n 'token': 8,\n 'token_str': 'G',\n 'sequence': 'G G U C G C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.09062391519546509,\n 'token': 11,\n 'token_str': 'R',\n 'sequence': 'G G U C R C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.08875908702611923,\n 'token': 6,\n 'token_str': 'A',\n 'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07809742540121078,\n 'token': 20,\n 'token_str': 'V',\n 'sequence': 'G G U C V C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07325706630945206,\n 'token': 13,\n 'token_str': 'S',\n 'sequence': 'G G U C S C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnaernie/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnaernie/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, RnaErnieModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieModel.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnaernie/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaErnieForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieForSequencePrediction.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnaernie/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaErnieForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieForTokenPrediction.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnaernie/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaErnieForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieForContactPrediction.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnaernie/#training-details","title":"Training Details","text":"RNAErnie used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/rnaernie/#training-data","title":"Training Data","text":"The RNAErnie model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.
RNAErnie used a subset of RNAcentral for pre-training. The subset contains 23 million sequences. RNAErnie preprocessed all tokens by replacing “T”s with “U”s.
Note that RnaTokenizer
will convert “T”s to “U”s for you; you may disable this behaviour by passing replace_T_with_U=False
.
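For example (a minimal sketch of this toggle; the token ids follow the RnaTokenizer examples further down this page):
Pythonfrom multimolecule import RnaTokenizer

tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnaernie")
print(tokenizer("ACGT")["input_ids"])   # "T" is tokenized as "U"

tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnaernie", replace_T_with_U=False)
print(tokenizer("ACGT")["input_ids"])   # "T" is not converted and maps to the unknown token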
RNAErnie used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT: 15% of the tokens are masked; in 80% of the cases the masked tokens are replaced by <mask>, in 10% of the cases they are replaced by a random token, and in the remaining 10% they are left unchanged.
RNAErnie uses a special 3-stage training pipeline to pre-train the model, each stage with a different masking strategy:
Base-level Masking: The masking applies to each nucleotide in the sequence. Subsequence-level Masking: The masking applies to subsequences of 4-8bp in the sequence. Motif-level Masking: The model is trained on motif datasets.
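The following is a minimal, hypothetical sketch (not the official RNAErnie training code) of what base-level and subsequence-level masking could look like on top of RnaTokenizer. The span lengths of 4-8 tokens follow the description above, the 80/10/10 replacement split is the standard BERT recipe and is omitted for brevity, and special tokens are not excluded in this simplified version.
Pythonimport torch
from multimolecule import RnaTokenizer

tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnaernie")
batch = tokenizer(["UAGCUUAUCAGACUGAUGUUGA"], return_tensors="pt")
input_ids = batch["input_ids"].clone()
labels = input_ids.clone()


def subsequence_mask(length: int, ratio: float = 0.15) -> torch.Tensor:
    """Select contiguous spans of 4-8 tokens until roughly `ratio` of positions are covered."""
    selected = torch.zeros(length, dtype=torch.bool)
    while selected.float().mean() < ratio:
        span = int(torch.randint(4, 9, (1,)))
        start = int(torch.randint(0, max(length - span, 1), (1,)))
        selected[start : start + span] = True
    return selected


selected = subsequence_mask(input_ids.size(1))
labels[:, ~selected] = -100                        # only masked positions contribute to the MLM loss
input_ids[:, selected] = tokenizer.mask_token_id   # 80/10/10 replacement split omitted for brevity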
The model was trained on 4 NVIDIA V100 GPUs with 32GiB memories.
Citation information is not available for papers published in Closed Access / Author-Fee journals.
"},{"location":"models/rnaernie/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the RNAErnie paper for questions or comments on the paper/model.
"},{"location":"models/rnaernie/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie","title":"multimolecule.models.rnaernie","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig","title":"RnaErnieConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a RnaErnieModel
. It is used to instantiate a RnaErnie model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaErnie Bruce-ywj/rnaernie architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint
Vocabulary size of the RnaErnie model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [RnaErnieModel
].
26
int
Dimensionality of the encoder layers and the pooler layer.
768
int
Number of hidden layers in the Transformer encoder.
12
int
Number of attention heads for each attention layer in the Transformer encoder.
12
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
3072
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
513
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
Examples:
Python Console Session>>> from multimolecule import RnaErnieModel, RnaErnieConfig\n>>> # Initializing a rnaernie multimolecule/rnaernie style configuration\n>>> configuration = RnaErnieConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnaernie style configuration\n>>> model = RnaErnieModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnaernie/configuration_rnaernie.py
Pythonclass RnaErnieConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a\n [`RnaErnieModel`][multimolecule.models.RnaErnieModel]. It is used to instantiate a RnaErnie model according to the\n specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a\n similar configuration to that of the RnaErnie [Bruce-ywj/rnaernie](https://github.com/Bruce-ywj/rnaernie)\n architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the RnaErnie model. Defines the number of different tokens that can be represented by\n the `inputs_ids` passed when calling [`RnaErnieModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n\n Examples:\n >>> from multimolecule import RnaErnieModel, RnaErnieConfig\n >>> # Initializing a rnaernie multimolecule/rnaernie style configuration\n >>> configuration = RnaErnieConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/rnaernie style configuration\n >>> model = RnaErnieModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"rnaernie\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n hidden_size: int = 768,\n num_hidden_layers: int = 12,\n num_attention_heads: int = 12,\n intermediate_size: int = 3072,\n hidden_act: str = \"relu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 513,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"absolute\",\n is_decoder: bool = False,\n use_cache: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.type_vocab_size = 2\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.head = 
HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForContactPrediction","title":"RnaErnieForContactPrediction","text":" Bases: RnaErniePreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaErnieConfig, RnaErnieForContactPrediction, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythonclass RnaErnieForContactPrediction(RnaErniePreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaErnieConfig, RnaErnieForContactPrediction, RnaTokenizer\n >>> config = RnaErnieConfig()\n >>> model = RnaErnieForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaErnieConfig):\n super().__init__(config)\n self.rnaernie = RnaErnieModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnaernie(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForMaskedLM","title":"RnaErnieForMaskedLM","text":" Bases: RnaErniePreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaErnieConfig, RnaErnieForMaskedLM, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythonclass RnaErnieForMaskedLM(RnaErniePreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaErnieConfig, RnaErnieForMaskedLM, RnaTokenizer\n >>> config = RnaErnieConfig()\n >>> model = RnaErnieForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.bias\", \"lm_head.decoder.weight\"]\n\n def __init__(self, config: RnaErnieConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `RnaErnieForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.rnaernie = RnaErnieModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_output_embeddings(self):\n return self.lm_head.decoder\n\n def set_output_embeddings(self, new_embeddings):\n self.lm_head.decoder = new_embeddings\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnaernie(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForSequencePrediction","title":"RnaErnieForSequencePrediction","text":" Bases: RnaErniePreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaErnieConfig, RnaErnieForSequencePrediction, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythonclass RnaErnieForSequencePrediction(RnaErniePreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaErnieConfig, RnaErnieForSequencePrediction, RnaTokenizer\n >>> config = RnaErnieConfig()\n >>> model = RnaErnieForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config):\n super().__init__(config)\n self.rnaernie = RnaErnieModel(config)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnaernie(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForTokenPrediction","title":"RnaErnieForTokenPrediction","text":" Bases: RnaErniePreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaErnieConfig, RnaErnieForTokenPrediction, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythonclass RnaErnieForTokenPrediction(RnaErniePreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaErnieConfig, RnaErnieForTokenPrediction, RnaTokenizer\n >>> config = RnaErnieConfig()\n >>> model = RnaErnieForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaErnieConfig):\n super().__init__(config)\n self.rnaernie = RnaErnieModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnaernie(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel","title":"RnaErnieModel","text":" Bases: RnaErniePreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaErnieConfig, RnaErnieModel, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythonclass RnaErnieModel(RnaErniePreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaErnieConfig, RnaErnieModel, RnaTokenizer\n >>> config = RnaErnieConfig()\n >>> model = RnaErnieModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 768])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 768])\n \"\"\"\n\n def __init__(self, config: RnaErnieConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n\n self.embeddings = RnaErnieEmbeddings(config)\n self.encoder = RnaErnieEncoder(config)\n\n self.pooler = RnaErniePooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
encoder_hidden_states (Tensor | None, defaults to None): Shape (batch_size, sequence_length, hidden_size). Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
encoder_attention_mask (Tensor | None, defaults to None): Shape (batch_size, sequence_length). Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, defaults to None): Tuple of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head). Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
use_cache (bool | None, defaults to None): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
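As a usage note (a minimal sketch reusing the randomly initialised configuration from the example above, not part of the generated reference): return_dict switches the output of forward between a model-output object and a plain tuple whose first two elements are the sequence output and the pooled output.
Pythonfrom multimolecule import RnaErnieConfig, RnaErnieModel, RnaTokenizer\n\nconfig = RnaErnieConfig()\nmodel = RnaErnieModel(config)\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n\n# return_dict=True (the default) yields a BaseModelOutputWithPoolingAndCrossAttentions\noutput = model(**input, return_dict=True)\nlast_hidden_state, pooler_output = output[\"last_hidden_state\"], output[\"pooler_output\"]\n\n# return_dict=False yields a plain tuple: (sequence_output, pooled_output, ...)\nsequence_output, pooled_output = model(**input, return_dict=False)[:2]\n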
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErniePreTrainedModel","title":"RnaErniePreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythonclass RnaErniePreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = RnaErnieConfig\n base_model_prefix = \"rnaernie\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"RnaErnieLayer\", \"RnaErnieEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n\n def _set_gradient_checkpointing(self, module, value=False):\n if isinstance(module, RnaErnieEncoder):\n module.gradient_checkpointing = value\n
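The snippet below is an illustrative sketch, not part of the generated reference: it loads a pretrained RnaErnie checkpoint through the concrete classes; the checkpoint name multimolecule/rnaernie is assumed here, following the multimolecule/<model> naming convention used by the other model cards.
Pythonfrom multimolecule import RnaTokenizer, RnaErnieModel\n\n# \"multimolecule/rnaernie\" is an assumed checkpoint name following the multimolecule/<model> convention\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieModel.from_pretrained(\"multimolecule/rnaernie\")\n\ninput = tokenizer(\"UAGCUUAUCAGACUGAUGUUGA\", return_tensors=\"pt\")\noutput = model(**input)\n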
"},{"location":"models/rnafm/","title":"RNA-FM","text":"Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.
"},{"location":"models/rnafm/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions by Jiayang Chen, Zhihang Hue, Siqi Sun, et al.
The OFFICIAL repository of RNA-FM is at ml4bio/RNA-FM.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing RNA-FM did not write this model card, so it has been written by the MultiMolecule team.
"},{"location":"models/rnafm/#model-details","title":"Model Details","text":"RNA-FM is a bert-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/rnafm/#variations","title":"Variations","text":"multimolecule/rnafm
: The RNA-FM model pre-trained on non-coding RNA sequences.multimolecule/mrnafm
: The RNA-FM model pre-trained on mRNA coding sequences.The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/rnafm/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnafm\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.2752501964569092,\n 'token': 21,\n 'token_str': '.',\n 'sequence': 'G G U C. C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.22108642756938934,\n 'token': 23,\n 'token_str': '*',\n 'sequence': 'G G U C * C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.18201279640197754,\n 'token': 25,\n 'token_str': 'I',\n 'sequence': 'G G U C I C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.10875876247882843,\n 'token': 9,\n 'token_str': 'U',\n 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.08898332715034485,\n 'token': 6,\n 'token_str': 'A',\n 'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnafm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnafm/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, RnaFmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmModel.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
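The returned object carries the usual encoder outputs; the per-nucleotide embeddings and the pooled sequence embedding can be read out as below (hidden size is 640 for the default RnaFmConfig documented further down this page).
Pythonlast_hidden_state = output[\"last_hidden_state\"]  # (batch_size, sequence_length, hidden_size)\npooler_output = output[\"pooler_output\"]  # (batch_size, hidden_size)\n\nprint(last_hidden_state.shape, pooler_output.shape)\n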
"},{"location":"models/rnafm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaFmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmForSequencePrediction.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnafm/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaFmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmForTokenPrediction.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnafm/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaFmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmForContactPrediction.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnafm/#training-details","title":"Training Details","text":"RNA-FM used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/rnafm/#training-data","title":"Training Data","text":"The RNA-FM model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.
RNA-FM applied CD-HIT (CD-HIT-EST) with a cut-off at 100% sequence identity to remove redundancy from RNAcentral. The final dataset contains 23.7 million non-redundant RNA sequences.
RNA-FM preprocessed all tokens by replacing "U"s with "T"s.
Note that during model conversions, "T" is replaced back with "U". RnaTokenizer will convert "T"s to "U"s for you; you may disable this behaviour by passing replace_T_with_U=False.
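For example, the behaviour can be checked directly (the token ids follow the RnaTokenizer examples further down this page):
Pythonfrom multimolecule import RnaTokenizer\n\ntokenizer = RnaTokenizer()\nprint(tokenizer(\"acgt\")[\"input_ids\"])  # [1, 6, 7, 8, 9, 2]; \"t\" is read as \"u\"\n\ntokenizer = RnaTokenizer(replace_T_with_U=False)\nprint(tokenizer(\"acgt\")[\"input_ids\"])  # [1, 6, 7, 8, 3, 2]; \"t\" falls back to the unknown token\n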
RNA-FM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
15% of the tokens are masked. In 80% of the cases, the masked tokens are replaced by <mask>. In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace). In the remaining 10% of the cases, the masked tokens are left as is.
The model was trained on 8 NVIDIA A100 GPUs with 80GiB memories.
BibTeX:
BibTeX@article{chen2022interpretable,\n title={Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions},\n author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},\n journal={arXiv preprint arXiv:2204.00300},\n year={2022}\n}\n
"},{"location":"models/rnafm/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the RNA-FM paper for questions or comments on the paper/model.
"},{"location":"models/rnafm/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm","title":"multimolecule.models.rnafm","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig","title":"RnaFmConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a RnaFmModel
. It is used to instantiate a RNA-FM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RNA-FM ml4bio/RNA-FM architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint | None
Vocabulary size of the RNA-FM model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [RnaFmModel
]. Defaults to 25 if codon=False
else 131.
None
bool
Whether to use codon tokenization.
False
int
Dimensionality of the encoder layers and the pooler layer.
640
int
Number of hidden layers in the Transformer encoder.
12
int
Number of attention heads for each attention layer in the Transformer encoder.
20
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
5120
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
1026
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
str
Type of position embedding. Choose one of \"absolute\"
, \"relative_key\"
, \"relative_key_query\", \"rotary\"
. For positional embeddings use \"absolute\"
. For more information on \"relative_key\"
, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\"
, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
'absolute'
bool
Whether the model is used as a decoder or not. If False
, the model is used as an encoder.
False
bool
Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True
.
True
bool
Whether to apply layer normalization after embeddings but before the main stem of the network.
True
bool
When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
False
Examples:
Python Console Session>>> from multimolecule import RnaFmModel, RnaFmConfig\n>>> # Initializing a RNA-FM multimolecule/rnafm style configuration\n>>> configuration = RnaFmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnafm style configuration\n>>> model = RnaFmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnafm/configuration_rnafm.py
Pythonclass RnaFmConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`RnaFmModel`][multimolecule.models.RnaFmModel].\n It is used to instantiate a RNA-FM model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the RNA-FM\n [ml4bio/RNA-FM](https://github.com/ml4bio/RNA-FM) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the RNA-FM model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`RnaFmModel`].\n Defaults to 25 if `codon=False` else 131.\n codon:\n Whether to use codon tokenization.\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n position_embedding_type:\n Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n is_decoder:\n Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n use_cache:\n Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n relevant if `config.is_decoder=True`.\n emb_layer_norm_before:\n Whether to apply layer normalization after embeddings but before the main stem of the network.\n token_dropout:\n When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n Examples:\n >>> from multimolecule import RnaFmModel, RnaFmConfig\n >>> # Initializing a RNA-FM multimolecule/rnafm style configuration\n >>> configuration = RnaFmConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/rnafm style configuration\n >>> model = RnaFmModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"rnafm\"\n\n def __init__(\n self,\n vocab_size: int | None = None,\n codon: bool = False,\n hidden_size: int = 640,\n num_hidden_layers: int = 12,\n num_attention_heads: int = 20,\n intermediate_size: int = 5120,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1026,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"absolute\",\n is_decoder: bool = False,\n use_cache: bool = True,\n emb_layer_norm_before: bool = True,\n token_dropout: bool = False,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n if vocab_size is None:\n vocab_size = 131 if codon else 26\n self.vocab_size = vocab_size\n self.codon = codon\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.emb_layer_norm_before = emb_layer_norm_before\n self.token_dropout = token_dropout\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
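As an illustrative note derived from the constructor above (not an official recipe): enabling codon tokenization switches the default vocabulary size from 26 to 131.
Pythonfrom multimolecule import RnaFmConfig\n\nconfig = RnaFmConfig()  # nucleotide-level vocabulary\nprint(config.vocab_size)  # 26\n\ncodon_config = RnaFmConfig(codon=True)  # codon-level vocabulary\nprint(codon_config.vocab_size)  # 131\n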
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(codon)","title":"codon
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(position_embedding_type)","title":"position_embedding_type
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(is_decoder)","title":"is_decoder
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(use_cache)","title":"use_cache
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(emb_layer_norm_before)","title":"emb_layer_norm_before
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(token_dropout)","title":"token_dropout
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForContactPrediction","title":"RnaFmForContactPrediction","text":" Bases: RnaFmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaFmConfig, RnaFmForContactPrediction, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmForContactPrediction(RnaFmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaFmConfig, RnaFmForContactPrediction, RnaTokenizer\n >>> config = RnaFmConfig()\n >>> model = RnaFmForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaFmConfig):\n super().__init__(config)\n self.rnafm = RnaFmModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnafm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForMaskedLM","title":"RnaFmForMaskedLM","text":" Bases: RnaFmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaFmConfig, RnaFmForMaskedLM, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmForMaskedLM(RnaFmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaFmConfig, RnaFmForMaskedLM, RnaTokenizer\n >>> config = RnaFmConfig()\n >>> model = RnaFmForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: RnaFmConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `RnaFmForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.rnafm = RnaFmModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnafm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForPreTraining","title":"RnaFmForPreTraining","text":" Bases: RnaFmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaFmConfig, RnaFmForPreTraining, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"contact_map\"].shape\ntorch.Size([1, 5, 5, 1])\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmForPreTraining(RnaFmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaFmConfig, RnaFmForPreTraining, RnaTokenizer\n >>> config = RnaFmConfig()\n >>> model = RnaFmForPreTraining(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<AddBackward0>)\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"contact_map\"].shape\n torch.Size([1, 5, 5, 1])\n \"\"\"\n\n _tied_weights_keys = [\n \"lm_head.decoder.weight\",\n \"lm_head.decoder.bias\",\n \"pretrain.predictions.decoder.weight\",\n \"pretrain.predictions.decoder.bias\",\n \"pretrain.predictions_ss.decoder.weight\",\n \"pretrain.predictions_ss.decoder.bias\",\n ]\n\n def __init__(self, config: RnaFmConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `RnaFmForPreTraining` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.rnafm = RnaFmModel(config, add_pooling_layer=False)\n self.pretrain = RnaFmPreTrainingHeads(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_output_embeddings(self):\n return self.pretrain.predictions.decoder\n\n def set_output_embeddings(self, embeddings):\n self.pretrain.predictions.decoder = embeddings\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels_mlm: Tensor | None = None,\n labels_contact: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaFmForPreTrainingOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnafm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n total_loss, logits, contact_map = self.pretrain(\n outputs, attention_mask, input_ids, labels_mlm=labels_mlm, labels_contact=labels_contact\n )\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((total_loss,) + output) if total_loss is not None else output\n\n return RnaFmForPreTrainingOutput(\n loss=total_loss,\n logits=logits,\n contact_map=contact_map,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForSequencePrediction","title":"RnaFmForSequencePrediction","text":" Bases: RnaFmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaFmConfig, RnaFmForSequencePrediction, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmForSequencePrediction(RnaFmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaFmConfig, RnaFmForSequencePrediction, RnaTokenizer\n >>> config = RnaFmConfig()\n >>> model = RnaFmForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaFmConfig):\n super().__init__(config)\n self.rnafm = RnaFmModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnafm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForTokenPrediction","title":"RnaFmForTokenPrediction","text":" Bases: RnaFmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmForTokenPrediction(RnaFmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n >>> config = RnaFmConfig()\n >>> model = RnaFmForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaFmConfig):\n super().__init__(config)\n self.rnafm = RnaFmModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnafm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel","title":"RnaFmModel","text":" Bases: RnaFmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaFmConfig, RnaFmModel, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 640])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 640])\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmModel(RnaFmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaFmConfig, RnaFmModel, RnaTokenizer\n >>> config = RnaFmConfig()\n >>> model = RnaFmModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 640])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 640])\n \"\"\"\n\n def __init__(self, config: RnaFmConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = RnaFmEmbeddings(config)\n self.encoder = RnaFmEncoder(config)\n self.pooler = RnaFmPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name Type Description DefaultTensor | None
Shape: (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
None
Tensor | None
Shape: (batch_size, sequence_length)
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
None
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None
Tuple of length config.n_layers
with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1)
instead of all decoder_input_ids
of shape (batch_size, sequence_length)
.
None
bool | None
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see past_key_values
).
None
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmPreTrainedModel","title":"RnaFmPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code inmultimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = RnaFmConfig\n base_model_prefix = \"rnafm\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"RnaFmLayer\", \"RnaFmEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
"},{"location":"models/rnamsm/","title":"RNA-MSM","text":"Pre-trained model on non-coding RNA (ncRNA) with multi (homologous) sequence alignment using a masked language modeling (MLM) objective.
"},{"location":"models/rnamsm/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the Multiple sequence alignment-based RNA language model and its application to structural inference by Yikun Zhang, Mei Lang, Jiuhong Jiang, Zhiqiang Gao, et al.
The OFFICIAL repository of RNA-MSM is at yikunpku/RNA-MSM.
Caution
The MultiMolecule team is aware of a potential risk in reproducing the results of RNA-MSM.
The original implementation of RNA-MSM used a custom tokenizer that does not append <eos>
token to the end of the input sequence, consistent with MSA Transformer. This should not affect the performance of the model in most cases, but it may lead to unexpected behavior in some cases.
Please set eos_token=None
explicitly in the tokenizer if you want the exact behavior of the original implementation.
See more at issue #10
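For example, a minimal sketch of loading the tokenizer without an <eos> token; the checkpoint name is taken from the usage examples below, and passing eos_token=None through from_pretrained is an assumption of this sketch:
Pythonfrom multimolecule import RnaTokenizer\n\n# Sketch: reproduce the original RNA-MSM behaviour by not appending an <eos> token\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\", eos_token=None)\nprint(tokenizer(\"UAGCUUAUCAGACUGAUGUUGA\")[\"input_ids\"])\n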
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing RNA-MSM did not write a model card for this model, so this model card has been written by the MultiMolecule team.
"},{"location":"models/rnamsm/#model-details","title":"Model Details","text":"RNA-MSM is a bert-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/rnamsm/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 10 768 12 3072 95.92 21.66 10.57 1024"},{"location":"models/rnamsm/#links","title":"Links","text":"The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/rnamsm/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnamsm\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.25111356377601624,\n 'token': 9,\n 'token_str': 'U',\n 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.1200353354215622,\n 'token': 14,\n 'token_str': 'W',\n 'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.10132723301649094,\n 'token': 15,\n 'token_str': 'K',\n 'sequence': 'G G U C K C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.08383019268512726,\n 'token': 18,\n 'token_str': 'D',\n 'sequence': 'G G U C D C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.05737845227122307,\n 'token': 6,\n 'token_str': 'A',\n 'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnamsm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnamsm/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, RnaMsmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmModel.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnamsm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaMsmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmForSequencePrediction.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnamsm/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaMsmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmForTokenPrediction.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnamsm/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaMsmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmForContactPrediction.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnamsm/#training-details","title":"Training Details","text":"RNA-MSM used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/rnamsm/#training-data","title":"Training Data","text":"The RNA-MSM model was pre-trained on Rfam. The Rfam database is a collection of RNA sequence families of structural RNAs including non-coding RNA genes as well as cis-regulatory elements. RNA-MSM used Rfam 14.7 which contains 4,069 RNA families.
To avoid potential overfitting in structural inference, RNA-MSM excluded families with experimentally determined structures, such as ribosomal RNAs, transfer RNAs, and small nuclear RNAs. The final dataset contains 3,932 RNA families. The median value for the number of MSA sequences for these families by RNAcmap3 is 2,184.
To increase the number of homologous sequences, RNA-MSM used an automatic pipeline, RNAcmap3, for homolog search and sequence alignment. RNAcmap3 is a pipeline that combines the BLAST-N, INFERNAL, Easel, RNAfold and evolutionary coupling tools to generate homologous sequences.
RNA-MSM preprocessed all tokens by replacing \u201cT\u201ds with \u201cU\u201ds and substituting \u201cR\u201d, \u201cY\u201d, \u201cK\u201d, \u201cM\u201d, \u201cS\u201d, \u201cW\u201d, \u201cB\u201d, \u201cD\u201d, \u201cH\u201d, \u201cV\u201d, \u201cN\u201d with \u201cX\u201d.
Note that RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you, you may disable this behaviour by passing replace_T_with_U=False
. RnaTokenizer
does not perform other substitutions.
RNA-MSM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
15% of the tokens are masked. In 80% of the cases, the masked tokens are replaced by <mask>; in 10% of the cases, they are replaced by a random token; in the remaining 10% of the cases, they are left unchanged.
.The model was trained on 8 NVIDIA V100 GPUs with 32GiB memories.
BibTeX:
BibTeX@article{zhang2023multiple,\n author = {Zhang, Yikun and Lang, Mei and Jiang, Jiuhong and Gao, Zhiqiang and Xu, Fan and Litfin, Thomas and Chen, Ke and Singh, Jaswinder and Huang, Xiansong and Song, Guoli and Tian, Yonghong and Zhan, Jian and Chen, Jie and Zhou, Yaoqi},\n title = \"{Multiple sequence alignment-based RNA language model and its application to structural inference}\",\n journal = {Nucleic Acids Research},\n volume = {52},\n number = {1},\n pages = {e3-e3},\n year = {2023},\n month = {11},\n abstract = \"{Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because\u00a0unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, as it can provide significantly more homologous sequences than manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding with its embedding in base pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.}\",\n issn = {0305-1048},\n doi = {10.1093/nar/gkad1031},\n url = {https://doi.org/10.1093/nar/gkad1031},\n eprint = {https://academic.oup.com/nar/article-pdf/52/1/e3/55443207/gkad1031.pdf},\n}\n
"},{"location":"models/rnamsm/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the RNA-MSM paper for questions or comments on the paper/model.
"},{"location":"models/rnamsm/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm","title":"multimolecule.models.rnamsm","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig","title":"RnaMsmConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a RnaMsmModel
. It is used to instantiate a RnaMsm model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaMsm yikunpku/RNA-MSM architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint
Vocabulary size of the RnaMsm model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [RnaMsmModel
].
26
int
Dimensionality of the encoder layers and the pooler layer.
768
int
Number of hidden layers in the Transformer encoder.
10
int
Number of attention heads for each attention layer in the Transformer encoder.
12
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
3072
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
1024
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
Examples:
Python Console Session>>> from multimolecule import RnaMsmModel, RnaMsmConfig\n>>> # Initializing a RNA-MSM multimolecule/rnamsm style configuration\n>>> configuration = RnaMsmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnamsm style configuration\n>>> model = RnaMsmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnamsm/configuration_rnamsm.py
Pythonclass RnaMsmConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`RnaMsmModel`][multimolecule.models.RnaMsmModel].\n It is used to instantiate a RnaMsm model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaMsm\n [yikunpku/RNA-MSM](https://github.com/yikunpku/RNA-MSM) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the RnaMsm model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`RnaMsmModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n\n Examples:\n >>> from multimolecule import RnaMsmModel, RnaMsmConfig\n >>> # Initializing a RNA-MSM multimolecule/rnamsm style configuration\n >>> configuration = RnaMsmConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/rnamsm style configuration\n >>> model = RnaMsmModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"rnamsm\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n hidden_size: int = 768,\n num_hidden_layers: int = 10,\n num_attention_heads: int = 12,\n intermediate_size: int = 3072,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1024,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"absolute\",\n is_decoder: bool = False,\n use_cache: bool = True,\n max_tokens_per_msa: int = 2**14,\n layer_type: str = \"standard\",\n attention_type: str = \"standard\",\n embed_positions_msa: bool = True,\n attention_bias: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = 
position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.max_tokens_per_msa = max_tokens_per_msa\n self.layer_type = layer_type\n self.attention_type = attention_type\n self.embed_positions_msa = embed_positions_msa\n self.attention_bias = attention_bias\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForContactPrediction","title":"RnaMsmForContactPrediction","text":" Bases: RnaMsmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaMsmConfig, RnaMsmForContactPrediction, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmForContactPrediction(RnaMsmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaMsmConfig, RnaMsmForContactPrediction, RnaTokenizer\n >>> config = RnaMsmConfig()\n >>> model = RnaMsmForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaMsmConfig):\n super().__init__(config)\n self.rnamsm = RnaMsmModel(config, add_pooling_layer=True)\n head_config = HeadConfig(output_name=\"row_attentions\")\n self.contact_head = ContactPredictionHead(config, head_config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaMsmContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnamsm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return RnaMsmContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n col_attentions=outputs.col_attentions,\n row_attentions=outputs.row_attentions,\n )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForMaskedLM","title":"RnaMsmForMaskedLM","text":" Bases: RnaMsmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaMsmConfig, RnaMsmForMaskedLM, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmForMaskedLM(RnaMsmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaMsmConfig, RnaMsmForMaskedLM, RnaTokenizer\n >>> config = RnaMsmConfig()\n >>> model = RnaMsmForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: RnaMsmConfig):\n super().__init__(config)\n self.rnamsm = RnaMsmModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config, weight=self.rnamsm.embeddings.word_embeddings.weight)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaMsmForMaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnamsm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return RnaMsmForMaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n col_attentions=outputs.col_attentions,\n row_attentions=outputs.row_attentions,\n )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForPreTraining","title":"RnaMsmForPreTraining","text":" Bases: RnaMsmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaMsmConfig, RnaMsmForPreTraining, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"contact_map\"].shape\ntorch.Size([1, 5, 5, 1])\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmForPreTraining(RnaMsmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaMsmConfig, RnaMsmForPreTraining, RnaTokenizer\n >>> config = RnaMsmConfig()\n >>> model = RnaMsmForPreTraining(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<AddBackward0>)\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"contact_map\"].shape\n torch.Size([1, 5, 5, 1])\n \"\"\"\n\n _tied_weights_keys = [\n \"lm_head.decoder.weight\",\n \"lm_head.decoder.bias\",\n \"pretrain.predictions.decoder.weight\",\n \"pretrain.predictions.decoder.bias\",\n \"pretrain.predictions_ss.decoder.weight\",\n \"pretrain.predictions_ss.decoder.bias\",\n ]\n\n def __init__(self, config: RnaMsmConfig):\n super().__init__(config)\n self.rnamsm = RnaMsmModel(config, add_pooling_layer=False)\n self.pretrain = RnaMsmPreTrainingHeads(config, weight=self.rnamsm.embeddings.word_embeddings.weight)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels_mlm: Tensor | None = None,\n labels_contact: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaMsmForPreTrainingOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnamsm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n total_loss, logits, contact_map = self.pretrain(\n outputs, attention_mask, input_ids, labels_mlm=labels_mlm, labels_contact=labels_contact\n )\n\n if not return_dict:\n output = (logits, contact_map) + outputs[2:]\n return ((total_loss,) + output) if total_loss is not None else output\n\n return RnaMsmForPreTrainingOutput(\n loss=total_loss,\n logits=logits,\n contact_map=contact_map,\n hidden_states=outputs.hidden_states,\n col_attentions=outputs.col_attentions,\n row_attentions=outputs.row_attentions,\n )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForSequencePrediction","title":"RnaMsmForSequencePrediction","text":" Bases: RnaMsmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaMsmConfig, RnaMsmForSequencePrediction, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmForSequencePrediction(RnaMsmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaMsmConfig, RnaMsmForSequencePrediction, RnaTokenizer\n >>> config = RnaMsmConfig()\n >>> model = RnaMsmForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaMsmConfig):\n super().__init__(config)\n self.rnamsm = RnaMsmModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool = False,\n output_hidden_states: bool = False,\n return_dict: bool = True,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaMsmSequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnamsm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return RnaMsmSequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n col_attentions=outputs.col_attentions,\n row_attentions=outputs.row_attentions,\n )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForTokenPrediction","title":"RnaMsmForTokenPrediction","text":" Bases: RnaMsmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaMsmConfig, RnaMsmForTokenPrediction, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmForTokenPrediction(RnaMsmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaMsmConfig, RnaMsmForTokenPrediction, RnaTokenizer\n >>> config = RnaMsmConfig()\n >>> model = RnaMsmForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaMsmConfig):\n super().__init__(config)\n self.rnamsm = RnaMsmModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool = False,\n output_hidden_states: bool = False,\n return_dict: bool = True,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaMsmTokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnamsm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return RnaMsmTokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n col_attentions=outputs.col_attentions,\n row_attentions=outputs.row_attentions,\n )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmModel","title":"RnaMsmModel","text":" Bases: RnaMsmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaMsmConfig, RnaMsmModel, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmModel(RnaMsmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaMsmConfig, RnaMsmModel, RnaTokenizer\n >>> config = RnaMsmConfig()\n >>> model = RnaMsmModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 768])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 768])\n \"\"\"\n\n def __init__(self, config: RnaMsmConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = RnaMsmEmbeddings(config)\n self.encoder = RnaMsmEncoder(config)\n self.pooler = RnaMsmPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaMsmModelOutputWithPooling:\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n elif inputs_embeds is None:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id) if self.pad_token_id is not None else torch.ones_like(input_ids)\n )\n\n unsqueeze_input = input_ids.ndim == 2\n if unsqueeze_input:\n input_ids = input_ids.unsqueeze(1)\n if attention_mask.ndim == 2:\n attention_mask = attention_mask.unsqueeze(1)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n if unsqueeze_input:\n sequence_output = sequence_output.squeeze(1)\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return RnaMsmModelOutputWithPooling(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n hidden_states=encoder_outputs.hidden_states,\n col_attentions=encoder_outputs.col_attentions,\n row_attentions=encoder_outputs.row_attentions,\n )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmPreTrainedModel","title":"RnaMsmPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = RnaMsmConfig\n base_model_prefix = \"rnamsm\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"RnaMsmLayer\", \"RnaMsmAxialLayer\", \"RnaMsmPkmLayer\", \"RnaMsmEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm) and module.elementwise_affine:\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
"},{"location":"models/splicebert/","title":"SpliceBERT","text":"Pre-trained model on messenger RNA precursor (pre-mRNA) using a masked language modeling (MLM) objective.
"},{"location":"models/splicebert/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction by Ken Chen, et al.
The OFFICIAL repository of SpliceBERT is at chenkenbio/SpliceBERT.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing SpliceBERT did not write this model card, so it has been written by the MultiMolecule team.
"},{"location":"models/splicebert/#model-details","title":"Model Details","text":"SpliceBERT is a bert-style model pre-trained on a large corpus of messenger RNA precursor sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/splicebert/#variations","title":"Variations","text":"multimolecule/splicebert
: The SpliceBERT model.multimolecule/splicebert.510
: The intermediate SpliceBERT model.multimolecule/splicebert-human.510
: The intermediate SpliceBERT model pre-trained on human data only.The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/splicebert/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/splicebert\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.340412974357605,\n 'token': 9,\n 'token_str': 'U',\n 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.13882005214691162,\n 'token': 12,\n 'token_str': 'Y',\n 'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.056610625237226486,\n 'token': 7,\n 'token_str': 'C',\n 'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.05455885827541351,\n 'token': 19,\n 'token_str': 'H',\n 'sequence': 'G G U C H C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.05356108024716377,\n 'token': 14,\n 'token_str': 'W',\n 'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/splicebert/#downstream-use","title":"Downstream Use","text":""},{"location":"models/splicebert/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, SpliceBertModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertModel.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
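The returned object behaves like a dictionary: the per-nucleotide embeddings are in last_hidden_state and a pooled sequence-level summary is in pooler_output. Assuming the default SpliceBERT hidden size of 512 and the 22-nucleotide example above (the tokenizer adds two special tokens), the shapes are:
Python Console Session>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 24, 512])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 512])\n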
"},{"location":"models/splicebert/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, SpliceBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForSequencePrediction.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
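Because the model already returns a loss whenever labels are provided, a fine-tuning loop only needs an optimizer around the call above. The sketch below is illustrative only; the sequences, labels, learning rate, and epoch count are placeholders rather than a recommended recipe:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, SpliceBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForSequencePrediction.from_pretrained(\"multimolecule/splicebert\")\noptimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder hyper-parameter\n\nsequences = [\"UAGCUUAUCAGACUGAUGUUGA\", \"ACGUACGUACGU\"]  # toy data for illustration\nlabels = torch.tensor([1, 0])\n\nmodel.train()\nfor epoch in range(3):  # placeholder number of epochs\n    for sequence, label in zip(sequences, labels):\n        input = tokenizer(sequence, return_tensors=\"pt\")\n        output = model(**input, labels=label.unsqueeze(0))\n        output[\"loss\"].backward()\n        optimizer.step()\n        optimizer.zero_grad()\n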
"},{"location":"models/splicebert/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, SpliceBertForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForTokenPrediction.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/splicebert/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, SpliceBertForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForContactPrediction.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
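The contact head produces one logit per pair of nucleotides, so the logits above have shape (batch, length, length, 1). For a binary contact task trained as in the example, probabilities can be recovered with a sigmoid; the 0.5 threshold below is an arbitrary illustration:
Pythonprobabilities = output[\"logits\"].squeeze(-1).sigmoid()  # (batch, length, length)\ncontacts = probabilities > 0.5  # boolean contact map at an arbitrary threshold\n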
"},{"location":"models/splicebert/#training-details","title":"Training Details","text":"SpliceBERT used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/splicebert/#training-data","title":"Training Data","text":"The SpliceBERT model was pre-trained on messenger RNA precursor sequences from UCSC Genome Browser. UCSC Genome Browser provides visualization, analysis, and download of comprehensive vertebrate genome data with aligned annotation tracks (known genes, predicted genes, ESTs, mRNAs, CpG islands, etc.).
SpliceBERT collected reference genomes and gene annotations from the UCSC Genome Browser for 72 vertebrate species. It applied bedtools getfasta to extract pre-mRNA sequences from the reference genomes based on the gene annotations. The pre-mRNA sequences are then used to pre-train SpliceBERT. The pre-training data contains 2 million pre-mRNA sequences with a total length of 65 billion nucleotides.
Note
RnaTokenizer will convert “T”s to “U”s for you; you may disable this behaviour by passing replace_T_with_U=False.
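The effect is easy to verify; the token ids below are taken from the RnaTokenizer examples further down this page:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> RnaTokenizer()('acgt')['input_ids']  # T is replaced by U\n[1, 6, 7, 8, 9, 2]\n>>> RnaTokenizer(replace_T_with_U=False)('acgt')['input_ids']  # T is kept as its own token\n[1, 6, 7, 8, 3, 2]\n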
SpliceBERT used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
15% of the tokens are masked.
In 80% of the cases, the masked tokens are replaced by <mask>.
In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
In the 10% remaining cases, the masked tokens are left as is.
The model was trained on 8 NVIDIA V100 GPUs.
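To make the input and label construction concrete, the following is a minimal illustrative sketch of the BERT-style masking described above in plain PyTorch. It is not the actual SpliceBERT training code; the function name and its arguments are hypothetical.
Pythonimport torch\n\n\ndef mask_tokens(input_ids, mask_token_id, vocab_size, special_token_ids, mlm_probability=0.15):\n    # BERT-style dynamic masking: returns corrupted inputs and MLM labels\n    labels = input_ids.clone()\n    input_ids = input_ids.clone()\n    # never select special tokens such as <cls>, <eos> or <pad>\n    special = torch.zeros_like(input_ids, dtype=torch.bool)\n    for token_id in special_token_ids:\n        special |= input_ids == token_id\n    probability = torch.full(input_ids.shape, mlm_probability)\n    probability[special] = 0.0\n    masked = torch.bernoulli(probability).bool()\n    labels[~masked] = -100  # loss is only computed on the selected positions\n    # 80% of the selected positions become the mask token\n    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked\n    input_ids[replaced] = mask_token_id\n    # half of the remainder (10% overall) becomes a random token; the rest is left as is\n    randomised = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replaced\n    input_ids[randomised] = torch.randint(vocab_size, input_ids.shape)[randomised]\n    return input_ids, labels\n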
SpliceBERT trained the model in a two-stage training process; the intermediate model after the first stage is available as multimolecule/splicebert.510.
SpliceBERT also pre-trained a model on human data only to validate the contribution of multi-species pre-training. The intermediate model after the first stage is available as multimolecule/splicebert-human.510.
BibTeX:
BibTeX@article {chen2023self,\n author = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},\n title = {Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction},\n elocation-id = {2023.01.31.526427},\n year = {2023},\n doi = {10.1101/2023.01.31.526427},\n publisher = {Cold Spring Harbor Laboratory},\n abstract = {RNA splicing is an important post-transcriptional process of gene expression in eukaryotic cells. Predicting RNA splicing from primary sequences can facilitate the interpretation of genomic variants. In this study, we developed a novel self-supervised pre-trained language model, SpliceBERT, to improve sequence-based RNA splicing prediction. Pre-training on pre-mRNA sequences from vertebrates enables SpliceBERT to capture evolutionary conservation information and characterize the unique property of splice sites. SpliceBERT also improves zero-shot prediction of variant effects on splicing by considering sequence context information, and achieves superior performance for predicting branchpoint in the human genome and splice sites across species. Our study highlighted the importance of pre-training genomic language models on a diverse range of species and suggested that pre-trained language models were promising for deciphering the sequence logic of RNA splicing.Competing Interest StatementThe authors have declared no competing interest.},\n URL = {https://www.biorxiv.org/content/early/2023/05/09/2023.01.31.526427},\n eprint = {https://www.biorxiv.org/content/early/2023/05/09/2023.01.31.526427.full.pdf},\n journal = {bioRxiv}\n}\n
"},{"location":"models/splicebert/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the SpliceBERT paper for questions or comments on the paper/model.
"},{"location":"models/splicebert/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert","title":"multimolecule.models.splicebert","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
alphabet (Alphabet | str | List[str] | None, default None): Alphabet to use for tokenization. If None, the standard RNA alphabet will be used. If a string, it should correspond to the name of a predefined alphabet: standard, extended, streamline, or nucleobase. If an alphabet or a list of characters, that specific alphabet will be used.
nmers (int, default 1): Size of kmer to tokenize.
codon (bool, default False): Whether to tokenize into codons.
replace_T_with_U (bool, default True): Whether to replace T with U.
do_upper_case (bool, default True): Whether to convert input to uppercase.
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
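Since RnaTokenizer follows the standard transformers tokenizer API, batches of sequences of different lengths can be padded in a single call. The example below is a sketch that assumes this API; the shape follows from the two special tokens added per sequence:
Pythonfrom multimolecule import RnaTokenizer\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nbatch = tokenizer([\"ACGU\", \"ACGUACGU\"], padding=True, return_tensors=\"pt\")\nprint(batch[\"input_ids\"].shape)    # torch.Size([2, 10]): longest sequence plus two special tokens\nprint(batch[\"attention_mask\"][0])  # padded positions of the shorter sequence are 0\n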
"},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig","title":"SpliceBertConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a SpliceBertModel
. It is used to instantiate a SpliceBert model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SpliceBert biomed-AI/SpliceBERT architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
vocab_size (int, default 26): Vocabulary size of the SpliceBert model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling SpliceBertModel.
hidden_size (int, default 512): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, default 6): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, default 16): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, default 2048): Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
hidden_dropout (float, default 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (float, default 0.1): The dropout ratio for the attention probabilities.
max_position_embeddings (int, default 1026): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
initializer_range (float, default 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, default 1e-12): The epsilon used by the layer normalization layers.
Examples:
Python Console Session>>> from multimolecule import SpliceBertModel, SpliceBertConfig\n>>> # Initializing a SpliceBERT multimolecule/splicebert style configuration\n>>> configuration = SpliceBertConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/splicebert style configuration\n>>> model = SpliceBertModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
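Because these are ordinary constructor arguments, the defaults listed above can be overridden to build a smaller or larger randomly initialised model; the values below are arbitrary examples:
Pythonfrom multimolecule import SpliceBertConfig, SpliceBertModel\n\n\nconfig = SpliceBertConfig(hidden_size=256, num_hidden_layers=4, num_attention_heads=8, intermediate_size=1024)\nmodel = SpliceBertModel(config)  # randomly initialised with the custom architecture\n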
Source code in multimolecule/models/splicebert/configuration_splicebert.py
Pythonclass SpliceBertConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a\n [`SpliceBertModel`][multimolecule.models.SpliceBertModel]. It is used to instantiate a SpliceBert model according\n to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will\n yield a similar configuration to that of the SpliceBert\n [biomed-AI/SpliceBERT](https://github.com/biomed-AI/SpliceBERT) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the SpliceBert model. Defines the number of different tokens that can be represented by\n the `inputs_ids` passed when calling [`SpliceBertModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n\n Examples:\n >>> from multimolecule import SpliceBertModel, SpliceBertConfig\n >>> # Initializing a SpliceBERT multimolecule/splicebert style configuration\n >>> configuration = SpliceBertConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/splicebert style configuration\n >>> model = SpliceBertModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"splicebert\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n hidden_size: int = 512,\n num_hidden_layers: int = 6,\n num_attention_heads: int = 16,\n intermediate_size: int = 2048,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1026,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"absolute\",\n is_decoder: bool = False,\n use_cache: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.type_vocab_size = 2\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = 
use_cache\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForContactPrediction","title":"SpliceBertForContactPrediction","text":" Bases: SpliceBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import SpliceBertConfig, SpliceBertForContactPrediction, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythonclass SpliceBertForContactPrediction(SpliceBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import SpliceBertConfig, SpliceBertForContactPrediction, RnaTokenizer\n >>> config = SpliceBertConfig()\n >>> model = SpliceBertForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: SpliceBertConfig):\n super().__init__(config)\n self.splicebert = SpliceBertModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.splicebert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForMaskedLM","title":"SpliceBertForMaskedLM","text":" Bases: SpliceBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import SpliceBertConfig, SpliceBertForMaskedLM, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythonclass SpliceBertForMaskedLM(SpliceBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import SpliceBertConfig, SpliceBertForMaskedLM, RnaTokenizer\n >>> config = SpliceBertConfig()\n >>> model = SpliceBertForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.bias\", \"lm_head.decoder.weight\"]\n\n def __init__(self, config: SpliceBertConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `SpliceBertForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.splicebert = SpliceBertModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_output_embeddings(self):\n return self.lm_head.decoder\n\n def set_output_embeddings(self, new_embeddings):\n self.lm_head.decoder = new_embeddings\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.splicebert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForSequencePrediction","title":"SpliceBertForSequencePrediction","text":" Bases: SpliceBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import SpliceBertConfig, SpliceBertForSequencePrediction, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythonclass SpliceBertForSequencePrediction(SpliceBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import SpliceBertConfig, SpliceBertForSequencePrediction, RnaTokenizer\n >>> config = SpliceBertConfig()\n >>> model = SpliceBertForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: SpliceBertConfig):\n super().__init__(config)\n self.splicebert = SpliceBertModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.splicebert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForTokenPrediction","title":"SpliceBertForTokenPrediction","text":" Bases: SpliceBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import SpliceBertConfig, SpliceBertForTokenPrediction, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythonclass SpliceBertForTokenPrediction(SpliceBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import SpliceBertConfig, SpliceBertForTokenPrediction, RnaTokenizer\n >>> config = SpliceBertConfig()\n >>> model = SpliceBertForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: SpliceBertConfig):\n super().__init__(config)\n self.splicebert = SpliceBertModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.splicebert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel","title":"SpliceBertModel","text":" Bases: SpliceBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import SpliceBertConfig, SpliceBertModel, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 512])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 512])\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythonclass SpliceBertModel(SpliceBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import SpliceBertConfig, SpliceBertModel, RnaTokenizer\n >>> config = SpliceBertConfig()\n >>> model = SpliceBertModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 512])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 512])\n \"\"\"\n\n def __init__(self, config: SpliceBertConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = SpliceBertEmbeddings(config)\n self.encoder = SpliceBertEncoder(config)\n self.pooler = SpliceBertPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
encoder_hidden_states (Tensor | None, default None): Shape (batch_size, sequence_length, hidden_size). Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
encoder_attention_mask (Tensor | None, default None): Shape (batch_size, sequence_length). Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default None): Tuple of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head). Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
use_cache (bool | None, default None): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertPreTrainedModel","title":"SpliceBertPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code inmultimolecule/models/splicebert/modeling_splicebert.py
Pythonclass SpliceBertPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = SpliceBertConfig\n base_model_prefix = \"splicebert\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"SpliceBertLayer\", \"SpliceBertEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n\n def _set_gradient_checkpointing(self, module, value=False):\n if isinstance(module, SpliceBertEncoder):\n module.gradient_checkpointing = value\n
"},{"location":"models/utrbert/","title":"3UTRBERT","text":"Pre-trained model on 3\u2019 untranslated region (3\u2019UTR) using a masked language modeling (MLM) objective.
"},{"location":"models/utrbert/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the Deciphering 3\u2019 UTR mediated gene regulation using interpretable deep representation learning by Yuning Yang, Gen Li, et al.
The OFFICIAL repository of 3UTRBERT is at yangyn533/3UTRBERT.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing 3UTRBERT did not write a model card for this model, so this model card has been written by the MultiMolecule team.
"},{"location":"models/utrbert/#model-details","title":"Model Details","text":"3UTRBERT is a bert-style model pre-trained on a large corpus of 3\u2019 untranslated regions (3\u2019UTRs) in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/utrbert/#variations","title":"Variations","text":"multimolecule/utrbert-3mer
: The 3UTRBERT model pre-trained on 3-mer data.multimolecule/utrbert-4mer
: The 3UTRBERT model pre-trained on 4-mer data.multimolecule/utrbert-5mer
: The 3UTRBERT model pre-trained on 5-mer data.multimolecule/utrbert-6mer
: The 3UTRBERT model pre-trained on 6-mer data.The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
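If you prefer the transformers Auto classes, importing multimolecule first registers its configurations, models, and tokenizers with them (the same registration the pipeline example in the next section relies on). The sketch below assumes this registration and network access to the checkpoint; it is illustrative rather than the canonical loading path:
Python
# Minimal sketch: load a 3UTRBERT checkpoint through the transformers Auto classes.
# Assumption: importing multimolecule registers its config/model/tokenizer classes
# with the Auto* factories, as the pipeline example in the next section relies on.
import multimolecule  # noqa: F401  -- imported for its registration side effect
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("multimolecule/utrbert-3mer")
model = AutoModel.from_pretrained("multimolecule/utrbert-3mer")

inputs = tokenizer("UAGCUUAUCAGACUGAUGUUGA", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)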
"},{"location":"models/utrbert/#direct-use","title":"Direct Use","text":"Note: Default transformers pipeline does not support K-mer tokenization.
You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/utrbert-3mer\")\n>>> unmasker(\"gguc<mask><mask><mask>cugguuagaccagaucugagccu\")[1]\n\n[{'score': 0.40745577216148376,\n 'token': 47,\n 'token_str': 'CUC',\n 'sequence': '<cls> GGU GUC <mask> CUC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.40001827478408813,\n 'token': 32,\n 'token_str': 'CAC',\n 'sequence': '<cls> GGU GUC <mask> CAC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.14566268026828766,\n 'token': 37,\n 'token_str': 'CCC',\n 'sequence': '<cls> GGU GUC <mask> CCC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.04422207176685333,\n 'token': 42,\n 'token_str': 'CGC',\n 'sequence': '<cls> GGU GUC <mask> CGC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.0008025980787351727,\n 'token': 34,\n 'token_str': 'CAU',\n 'sequence': '<cls> GGU GUC <mask> CAU <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'}]\n
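Because the 3-mer tokenizer emits overlapping k-mers with stride 1, each nucleotide (away from the sequence boundaries) appears in up to three consecutive tokens, which is why the predicted sequences above look repetitive. The snippet below is a small, illustrative helper (not part of MultiMolecule) that collapses such overlapping tokens back into a nucleotide string once special tokens such as <cls>, <eos>, and <mask> have been stripped:
Python
def kmers_to_sequence(tokens: list[str]) -> str:
    """Collapse overlapping stride-1 k-mer tokens back into a nucleotide string.

    Assumes special tokens (<cls>, <eos>, <mask>, ...) have already been removed.
    """
    if not tokens:
        return ""
    # The first k-mer contributes all of its characters; every later k-mer overlaps
    # the previous one by k - 1 characters and therefore only adds its last character.
    return tokens[0] + "".join(token[-1] for token in tokens[1:])

print(kmers_to_sequence(["GGU", "GUC", "UCC", "CCU", "CUC"]))  # GGUCCUC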
"},{"location":"models/utrbert/#downstream-use","title":"Downstream Use","text":""},{"location":"models/utrbert/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, UtrBertModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertModel.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/utrbert/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertForSequencePrediction.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrbert/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrBertForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertForTokenPrediction.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrbert/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrBertForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertForContactPrediction.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrbert/#training-details","title":"Training Details","text":"3UTRBERT used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/utrbert/#training-data","title":"Training Data","text":"The 3UTRBERT model was pre-trained on human mRNA transcript sequences from GENCODE. GENCODE aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. The GENCODE release 40 used by this work contains 61,544 genes, and 246,624 transcripts.
3UTRBERT collected the human mRNA transcript sequences from GENCODE, including 108,573 unique mRNA transcripts. Only the longest transcript of each gene was used in the pre-training process. 3UTRBERT used only the 3\u2019 untranslated regions (3\u2019UTRs) of the mRNA transcripts for pre-training, to avoid codon constraints in the CDS region and to avoid the added complexity of modeling entire mRNA transcripts. The average length of the 3\u2019UTRs was 1,227 nucleotides, while the median length was 631 nucleotides. Each 3\u2019UTR sequence was cut into non-overlapping patches of 510 nucleotides. The remaining shorter sequences were padded to the same length.
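As an illustration of this chunking step, here is a minimal sketch (illustrative only, not the original preprocessing code; the pad character below is a placeholder for whatever padding is applied in practice):
Python
def chunk_3utr(sequence: str, patch_size: int = 510, pad_char: str = "N") -> list[str]:
    """Cut a 3'UTR into non-overlapping patches of `patch_size` nucleotides.

    The last, shorter patch is right-padded so that every patch has the same length.
    """
    patches = [sequence[i : i + patch_size] for i in range(0, len(sequence), patch_size)]
    if patches and len(patches[-1]) < patch_size:
        patches[-1] = patches[-1].ljust(patch_size, pad_char)
    return patches

# A 1,200-nt UTR yields three patches of 510 nt each; the last one is padded.
print([len(p) for p in chunk_3utr("ACGU" * 300)])  # [510, 510, 510]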
Note RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False
.
3UTRBERT used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
In 80% of the cases, the masked tokens are replaced by <mask>
; in 10% of the cases they are replaced by a random token, and in the remaining 10% they are left unchanged. Since 3UTRBERT uses a k-mer tokenizer, it masks the entire k-mer instead of individual nucleotides to avoid information leakage.
For example, if the k-mer size is 3, the sequence \"UAGCGUAU\"
will be tokenized as [\"UAG\", \"AGC\", \"GCG\", \"CGU\", \"GUA\", \"UAU\"]
. If the nucleotide \"C\"
is masked, the adjacent tokens will also be masked, resulting in [\"UAG\", \"<mask>\", \"<mask>\", \"<mask>\", \"GUA\", \"UAU\"]
.
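To make this concrete, the sketch below (plain Python, not MultiMolecule code) tokenizes a sequence into overlapping 3-mers and masks every 3-mer that covers a chosen nucleotide position, reproducing the example above:
Python
def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Overlapping k-mer tokenization with stride 1, as used by 3UTRBERT."""
    return [sequence[i : i + k] for i in range(len(sequence) - k + 1)]

def mask_nucleotide(tokens: list[str], position: int, k: int = 3, mask: str = "<mask>") -> list[str]:
    """Mask every k-mer token that covers the nucleotide at `position`."""
    return [mask if start <= position < start + k else token for start, token in enumerate(tokens)]

tokens = kmer_tokenize("UAGCGUAU")           # ['UAG', 'AGC', 'GCG', 'CGU', 'GUA', 'UAU']
print(mask_nucleotide(tokens, position=3))   # ['UAG', '<mask>', '<mask>', '<mask>', 'GUA', 'UAU']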
The model was trained on 4 NVIDIA Quadro RTX 6000 GPUs, each with 24 GiB of memory.
BibTeX:
BibTeX@article {yang2023deciphering,\n author = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Li, Xiangtao and Zhang, Zhaolei},\n title = {Deciphering 3{\\textquoteright} UTR mediated gene regulation using interpretable deep representation learning},\n elocation-id = {2023.09.08.556883},\n year = {2023},\n doi = {10.1101/2023.09.08.556883},\n publisher = {Cold Spring Harbor Laboratory},\n abstract = {The 3{\\textquoteright}untranslated regions (3{\\textquoteright}UTRs) of messenger RNAs contain many important cis-regulatory elements that are under functional and evolutionary constraints. We hypothesize that these constraints are similar to grammars and syntaxes in human languages and can be modeled by advanced natural language models such as Transformers, which has been very effective in modeling protein sequence and structures. Here we describe 3UTRBERT, which implements an attention-based language model, i.e., Bidirectional Encoder Representations from Transformers (BERT). 3UTRBERT was pre-trained on aggregated 3{\\textquoteright}UTR sequences of human mRNAs in a task-agnostic manner; the pre-trained model was then fine-tuned for specific downstream tasks such as predicting RBP binding sites, m6A RNA modification sites, and predicting RNA sub-cellular localizations. Benchmark results showed that 3UTRBERT generally outperformed other contemporary methods in each of these tasks. We also showed that the self-attention mechanism within 3UTRBERT allows direct visualization of the semantic relationship between sequence elements.Competing Interest StatementThe authors have declared no competing interest.},\n URL = {https://www.biorxiv.org/content/early/2023/09/12/2023.09.08.556883},\n eprint = {https://www.biorxiv.org/content/early/2023/09/12/2023.09.08.556883.full.pdf},\n journal = {bioRxiv}\n}\n
"},{"location":"models/utrbert/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the 3UTRBERT paper for questions or comments on the paper/model.
"},{"location":"models/utrbert/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert","title":"multimolecule.models.utrbert","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig","title":"UtrBertConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a UtrBertModel
. It is used to instantiate a 3UTRBERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the 3UTRBERT yangyn533/3UTRBERT architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint | None
Vocabulary size of the UTRBERT model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [BertModel
].
None
int | None
kmer size of the UTRBERT model. Defines the vocabulary size of the model.
None
int
Dimensionality of the encoder layers and the pooler layer.
768
int
Number of hidden layers in the Transformer encoder.
12
int
Number of attention heads for each attention layer in the Transformer encoder.
12
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
3072
str
The non-linear activation function (function or string) in the encoder and pooler. If string, \"gelu\"
, \"relu\"
, \"silu\"
and \"gelu_new\"
are supported.
'gelu'
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
512
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
str
Type of position embedding. Choose one of \"absolute\"
, \"relative_key\"
, \"relative_key_query\"
. For positional embeddings use \"absolute\"
. For more information on \"relative_key\"
, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\"
, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
'absolute'
bool
Whether the model is used as a decoder or not. If False
, the model is used as an encoder.
False
bool
Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True
.
True
Examples:
Python Console Session>>> from multimolecule import UtrBertConfig, UtrBertModel\n>>> # Initializing a UtrBERT multimolecule/utrbert style configuration\n>>> configuration = UtrBertConfig(vocab_size=26, nmers=1)\n>>> # Initializing a model (with random weights) from the multimolecule/utrbert style configuration\n>>> model = UtrBertModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/utrbert/configuration_utrbert.py
Pythonclass UtrBertConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`UtrBertModel`][multimolecule.models.UtrBertModel].\n It is used to instantiate a 3UTRBERT model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the 3UTRBERT\n [yangyn533/3UTRBERT](https://github.com/yangyn533/3UTRBERT) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the UTRBERT model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`BertModel`].\n nmers:\n kmer size of the UTRBERT model. Defines the vocabulary size of the model.\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_act:\n The non-linear activation function (function or string) in the encoder and pooler. If string, `\"gelu\"`,\n `\"relu\"`, `\"silu\"` and `\"gelu_new\"` are supported.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n position_embedding_type:\n Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\"`. For\n positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n is_decoder:\n Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n use_cache:\n Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n relevant if `config.is_decoder=True`.\n\n Examples:\n >>> from multimolecule import UtrBertConfig, UtrBertModel\n >>> # Initializing a UtrBERT multimolecule/utrbert style configuration\n >>> configuration = UtrBertConfig(vocab_size=26, nmers=1)\n >>> # Initializing a model (with random weights) from the multimolecule/utrbert style configuration\n >>> model = UtrBertModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"utrbert\"\n\n def __init__(\n self,\n vocab_size: int | None = None,\n nmers: int | None = None,\n hidden_size: int = 768,\n num_hidden_layers: int = 12,\n num_attention_heads: int = 12,\n intermediate_size: int = 3072,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 512,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"absolute\",\n is_decoder: bool = False,\n use_cache: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.type_vocab_size = 2\n self.nmers = nmers\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.hidden_act = hidden_act\n self.intermediate_size = intermediate_size\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(nmers)","title":"nmers
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(hidden_act)","title":"hidden_act
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(position_embedding_type)","title":"position_embedding_type
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(is_decoder)","title":"is_decoder
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(use_cache)","title":"use_cache
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForContactPrediction","title":"UtrBertForContactPrediction","text":" Bases: UtrBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrBertConfig, UtrBertForContactPrediction, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=1)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertForContactPrediction(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythonclass UtrBertForContactPrediction(UtrBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrBertConfig, UtrBertForContactPrediction, RnaTokenizer\n >>> tokenizer = RnaTokenizer(nmers=1)\n >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n >>> model = UtrBertForContactPrediction(config)\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: UtrBertConfig):\n super().__init__(config)\n self.utrbert = UtrBertModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrbert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForMaskedLM","title":"UtrBertForMaskedLM","text":" Bases: UtrBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrBertConfig, UtrBertForMaskedLM, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=2)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertForMaskedLM(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 6, 31])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythonclass UtrBertForMaskedLM(UtrBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrBertConfig, UtrBertForMaskedLM, RnaTokenizer\n >>> tokenizer = RnaTokenizer(nmers=2)\n >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n >>> model = UtrBertForMaskedLM(config)\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 6, 31])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: UtrBertConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `BertForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.utrbert = UtrBertModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_output_embeddings(self):\n return self.lm_head.decoder\n\n def set_output_embeddings(self, new_embeddings):\n self.lm_head.decoder = new_embeddings\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrbert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForSequencePrediction","title":"UtrBertForSequencePrediction","text":" Bases: UtrBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrBertConfig, UtrBertForSequencePrediction, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=4)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertForSequencePrediction(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythonclass UtrBertForSequencePrediction(UtrBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrBertConfig, UtrBertForSequencePrediction, RnaTokenizer\n >>> tokenizer = RnaTokenizer(nmers=4)\n >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n >>> model = UtrBertForSequencePrediction(config)\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: UtrBertConfig):\n super().__init__(config)\n self.utrbert = UtrBertModel(config)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrbert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForTokenPrediction","title":"UtrBertForTokenPrediction","text":" Bases: UtrBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrBertConfig, UtrBertForTokenPrediction, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=2)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size, nmers=2)\n>>> model = UtrBertForTokenPrediction(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythonclass UtrBertForTokenPrediction(UtrBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrBertConfig, UtrBertForTokenPrediction, RnaTokenizer\n >>> tokenizer = RnaTokenizer(nmers=2)\n >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size, nmers=2)\n >>> model = UtrBertForTokenPrediction(config)\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: UtrBertConfig):\n super().__init__(config)\n self.num_labels = config.num_labels\n self.utrbert = UtrBertModel(config, add_pooling_layer=False)\n self.token_head = TokenKMerHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrbert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel","title":"UtrBertModel","text":" Bases: UtrBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrBertConfig, UtrBertModel, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=1)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertModel(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythonclass UtrBertModel(UtrBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrBertConfig, UtrBertModel, RnaTokenizer\n >>> tokenizer = RnaTokenizer(nmers=1)\n >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n >>> model = UtrBertModel(config)\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 768])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 768])\n \"\"\"\n\n def __init__(self, config: UtrBertConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = UtrBertEmbeddings(config)\n self.encoder = UtrBertEncoder(config)\n self.pooler = UtrBertPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name Type Description DefaultTensor | None
Shape: (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
None
Tensor | None
Shape: (batch_size, sequence_length)
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]
:
None
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None
Tuple of length config.n_layers
with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1)
instead of all decoder_input_ids
of shape (batch_size, sequence_length)
.
None
bool | None
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see past_key_values
).
None
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertPreTrainedModel","title":"UtrBertPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythonclass UtrBertPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = UtrBertConfig\n base_model_prefix = \"utrbert\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"UtrBertLayer\", \"UtrBertEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
"},{"location":"models/utrlm/","title":"UTR-LM","text":"Pre-trained model on 5\u2019 untranslated region (5\u2019UTR) using masked language modeling (MLM), Secondary Structure (SS), and Minimum Free Energy (MFE) objectives.
"},{"location":"models/utrlm/#statement","title":"Statement","text":"A 5\u2019 UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.
Machine learning has been at the forefront of the movement for free and open access to research.
We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.
The MultiMolecule team is committed to the principles of open access and open science.
We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.
Please consider signing the Statement on Nature Machine Intelligence.
"},{"location":"models/utrlm/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the A 5\u2019 UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions by Yanyi Chu, Dan Yu, et al.
The OFFICIAL repository of UTR-LM is at a96123155/UTR-LM.
Warning
The MultiMolecule team is unable to confirm that the provided model and checkpoints produce the same intermediate representations as the original implementation. This is because:
The proposed method is published in a Closed Access / Author-Fee journal.
The team releasing UTR-LM did not write this model card, so it has been written by the MultiMolecule team.
"},{"location":"models/utrlm/#model-details","title":"Model Details","text":"UTR-LM is a bert-style model pre-trained on a large corpus of 5\u2019 untranslated regions (5\u2019UTRs) in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/utrlm/#variations","title":"Variations","text":"multimolecule/utrlm-te_el
: The UTR-LM model for Translation Efficiency of transcripts and mRNA Expression Level. multimolecule/utrlm-mrl
: The UTR-LM model for Mean Ribosome Loading. The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/utrlm/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/utrlm-te_el\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.07707168161869049,\n 'token': 23,\n 'token_str': '*',\n 'sequence': 'G G U C * C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07588472962379456,\n 'token': 5,\n 'token_str': '<null>',\n 'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07178673148155212,\n 'token': 9,\n 'token_str': 'U',\n 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.06414645165205002,\n 'token': 10,\n 'token_str': 'N',\n 'sequence': 'G G U C N C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.06385370343923569,\n 'token': 12,\n 'token_str': 'Y',\n 'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/utrlm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/utrlm/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, UtrLmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmModel.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
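The call above returns a standard Transformers model output; as a quick sketch (exact sizes depend on the checkpoint), the token-level embeddings and the pooled sequence embedding can be read off as follows:
Pythonlast_hidden_state = output.last_hidden_state  # per-token embeddings: (batch_size, sequence_length, hidden_size)\npooled_output = output.pooler_output  # one vector per sequence: (batch_size, hidden_size)\nprint(last_hidden_state.shape, pooled_output.shape)\n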
"},{"location":"models/utrlm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrLmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForSequencePrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
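To fine-tune on more than one sequence at a time, the tokenizer can pad a batch to a common length. The following is a minimal sketch that assumes the standard Hugging Face padding interface (padding=True) and one integer label per sequence:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrLmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForSequencePrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntexts = [\"UAGCUUAUCAGACUGAUGUUGA\", \"GGUCACUCUGGUUAGACCAGAUCUGAGCCU\"]\ninput = tokenizer(texts, return_tensors=\"pt\", padding=True)  # pads the shorter sequence\nlabel = torch.tensor([1, 0])  # one label per sequence\n\noutput = model(**input, labels=label)\n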
"},{"location":"models/utrlm/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrLmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForTokenPrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrlm/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrLmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForContactPrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrlm/#training-details","title":"Training Details","text":"UTR-LM used a mixed training strategy with one self-supervised task and two supervised tasks, where the labels of both supervised tasks are calculated using ViennaRNA.
<mask>
token in the MLM task. The UTR-LM model was pre-trained on 5\u2019 UTR sequences from three sources:
UTR-LM preprocessed the 5\u2019 UTR sequences in a 4-step pipeline:
Note RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you; you can disable this behaviour by passing replace_T_with_U=False
.
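For reference, the secondary structure and minimum free energy labels used by the two supervised objectives can be computed with the ViennaRNA Python bindings. This is a minimal sketch (assuming the ViennaRNA package and its RNA module are installed), not the exact pipeline used by the authors:
Pythonimport RNA  # ViennaRNA Python bindings\n\nsequence = \"UAGCUUAUCAGACUGAUGUUGA\"\nstructure, mfe = RNA.fold(sequence)  # dot-bracket secondary structure and minimum free energy (kcal/mol)\nprint(structure, mfe)\n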
UTR-LM used masked language modeling (MLM) as one of the pre-training objectives. The masking procedure is similar to the one used in BERT:
<mask>
. The model was trained on two clusters:
BibTeX:
BibTeX@article {chu2023a,\n author = {Chu, Yanyi and Yu, Dan and Li, Yupeng and Huang, Kaixuan and Shen, Yue and Cong, Le and Zhang, Jason and Wang, Mengdi},\n title = {A 5{\\textquoteright} UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions},\n elocation-id = {2023.10.11.561938},\n year = {2023},\n doi = {10.1101/2023.10.11.561938},\n publisher = {Cold Spring Harbor Laboratory},\n abstract = {The 5{\\textquoteright} UTR, a regulatory region at the beginning of an mRNA molecule, plays a crucial role in regulating the translation process and impacts the protein expression level. Language models have showcased their effectiveness in decoding the functions of protein and genome sequences. Here, we introduced a language model for 5{\\textquoteright} UTR, which we refer to as the UTR-LM. The UTR-LM is pre-trained on endogenous 5{\\textquoteright} UTRs from multiple species and is further augmented with supervised information including secondary structure and minimum free energy. We fine-tuned the UTR-LM in a variety of downstream tasks. The model outperformed the best-known benchmark by up to 42\\% for predicting the Mean Ribosome Loading, and by up to 60\\% for predicting the Translation Efficiency and the mRNA Expression Level. The model also applies to identifying unannotated Internal Ribosome Entry Sites within the untranslated region and improves the AUPR from 0.37 to 0.52 compared to the best baseline. Further, we designed a library of 211 novel 5{\\textquoteright} UTRs with high predicted values of translation efficiency and evaluated them via a wet-lab assay. Experiment results confirmed that our top designs achieved a 32.5\\% increase in protein production level relative to well-established 5{\\textquoteright} UTR optimized for therapeutics.Competing Interest StatementThe authors have declared no competing interest.},\n URL = {https://www.biorxiv.org/content/early/2023/10/14/2023.10.11.561938},\n eprint = {https://www.biorxiv.org/content/early/2023/10/14/2023.10.11.561938.full.pdf},\n journal = {bioRxiv}\n}\n
"},{"location":"models/utrlm/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the UTR-LM paper for questions or comments on the paper/model.
"},{"location":"models/utrlm/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm","title":"multimolecule.models.utrlm","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
alphabet (Alphabet | str | List[str] | None, default None): Alphabet to use for tokenization. If None, the standard RNA alphabet will be used. If a string, it should correspond to the name of a predefined alphabet; the options include standard, extended, streamline, and nucleobase. If an alphabet or a list of characters, that specific alphabet will be used.
nmers (int, default 1): Size of kmer to tokenize.
codon (bool, default False): Whether to tokenize into codons.
replace_T_with_U (bool, default True): Whether to replace T with U.
do_upper_case (bool, default True): Whether to convert input to uppercase.
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig","title":"UtrLmConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a UtrLmModel
. It is used to instantiate a UTR-LM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the UTR-LM a96123155/UTR-LM architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
vocab_size (int, default 26): Vocabulary size of the UTR-LM model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling UtrLmModel.
hidden_size (int, default 128): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, default 6): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, default 16): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, default 512): Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
hidden_dropout (float, default 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (float, default 0.1): The dropout ratio for the attention probabilities.
max_position_embeddings (int, default 1026): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
initializer_range (float, default 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, default 1e-12): The epsilon used by the layer normalization layers.
position_embedding_type (str, default 'rotary'): Type of position embedding. Choose one of \"absolute\", \"relative_key\", \"relative_key_query\", \"rotary\". For positional embeddings use \"absolute\". For more information on \"relative_key\", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
is_decoder (bool, default False): Whether the model is used as a decoder or not. If False, the model is used as an encoder.
use_cache (bool, default True): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
emb_layer_norm_before (bool, default False): Whether to apply layer normalization after embeddings but before the main stem of the network.
token_dropout (bool, default False): When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
Examples:
Python Console Session>>> from multimolecule import UtrLmModel, UtrLmConfig\n>>> # Initializing a UTR-LM multimolecule/utrlm style configuration\n>>> configuration = UtrLmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/utrlm style configuration\n>>> model = UtrLmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/utrlm/configuration_utrlm.py
Pythonclass UtrLmConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`UtrLmModel`][multimolecule.models.UtrLmModel].\n It is used to instantiate a UTR-LM model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the UTR-LM\n [a96123155/UTR-LM](https://github.com/a96123155/UTR-LM) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the UTR-LM model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`UtrLmModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n position_embedding_type:\n Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n is_decoder:\n Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n use_cache:\n Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n relevant if `config.is_decoder=True`.\n emb_layer_norm_before:\n Whether to apply layer normalization after embeddings but before the main stem of the network.\n token_dropout:\n When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n Examples:\n >>> from multimolecule import UtrLmModel, UtrLmConfig\n >>> # Initializing a UTR-LM multimolecule/utrlm style configuration\n >>> configuration = UtrLmConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/utrlm style configuration\n >>> model = UtrLmModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"utrlm\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n hidden_size: int = 128,\n num_hidden_layers: int = 6,\n num_attention_heads: int = 16,\n intermediate_size: int = 512,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1026,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"rotary\",\n is_decoder: bool = False,\n use_cache: bool = True,\n emb_layer_norm_before: bool = False,\n token_dropout: bool = False,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n ss_head: HeadConfig | None = None,\n mfe_head: HeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.emb_layer_norm_before = emb_layer_norm_before\n self.token_dropout = token_dropout\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n self.ss_head = HeadConfig(**ss_head) if ss_head is not None else None\n self.mfe_head = HeadConfig(**mfe_head) if mfe_head is not None else None\n
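Any of the documented arguments can be overridden when instantiating the configuration. The following is a sketch with arbitrary, smaller hyperparameter values for quick experiments; it is not a recommended or pre-trained setting:
Pythonfrom multimolecule import UtrLmConfig, UtrLmModel\n\n\nconfig = UtrLmConfig(hidden_size=64, num_hidden_layers=2, num_attention_heads=8, intermediate_size=256)\nmodel = UtrLmModel(config)  # randomly initialised weights\nprint(model.config.hidden_size)\n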
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(position_embedding_type)","title":"position_embedding_type
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(is_decoder)","title":"is_decoder
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(use_cache)","title":"use_cache
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(emb_layer_norm_before)","title":"emb_layer_norm_before
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(token_dropout)","title":"token_dropout
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForContactPrediction","title":"UtrLmForContactPrediction","text":" Bases: UtrLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrLmConfig, UtrLmForContactPrediction, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmForContactPrediction(UtrLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrLmConfig, UtrLmForContactPrediction, RnaTokenizer\n >>> config = UtrLmConfig()\n >>> model = UtrLmForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: UtrLmConfig):\n super().__init__(config)\n self.utrlm = UtrLmModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrlm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForMaskedLM","title":"UtrLmForMaskedLM","text":" Bases: UtrLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmForMaskedLM(UtrLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n >>> config = UtrLmConfig()\n >>> model = UtrLmForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: UtrLmConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `UtrLmForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.utrlm = UtrLmModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrlm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForPreTraining","title":"UtrLmForPreTraining","text":" Bases: UtrLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"contact_map\"].shape\ntorch.Size([1, 5, 5, 1])\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmForPreTraining(UtrLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n >>> config = UtrLmConfig()\n >>> model = UtrLmForPreTraining(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<AddBackward0>)\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"contact_map\"].shape\n torch.Size([1, 5, 5, 1])\n \"\"\"\n\n _tied_weights_keys = [\n \"lm_head.decoder.weight\",\n \"lm_head.decoder.bias\",\n \"pretrain.predictions.decoder.weight\",\n \"pretrain.predictions.decoder.bias\",\n \"pretrain.predictions_ss.decoder.weight\",\n \"pretrain.predictions_ss.decoder.bias\",\n ]\n\n def __init__(self, config: UtrLmConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `UtrLmForPreTraining` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.utrlm = UtrLmModel(config, add_pooling_layer=False)\n self.pretrain = UtrLmPreTrainingHeads(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_output_embeddings(self):\n return self.pretrain.predictions.decoder\n\n def set_output_embeddings(self, embeddings):\n self.pretrain.predictions.decoder = embeddings\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels_mlm: Tensor | None = None,\n labels_contact: Tensor | None = None,\n labels_ss: Tensor | None = None,\n labels_mfe: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | UtrLmForPreTrainingOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrlm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n total_loss, logits, contact_map, secondary_structure, minimum_free_energy = self.pretrain(\n outputs,\n attention_mask,\n input_ids,\n labels_mlm=labels_mlm,\n labels_contact=labels_contact,\n labels_ss=labels_ss,\n labels_mfe=labels_mfe,\n )\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((total_loss,) + output) if total_loss is not None else output\n\n return UtrLmForPreTrainingOutput(\n loss=total_loss,\n logits=logits,\n contact_map=contact_map,\n secondary_structure=secondary_structure,\n minimum_free_energy=minimum_free_energy,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForSequencePrediction","title":"UtrLmForSequencePrediction","text":" Bases: UtrLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmForSequencePrediction(UtrLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n >>> config = UtrLmConfig()\n >>> model = UtrLmForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: UtrLmConfig):\n super().__init__(config)\n self.utrlm = UtrLmModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrlm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForTokenPrediction","title":"UtrLmForTokenPrediction","text":" Bases: UtrLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmForTokenPrediction(UtrLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n >>> config = UtrLmConfig()\n >>> model = UtrLmForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: UtrLmConfig):\n super().__init__(config)\n self.utrlm = UtrLmModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrlm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel","title":"UtrLmModel","text":" Bases: UtrLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 128])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 128])\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmModel(UtrLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n >>> config = UtrLmConfig()\n >>> model = UtrLmModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 128])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 128])\n \"\"\"\n\n def __init__(self, config: UtrLmConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = UtrLmEmbeddings(config)\n self.encoder = UtrLmEncoder(config)\n self.pooler = UtrLmPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
encoder_hidden_states (Tensor | None, default None): Shape (batch_size, sequence_length, hidden_size). Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
encoder_attention_mask (Tensor | None, default None): Shape (batch_size, sequence_length). Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default None): Tuple of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head). Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
use_cache (bool | None, default None): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmPreTrainedModel","title":"UtrLmPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = UtrLmConfig\n base_model_prefix = \"utrlm\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"UtrLmLayer\", \"UtrLmEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
"},{"location":"module/","title":"module","text":"module
provides a collection of pre-defined modules for users to implement their own architectures.
MultiMolecule is built upon the Transformers ecosystem, embracing a similar design philosophy: Don\u2019t Repeat Yourself. We follow the single model file policy
where each model under the models
package contains one and only one modeling.py
file that describes the network design.
The module
package is intended for simple, reusable modules that are consistent across multiple models. This approach minimizes code duplication and promotes clean, maintainable code.
module
package includes components that are commonly used across different models, such as the SequencePredictionHead
. This reduces redundancy and simplifies the development process.modeling.py
.SequencePredictionHead
, TokenPredictionHead
, and ContactPredictionHead
.SinusoidalEmbedding
and RotaryEmbedding
.embeddings
provide a collection of pre-defined positional embeddings.
Bases: Module
Rotary position embeddings based on those in RoFormer.
Queries and keys are transformed by rotation matrices that depend on their relative positions.
Cache: The inverse frequency buffer is cached and updated only when the sequence length changes or the device changes.
Sequence Length: Rotary Embedding is independent of the sequence length and can be used for sequences of any length.
Source code in multimolecule/module/embeddings/rotary.py
Python@PositionEmbeddingRegistry.register(\"rotary\")\n@PositionEmbeddingRegistryHF.register(\"rotary\")\nclass RotaryEmbedding(nn.Module):\n \"\"\"\n Rotary position embeddings based on those in\n [RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer).\n\n Query and keys are transformed by rotation\n matrices which depend on their relative positions.\n\n Tip: **Cache**\n The inverse frequency buffer is cached and updated only when the sequence length changes or the device changes.\n\n Success: **Sequence Length**\n Rotary Embedding is irrespective of the sequence length and can be used for any sequence length.\n \"\"\"\n\n def __init__(self, embedding_dim: int):\n super().__init__()\n # Generate and save the inverse frequency buffer (non trainable)\n inv_freq = 1.0 / (10000 ** (torch.arange(0, embedding_dim, 2, dtype=torch.int64).float() / embedding_dim))\n self.register_buffer(\"inv_freq\", inv_freq)\n\n self._seq_len_cached = None\n self._cos_cached = None\n self._sin_cached = None\n\n def forward(self, q: Tensor, k: Tensor) -> Tuple[Tensor, Tensor]:\n self._update_cos_sin_tables(k, seq_dimension=-2)\n\n return (self.apply_rotary_pos_emb(q), self.apply_rotary_pos_emb(k))\n\n def _update_cos_sin_tables(self, x, seq_dimension=2):\n seq_len = x.shape[seq_dimension]\n\n # Reset the tables if the sequence length has changed,\n # or if we're on a new device (possibly due to tracing for instance)\n if seq_len != self._seq_len_cached or self._cos_cached.device != x.device:\n self._seq_len_cached = seq_len\n t = torch.arange(x.shape[seq_dimension], device=x.device).type_as(self.inv_freq)\n freqs = torch.outer(t, self.inv_freq)\n emb = torch.cat((freqs, freqs), dim=-1).to(x.device)\n\n self._cos_cached = emb.cos()[None, None, :, :]\n self._sin_cached = emb.sin()[None, None, :, :]\n\n return self._cos_cached, self._sin_cached\n\n def apply_rotary_pos_emb(self, x):\n cos = self._cos_cached[:, :, : x.shape[-2], :]\n sin = self._sin_cached[:, :, : x.shape[-2], :]\n\n return (x * cos) + (self.rotate_half(x) * sin)\n\n @staticmethod\n def rotate_half(x):\n x1, x2 = x.chunk(2, dim=-1)\n return torch.cat((-x2, x1), dim=-1)\n
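A minimal usage sketch (import path assumed from the anchors above): RotaryEmbedding is applied to query and key tensors of shape (batch, heads, seq_len, head_dim) and returns rotated tensors of the same shape.
Pythonimport torch\n\nfrom multimolecule.module.embeddings import RotaryEmbedding  # import path assumed from the docs above\n\nrotary = RotaryEmbedding(embedding_dim=64)  # per-head dimension\nq = torch.randn(2, 8, 128, 64)  # (batch, heads, seq_len, head_dim)\nk = torch.randn(2, 8, 128, 64)\nq_rot, k_rot = rotary(q, k)  # rotated queries and keys, same shapes as the inputs\nprint(q_rot.shape, k_rot.shape)\n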
"},{"location":"module/embeddings/#multimolecule.module.embeddings.SinusoidalEmbedding","title":"SinusoidalEmbedding","text":" Bases: Embedding
Sinusoidal positional embeddings for inputs with any length.
Freezing: The embeddings are frozen and cannot be trained. They will not be saved in the model\u2019s state_dict.
Padding Idx: Padding symbols are ignored if the padding_idx is specified.
Sequence Length: These embeddings get automatically extended in forward if more positions are needed.
Source code in multimolecule/module/embeddings/sinusoidal.py
Python@PositionEmbeddingRegistry.register(\"sinusoidal\")\n@PositionEmbeddingRegistryHF.register(\"sinusoidal\")\nclass SinusoidalEmbedding(nn.Embedding):\n r\"\"\"\n Sinusoidal positional embeddings for inputs with any length.\n\n Note: **Freezing**\n The embeddings are frozen and cannot be trained.\n They will not be saved in the model's state_dict.\n\n Tip: **Padding Idx**\n Padding symbols are ignored if the padding_idx is specified.\n\n Success: **Sequence Length**\n These embeddings get automatically extended in forward if more positions is needed.\n \"\"\"\n\n _is_hf_initialized = True\n\n def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: int | None = None, bias: int = 0):\n weight = self.get_embedding(num_embeddings, embedding_dim, padding_idx)\n super().__init__(num_embeddings, embedding_dim, padding_idx, _weight=weight.detach(), _freeze=True)\n self.bias = bias\n\n def update_weight(self, num_embeddings: int, embedding_dim: int, padding_idx: int | None = None):\n weight = self.get_embedding(num_embeddings, embedding_dim, padding_idx).to(\n dtype=self.weight.dtype, device=self.weight.device # type: ignore[has-type]\n )\n self.weight = nn.Parameter(weight.detach(), requires_grad=False)\n\n @staticmethod\n def get_embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None) -> Tensor:\n \"\"\"\n Build sinusoidal embeddings.\n\n This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of\n \"Attention Is All You Need\".\n \"\"\"\n half_dim = embedding_dim // 2\n emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -(math.log(10000) / (half_dim - 1)))\n emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)\n emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)\n if embedding_dim % 2 == 1:\n # zero pad\n emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)\n if padding_idx is not None:\n emb[padding_idx, :] = 0\n return emb\n\n @staticmethod\n def get_position_ids(tensor, padding_idx: int | None = None):\n \"\"\"\n Replace non-padding symbols with their position numbers.\n\n Position numbers begin at padding_idx+1. Padding symbols are ignored.\n \"\"\"\n # The series of casts and type-conversions here are carefully\n # balanced to both work with ONNX export and XLA. In particular XLA\n # prefers ints, cumsum defaults to output longs, and ONNX doesn't know\n # how to handle the dtype kwarg in cumsum.\n if padding_idx is None:\n return torch.cumsum(tensor.new_ones(tensor.size(1)).long(), dim=0) - 1\n mask = tensor.ne(padding_idx).int()\n return (torch.cumsum(mask, dim=1).type_as(mask) * mask).long() + padding_idx\n\n def forward(self, input_ids: Tensor) -> Tensor:\n _, seq_len = input_ids.shape[:2]\n # expand embeddings if needed\n max_pos = seq_len + self.bias + 1\n if self.padding_idx is not None:\n max_pos += self.padding_idx\n if max_pos > self.weight.size(0):\n self.update_weight(max_pos, self.embedding_dim, self.padding_idx)\n # Need to shift the position ids by the padding index\n position_ids = self.get_position_ids(input_ids, self.padding_idx) + self.bias\n return super().forward(position_ids)\n\n def state_dict(self, destination=None, prefix=\"\", keep_vars=False):\n return {}\n\n def load_state_dict(self, *args, state_dict, strict=True):\n return\n\n def _load_from_state_dict(\n self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n ):\n return\n
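A minimal usage sketch (import path assumed from the anchors above): positions for padding tokens map to the zeroed padding_idx row, and the table is extended on the fly for longer sequences.
Pythonimport torch\n\nfrom multimolecule.module.embeddings import SinusoidalEmbedding  # import path assumed from the docs above\n\nembedding = SinusoidalEmbedding(num_embeddings=32, embedding_dim=16, padding_idx=0)\ninput_ids = torch.tensor([[5, 7, 9, 0, 0]])  # trailing zeros are padding\npositions = embedding(input_ids)  # (1, 5, 16); rows for padding tokens are all zeros\nlonger = embedding(torch.randint(1, 10, (1, 64)))  # longer than num_embeddings: the table is extended on the fly\nprint(positions.shape, longer.shape)\n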
"},{"location":"module/embeddings/#multimolecule.module.embeddings.SinusoidalEmbedding.get_embedding","title":"get_embedding staticmethod
","text":"Pythonget_embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None) -> Tensor\n
Build sinusoidal embeddings.
This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of \u201cAttention Is All You Need\u201d.
Source code in multimolecule/module/embeddings/sinusoidal.py
Python@staticmethod\ndef get_embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None) -> Tensor:\n \"\"\"\n Build sinusoidal embeddings.\n\n This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of\n \"Attention Is All You Need\".\n \"\"\"\n half_dim = embedding_dim // 2\n emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -(math.log(10000) / (half_dim - 1)))\n emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)\n emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)\n if embedding_dim % 2 == 1:\n # zero pad\n emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)\n if padding_idx is not None:\n emb[padding_idx, :] = 0\n return emb\n
"},{"location":"module/embeddings/#multimolecule.module.embeddings.SinusoidalEmbedding.get_position_ids","title":"get_position_ids staticmethod
","text":"Pythonget_position_ids(tensor, padding_idx: int | None = None)\n
Replace non-padding symbols with their position numbers.
Position numbers begin at padding_idx+1. Padding symbols are ignored.
Source code in multimolecule/module/embeddings/sinusoidal.py
Python@staticmethod\ndef get_position_ids(tensor, padding_idx: int | None = None):\n \"\"\"\n Replace non-padding symbols with their position numbers.\n\n Position numbers begin at padding_idx+1. Padding symbols are ignored.\n \"\"\"\n # The series of casts and type-conversions here are carefully\n # balanced to both work with ONNX export and XLA. In particular XLA\n # prefers ints, cumsum defaults to output longs, and ONNX doesn't know\n # how to handle the dtype kwarg in cumsum.\n if padding_idx is None:\n return torch.cumsum(tensor.new_ones(tensor.size(1)).long(), dim=0) - 1\n mask = tensor.ne(padding_idx).int()\n return (torch.cumsum(mask, dim=1).type_as(mask) * mask).long() + padding_idx\n
"},{"location":"module/heads/","title":"heads","text":"heads
provide a collection of pre-defined prediction heads.
heads
take in either a ModelOutput
, a dict
, or a tuple
as input. They automatically look for the model output required for prediction and process it accordingly.
Some prediction heads, such as ContactPredictionHead
, may require additional information like the attention_mask
or the input_ids
. These additional inputs can be passed as positional or keyword arguments.
Note that heads
use the same ModelOutput
conventions as the Transformers library. If the model output is a tuple
, we consider the first element as the last_hidden_state
, the second element as the pooler_output
, and the last element as the attention_map
. It is the user\u2019s responsibility to ensure that the model output is correctly formatted.
If the model output is a ModelOutput
or a dict
, the heads
will look for the HeadConfig.output_name
from the model output. You can specify the output_name
in the HeadConfig
to ensure that the heads
can correctly locate the required tensor.
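The lookup convention can be sketched as follows; resolve_head_input is a hypothetical helper shown only for illustration, not part of the library (the actual logic lives inside each head\u2019s forward):
Pythonfrom collections.abc import Mapping\n\n\ndef resolve_head_input(outputs, output_name: str = \"last_hidden_state\"):\n    # Dict-like outputs (ModelOutput or plain dict): look the tensor up by name.\n    if isinstance(outputs, Mapping):\n        return outputs[output_name]\n    # Tuple outputs follow the positional convention described above.\n    if isinstance(outputs, tuple):\n        positions = {\"last_hidden_state\": 0, \"pooler_output\": 1, \"attentions\": -1}\n        return outputs[positions[output_name]]\n    raise TypeError(f\"Unsupported type for outputs: {type(outputs)}\")\n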
Bases: BaseHeadConfig
Configuration class for a prediction head.
Parameters:
Name Type Description DefaultNumber of labels to use in the last layer added to the model, typically for a classification task.
Head should look for Config.num_labels
if is None
.
Problem type for XxxForYyyPrediction
models. Can be one of \"binary\"
, \"regression\"
, \"multiclass\"
or \"multilabel\"
.
Head should look for Config.problem_type
if is None
.
Dimensionality of the encoder layers and the pooler layer.
Head should look for Config.hidden_size
if is None
.
The dropout ratio for the hidden states.
requiredThe transform operation applied to hidden states.
requiredThe activation function of transform applied to hidden states.
requiredWhether to apply bias to the final prediction layer.
requiredThe activation function of the final prediction output.
requiredThe epsilon used by the layer normalization layers.
requiredThe name of the tensor required in model outputs.
If is None
, will use the default output name of the corresponding head.
The type of the head in the model.
This is used by MultiMoleculeModel
to construct heads.
multimolecule/module/heads/config.py
Pythonclass HeadConfig(BaseHeadConfig):\n r\"\"\"\n Configuration class for a prediction head.\n\n Args:\n num_labels:\n Number of labels to use in the last layer added to the model, typically for a classification task.\n\n Head should look for [`Config.num_labels`][multimolecule.PreTrainedConfig] if is `None`.\n problem_type:\n Problem type for `XxxForYyyPrediction` models. Can be one of `\"binary\"`, `\"regression\"`,\n `\"multiclass\"` or `\"multilabel\"`.\n\n Head should look for [`Config.problem_type`][multimolecule.PreTrainedConfig] if is `None`.\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n\n Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n dropout:\n The dropout ratio for the hidden states.\n transform:\n The transform operation applied to hidden states.\n transform_act:\n The activation function of transform applied to hidden states.\n bias:\n Whether to apply bias to the final prediction layer.\n act:\n The activation function of the final prediction output.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n output_name:\n The name of the tensor required in model outputs.\n\n If is `None`, will use the default output name of the corresponding head.\n type:\n The type of the head in the model.\n\n This is used by [`MultiMoleculeModel`][multimolecule.MultiMoleculeModel] to construct heads.\n \"\"\"\n\n num_labels: Optional[int] = None\n problem_type: Optional[str] = None\n hidden_size: Optional[int] = None\n dropout: float = 0.0\n transform: Optional[str] = None\n transform_act: Optional[str] = \"gelu\"\n bias: bool = True\n act: Optional[str] = None\n layer_norm_eps: float = 1e-12\n output_name: Optional[str] = None\n type: Optional[str] = None\n
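For illustration, a sequence-level binary classification head could be configured roughly as follows (a sketch assuming HeadConfig accepts keyword arguments for the fields listed above; unset fields fall back to the model configuration):
Pythonfrom multimolecule.module.heads.config import HeadConfig  # import path assumed from the docs above\n\nhead_config = HeadConfig(\n    num_labels=2,\n    problem_type=\"binary\",\n    dropout=0.1,\n    transform=\"nonlinear\",\n    output_name=\"pooler_output\",\n    type=\"sequence\",\n)\n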
"},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(num_labels)","title":"num_labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(problem_type)","title":"problem_type
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(dropout)","title":"dropout
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(transform)","title":"transform
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(transform_act)","title":"transform_act
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(bias)","title":"bias
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(act)","title":"act
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(type)","title":"type
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig","title":"MaskedLMHeadConfig","text":" Bases: BaseHeadConfig
Configuration class for a Masked Language Modeling head.
Parameters:
Name Type Description DefaultDimensionality of the encoder layers and the pooler layer.
Head should look for Config.hidden_size
if is None
.
The dropout ratio for the hidden states.
requiredThe transform operation applied to hidden states.
requiredThe activation function of transform applied to hidden states.
requiredWhether to apply bias to the final prediction layer.
requiredThe activation function of the final prediction output.
requiredThe epsilon used by the layer normalization layers.
requiredThe name of the tensor required in model outputs.
If is None
, will use the default output name of the corresponding head.
multimolecule/module/heads/config.py
Pythonclass MaskedLMHeadConfig(BaseHeadConfig):\n r\"\"\"\n Configuration class for a Masked Language Modeling head.\n\n Args:\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n\n Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n dropout:\n The dropout ratio for the hidden states.\n transform:\n The transform operation applied to hidden states.\n transform_act:\n The activation function of transform applied to hidden states.\n bias:\n Whether to apply bias to the final prediction layer.\n act:\n The activation function of the final prediction output.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n output_name:\n The name of the tensor required in model outputs.\n\n If is `None`, will use the default output name of the corresponding head.\n \"\"\"\n\n hidden_size: Optional[int] = None\n dropout: float = 0.0\n transform: Optional[str] = \"nonlinear\"\n transform_act: Optional[str] = \"gelu\"\n bias: bool = True\n act: Optional[str] = None\n layer_norm_eps: float = 1e-12\n output_name: Optional[str] = None\n
"},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(dropout)","title":"dropout
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(transform)","title":"transform
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(transform_act)","title":"transform_act
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(bias)","title":"bias
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(act)","title":"act
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence","title":"multimolecule.module.heads.sequence","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead","title":"SequencePredictionHead","text":" Bases: PredictionHead
Head for sequence-level tasks.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/sequence.py
Python@HeadRegistry.register(\"sequence\")\nclass SequencePredictionHead(PredictionHead):\n r\"\"\"\n Head for tasks in sequence-level.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n output_name: str = \"pooler_output\"\n r\"\"\"The default output to use for the head.\"\"\"\n\n def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n super().__init__(config, head_config)\n if head_config is not None and head_config.output_name is not None:\n self.output_name = head_config.output_name\n\n def forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Tuple[Tensor, ...],\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n ) -> HeadOutput:\n r\"\"\"\n Forward pass of the SequencePredictionHead.\n\n Args:\n outputs: The outputs of the model.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[1]\n return super().forward(output, labels, **kwargs)\n
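A minimal usage sketch (import paths and configuration defaults assumed): any dict-like model output that contains pooler_output works as input.
Pythonimport torch\n\nfrom multimolecule.models import UtrLmConfig  # assumed import path\nfrom multimolecule.module.heads.sequence import SequencePredictionHead\n\nconfig = UtrLmConfig(num_labels=2)\nhead = SequencePredictionHead(config)\n\noutputs = {\"pooler_output\": torch.randn(4, config.hidden_size)}  # stand-in for a model output\nhead_output = head(outputs)  # looks up pooler_output by default\nprint(head_output[0].shape)  # (4, 2); the first element is assumed to be the logits tensor\n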
"},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.output_name","title":"output_name class-attribute
instance-attribute
","text":"Pythonoutput_name: str = 'pooler_output'\n
The default output to use for the head.
"},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward","title":"forward","text":"Pythonforward(outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n
Forward pass of the SequencePredictionHead.
Parameters:
Name Type Description DefaultModelOutput | Tuple[Tensor, ...]
The outputs of the model.
requiredTensor | None
The labels for the head.
None
str | None
The name of the output to use. Defaults to self.output_name
.
None
Source code in multimolecule/module/heads/sequence.py
Pythondef forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Tuple[Tensor, ...],\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n) -> HeadOutput:\n r\"\"\"\n Forward pass of the SequencePredictionHead.\n\n Args:\n outputs: The outputs of the model.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[1]\n return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward(outputs)","title":"outputs
","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.token","title":"multimolecule.module.heads.token","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead","title":"TokenPredictionHead","text":" Bases: PredictionHead
Head for token-level tasks.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/token.py
Python@HeadRegistry.token.register(\"single\", default=True)\n@TokenHeadRegistryHF.register(\"single\", default=True)\nclass TokenPredictionHead(PredictionHead):\n r\"\"\"\n Head for tasks in token-level.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n output_name: str = \"last_hidden_state\"\n r\"\"\"The default output to use for the head.\"\"\"\n\n def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n super().__init__(config, head_config)\n if head_config is not None and head_config.output_name is not None:\n self.output_name = head_config.output_name\n\n def forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n ) -> HeadOutput:\n r\"\"\"\n Forward pass of the TokenPredictionHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n output = output * attention_mask.unsqueeze(-1)\n output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n return super().forward(output, labels, **kwargs)\n
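A usage sketch with a full model (import paths and configuration defaults assumed): the head reads last_hidden_state, masks padding, strips special tokens, and predicts one set of labels per remaining token.
Pythonimport torch\n\nfrom multimolecule.models import UtrLmConfig, UtrLmModel  # assumed import paths\nfrom multimolecule.module.heads.token import TokenPredictionHead\n\nconfig = UtrLmConfig(num_labels=2)\nmodel = UtrLmModel(config)\nhead = TokenPredictionHead(config)\n\ninput_ids = torch.randint(5, 20, (2, 16))  # toy token ids\nattention_mask = torch.ones_like(input_ids)\noutputs = model(input_ids, attention_mask=attention_mask)\nhead_output = head(outputs, attention_mask=attention_mask, input_ids=input_ids)\n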
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.output_name","title":"output_name class-attribute
instance-attribute
","text":"Pythonoutput_name: str = 'last_hidden_state'\n
The default output to use for the head.
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward","title":"forward","text":"Pythonforward(outputs: ModelOutput | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n
Forward pass of the TokenPredictionHead.
Parameters:
Name Type Description DefaultModelOutput | Tuple[Tensor, ...]
The outputs of the model.
requiredTensor | None
The attention mask for the inputs.
None
NestedTensor | Tensor | None
The input ids for the inputs.
None
Tensor | None
The labels for the head.
None
str | None
The name of the output to use. Defaults to self.output_name
.
None
Source code in multimolecule/module/heads/token.py
Pythondef forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n) -> HeadOutput:\n r\"\"\"\n Forward pass of the TokenPredictionHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n output = output * attention_mask.unsqueeze(-1)\n output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(outputs)","title":"outputs
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(attention_mask)","title":"attention_mask
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(input_ids)","title":"input_ids
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead","title":"TokenKMerHead","text":" Bases: PredictionHead
Head for token-level tasks with k-mer inputs.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/token.py
Python@HeadRegistry.register(\"token.kmer\")\n@TokenHeadRegistryHF.register(\"kmer\")\nclass TokenKMerHead(PredictionHead):\n r\"\"\"\n Head for tasks in token-level with kmer inputs.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n output_name: str = \"last_hidden_state\"\n r\"\"\"The default output to use for the head.\"\"\"\n\n def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n super().__init__(config, head_config)\n self.nmers = config.nmers\n if head_config is not None and head_config.output_name is not None:\n self.output_name = head_config.output_name\n # Do not pass bos_token_id and eos_token_id to unfold_kmer_embeddings\n # As they will be removed in preprocess\n self.unfold_kmer_embeddings = partial(unfold_kmer_embeddings, nmers=self.nmers)\n\n def forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n ) -> HeadOutput:\n r\"\"\"\n Forward pass of the TokenKMerHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n output = output * attention_mask.unsqueeze(-1)\n output, attention_mask, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n output = self.unfold_kmer_embeddings(output, attention_mask)\n return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.output_name","title":"output_name class-attribute
instance-attribute
","text":"Pythonoutput_name: str = 'last_hidden_state'\n
The default output to use for the head.
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward","title":"forward","text":"Pythonforward(outputs: ModelOutput | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n
Forward pass of the TokenKMerHead.
Parameters:
Name Type Description DefaultModelOutput | Tuple[Tensor, ...]
The outputs of the model.
requiredTensor | None
The attention mask for the inputs.
None
NestedTensor | Tensor | None
The input ids for the inputs.
None
Tensor | None
The labels for the head.
None
str | None
The name of the output to use. Defaults to self.output_name
.
None
Source code in multimolecule/module/heads/token.py
Pythondef forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n) -> HeadOutput:\n r\"\"\"\n Forward pass of the TokenKMerHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n output = output * attention_mask.unsqueeze(-1)\n output, attention_mask, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n output = self.unfold_kmer_embeddings(output, attention_mask)\n return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(outputs)","title":"outputs
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(attention_mask)","title":"attention_mask
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(input_ids)","title":"input_ids
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact","title":"multimolecule.module.heads.contact","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead","title":"ContactPredictionHead","text":" Bases: PredictionHead
Head for contact-level tasks.
Performs symmetrization and average product correction.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/contact.py
Python@HeadRegistry.contact.register(\"attention\")\nclass ContactPredictionHead(PredictionHead):\n r\"\"\"\n Head for tasks in contact-level.\n\n Performs symmetrization, and average product correct.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n output_name: str = \"attentions\"\n r\"\"\"The default output to use for the head.\"\"\"\n\n requires_attention: bool = True\n\n def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n super().__init__(config, head_config)\n self.config.hidden_size = config.num_hidden_layers * config.num_attention_heads\n num_layers = self.config.get(\"num_layers\", 16)\n num_channels = self.config.get(\"num_channels\", self.config.hidden_size // 10) # type: ignore[operator]\n block = self.config.get(\"block\", \"auto\")\n self.decoder = ResNet(\n num_layers=num_layers,\n hidden_size=self.config.hidden_size, # type: ignore[arg-type]\n block=block,\n num_channels=num_channels,\n num_labels=self.num_labels,\n )\n if head_config is not None and head_config.output_name is not None:\n self.output_name = head_config.output_name\n\n def forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n ) -> HeadOutput:\n r\"\"\"\n Forward pass of the ContactPredictionHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[-1]\n attentions = torch.stack(output, 1)\n\n # In the original model, attentions for padding tokens are completely zeroed out.\n # This makes no difference most of the time because the other tokens won't attend to them,\n # but it does for the contact prediction task, which takes attentions as input,\n # so we have to mimic that here.\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n attention_mask = attention_mask.unsqueeze(1) * attention_mask.unsqueeze(2)\n attentions = attentions * attention_mask[:, None, None, :, :]\n\n # remove cls token attentions\n if self.bos_token_id is not None:\n attentions = attentions[..., 1:, 1:]\n attention_mask = attention_mask[..., 1:]\n if input_ids is not None:\n input_ids = input_ids[..., 1:]\n # remove eos token attentions\n if self.eos_token_id is not None:\n if input_ids is not None:\n eos_mask = input_ids.ne(self.eos_token_id).to(attentions)\n else:\n last_valid_indices = attention_mask.sum(dim=-1)\n seq_length = attention_mask.size(-1)\n eos_mask = torch.arange(seq_length, device=attentions.device).unsqueeze(0) == last_valid_indices\n eos_mask = eos_mask.unsqueeze(1) * eos_mask.unsqueeze(2)\n attentions = attentions * eos_mask[:, None, None, :, :]\n attentions = attentions[..., :-1, :-1]\n\n # features: batch x channels x input_ids x input_ids (symmetric)\n batch_size, layers, heads, seqlen, _ = attentions.size()\n attentions = attentions.view(batch_size, layers * heads, seqlen, seqlen)\n attentions = 
attentions.to(self.decoder.proj.weight.device)\n attentions = average_product_correct(symmetrize(attentions))\n attentions = attentions.permute(0, 2, 3, 1).squeeze(3)\n\n return super().forward(attentions, labels, **kwargs)\n
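A usage sketch (import paths and configuration defaults assumed): because this head consumes attention maps, the model must be called with output_attentions=True.
Pythonimport torch\n\nfrom multimolecule.models import UtrLmConfig, UtrLmModel  # assumed import paths\nfrom multimolecule.module.heads.contact import ContactPredictionHead\n\nconfig = UtrLmConfig()\nmodel = UtrLmModel(config)\nhead = ContactPredictionHead(config)\n\ninput_ids = torch.randint(5, 20, (2, 16))\nattention_mask = torch.ones_like(input_ids)\noutputs = model(input_ids, attention_mask=attention_mask, output_attentions=True)\n# The head stacks the per-layer attention maps, removes special tokens, symmetrizes and applies\n# average product correction, then decodes a (batch, length, length, num_labels) contact map.\ncontact_map = head(outputs, attention_mask=attention_mask, input_ids=input_ids)\n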
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.output_name","title":"output_name class-attribute
instance-attribute
","text":"Pythonoutput_name: str = 'attentions'\n
The default output to use for the head.
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward","title":"forward","text":"Pythonforward(outputs: ModelOutput | Mapping | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n
Forward pass of the ContactPredictionHead.
Parameters:
Name Type Description DefaultModelOutput | Mapping | Tuple[Tensor, ...]
The outputs of the model.
requiredTensor | None
The attention mask for the inputs.
None
NestedTensor | Tensor | None
The input ids for the inputs.
None
Tensor | None
The labels for the head.
None
str | None
The name of the output to use. Defaults to self.output_name
.
None
Source code in multimolecule/module/heads/contact.py
Pythondef forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n) -> HeadOutput:\n r\"\"\"\n Forward pass of the ContactPredictionHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[-1]\n attentions = torch.stack(output, 1)\n\n # In the original model, attentions for padding tokens are completely zeroed out.\n # This makes no difference most of the time because the other tokens won't attend to them,\n # but it does for the contact prediction task, which takes attentions as input,\n # so we have to mimic that here.\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n attention_mask = attention_mask.unsqueeze(1) * attention_mask.unsqueeze(2)\n attentions = attentions * attention_mask[:, None, None, :, :]\n\n # remove cls token attentions\n if self.bos_token_id is not None:\n attentions = attentions[..., 1:, 1:]\n attention_mask = attention_mask[..., 1:]\n if input_ids is not None:\n input_ids = input_ids[..., 1:]\n # remove eos token attentions\n if self.eos_token_id is not None:\n if input_ids is not None:\n eos_mask = input_ids.ne(self.eos_token_id).to(attentions)\n else:\n last_valid_indices = attention_mask.sum(dim=-1)\n seq_length = attention_mask.size(-1)\n eos_mask = torch.arange(seq_length, device=attentions.device).unsqueeze(0) == last_valid_indices\n eos_mask = eos_mask.unsqueeze(1) * eos_mask.unsqueeze(2)\n attentions = attentions * eos_mask[:, None, None, :, :]\n attentions = attentions[..., :-1, :-1]\n\n # features: batch x channels x input_ids x input_ids (symmetric)\n batch_size, layers, heads, seqlen, _ = attentions.size()\n attentions = attentions.view(batch_size, layers * heads, seqlen, seqlen)\n attentions = attentions.to(self.decoder.proj.weight.device)\n attentions = average_product_correct(symmetrize(attentions))\n attentions = attentions.permute(0, 2, 3, 1).squeeze(3)\n\n return super().forward(attentions, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(outputs)","title":"outputs
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(attention_mask)","title":"attention_mask
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(input_ids)","title":"input_ids
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead","title":"ContactLogitsHead","text":" Bases: PredictionHead
Head for contact-level tasks.
Performs symmetrization and average product correction.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/contact.py
Python@HeadRegistry.contact.register(\"logits\")\nclass ContactLogitsHead(PredictionHead):\n r\"\"\"\n Head for tasks in contact-level.\n\n Performs symmetrization, and average product correct.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n output_name: str = \"last_hidden_state\"\n r\"\"\"The default output to use for the head.\"\"\"\n\n requires_attention: bool = False\n\n def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n super().__init__(config, head_config)\n num_layers = self.config.get(\"num_layers\", 16)\n num_channels = self.config.get(\"num_channels\", self.config.hidden_size // 10) # type: ignore[operator]\n block = self.config.get(\"block\", \"auto\")\n self.decoder = ResNet(\n num_layers=num_layers,\n hidden_size=self.config.hidden_size, # type: ignore[arg-type]\n block=block,\n num_channels=num_channels,\n num_labels=self.num_labels,\n )\n if head_config is not None and head_config.output_name is not None:\n self.output_name = head_config.output_name\n\n def forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n ) -> HeadOutput:\n r\"\"\"\n Forward pass of the ContactPredictionHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n output = output * attention_mask.unsqueeze(-1)\n output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n # make symmetric contact map\n contact_map = output.unsqueeze(1) * output.unsqueeze(2)\n\n return super().forward(contact_map, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.output_name","title":"output_name class-attribute
instance-attribute
","text":"Pythonoutput_name: str = 'last_hidden_state'\n
The default output to use for the head.
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward","title":"forward","text":"Pythonforward(outputs: ModelOutput | Mapping | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n
Forward pass of the ContactPredictionHead.
Parameters:
Name Type Description DefaultModelOutput | Mapping | Tuple[Tensor, ...]
The outputs of the model.
requiredTensor | None
The attention mask for the inputs.
None
NestedTensor | Tensor | None
The input ids for the inputs.
None
Tensor | None
The labels for the head.
None
str | None
The name of the output to use. Defaults to self.output_name
.
None
Source code in multimolecule/module/heads/contact.py
Pythondef forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n) -> HeadOutput:\n r\"\"\"\n Forward pass of the ContactPredictionHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n output = output * attention_mask.unsqueeze(-1)\n output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n # make symmetric contact map\n contact_map = output.unsqueeze(1) * output.unsqueeze(2)\n\n return super().forward(contact_map, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(outputs)","title":"outputs
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(attention_mask)","title":"attention_mask
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(input_ids)","title":"input_ids
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.symmetrize","title":"symmetrize","text":"Pythonsymmetrize(x)\n
Make layer symmetric in final two dimensions, used for contact prediction.
Source code inmultimolecule/module/heads/contact.py
Pythondef symmetrize(x):\n \"Make layer symmetric in final two dimensions, used for contact prediction.\"\n return x + x.transpose(-1, -2)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.average_product_correct","title":"average_product_correct","text":"Pythonaverage_product_correct(x)\n
Perform average product correct, used for contact prediction.
Source code inmultimolecule/module/heads/contact.py
Pythondef average_product_correct(x):\n \"Perform average product correct, used for contact prediction.\"\n a1 = x.sum(-1, keepdims=True)\n a2 = x.sum(-2, keepdims=True)\n a12 = x.sum((-1, -2), keepdims=True)\n\n avg = a1 * a2\n avg.div_(a12) # in-place to reduce memory\n normalized = x - avg\n return normalized\n
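A small numerical check of the two helpers (import path follows the source location above):
Pythonimport torch\n\nfrom multimolecule.module.heads.contact import average_product_correct, symmetrize\n\nx = torch.rand(1, 4, 8, 8)  # (batch, channels, seq_len, seq_len)\nsym = symmetrize(x)  # equal to its own transpose in the last two dimensions\napc = average_product_correct(sym)  # subtract the expected co-occurrence background\nprint(torch.allclose(sym, sym.transpose(-1, -2)))  # True\nprint(apc.shape)  # torch.Size([1, 4, 8, 8])\n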
"},{"location":"module/heads/#multimolecule.module.heads.pretrain","title":"multimolecule.module.heads.pretrain","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead","title":"MaskedLMHead","text":" Bases: Module
Head for masked language modeling.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredMaskedLMHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/pretrain.py
Python@HeadRegistry.register(\"masked_lm\")\nclass MaskedLMHead(nn.Module):\n r\"\"\"\n Head for masked language modeling.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n output_name: str = \"last_hidden_state\"\n r\"\"\"The default output to use for the head.\"\"\"\n\n def __init__(\n self, config: PreTrainedConfig, weight: Tensor | None = None, head_config: MaskedLMHeadConfig | None = None\n ):\n super().__init__()\n if head_config is None:\n head_config = (config.lm_head if hasattr(config, \"lm_head\") else config.head) or MaskedLMHeadConfig()\n self.config: MaskedLMHeadConfig = head_config\n if self.config.hidden_size is None:\n self.config.hidden_size = config.hidden_size\n self.num_labels = config.vocab_size\n self.dropout = nn.Dropout(self.config.dropout)\n self.transform = HeadTransformRegistryHF.build(self.config)\n self.decoder = nn.Linear(self.config.hidden_size, self.num_labels, bias=False)\n if weight is not None:\n self.decoder.weight = weight\n if self.config.bias:\n self.bias = nn.Parameter(torch.zeros(self.num_labels))\n self.decoder.bias = self.bias\n self.activation = ACT2FN[self.config.act] if self.config.act is not None else None\n if head_config is not None and head_config.output_name is not None:\n self.output_name = head_config.output_name\n\n def forward(\n self, outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None\n ) -> HeadOutput:\n r\"\"\"\n Forward pass of the MaskedLMHead.\n\n Args:\n outputs: The outputs of the model.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n output = self.dropout(output)\n output = self.transform(output)\n output = self.decoder(output)\n if self.activation is not None:\n output = self.activation(output)\n if labels is not None:\n if isinstance(labels, NestedTensor):\n if isinstance(output, Tensor):\n output = labels.nested_like(output, strict=False)\n return HeadOutput(output, F.cross_entropy(output.concat, labels.concat))\n return HeadOutput(output, F.cross_entropy(output.view(-1, self.num_labels), labels.view(-1)))\n return HeadOutput(output)\n
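A minimal usage sketch (import paths and configuration defaults assumed): the head projects last_hidden_state back to vocabulary logits.
Pythonimport torch\n\nfrom multimolecule.models import UtrLmConfig  # assumed import path\nfrom multimolecule.module.heads.pretrain import MaskedLMHead\n\nconfig = UtrLmConfig()\nhead = MaskedLMHead(config)\n\nhidden = torch.randn(2, 16, config.hidden_size)  # stand-in for last_hidden_state\nhead_output = head({\"last_hidden_state\": hidden})\nprint(head_output[0].shape)  # (2, 16, config.vocab_size); first element assumed to be the logits\n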
"},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.output_name","title":"output_name class-attribute
instance-attribute
","text":"Pythonoutput_name: str = 'last_hidden_state'\n
The default output to use for the head.
"},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward","title":"forward","text":"Pythonforward(outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None) -> HeadOutput\n
Forward pass of the MaskedLMHead.
Parameters:
Name Type Description DefaultModelOutput | Tuple[Tensor, ...]
The outputs of the model.
requiredTensor | None
The labels for the head.
None
str | None
The name of the output to use. Defaults to self.output_name
.
None
Source code in multimolecule/module/heads/pretrain.py
Pythondef forward(\n self, outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None\n) -> HeadOutput:\n r\"\"\"\n Forward pass of the MaskedLMHead.\n\n Args:\n outputs: The outputs of the model.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n output = self.dropout(output)\n output = self.transform(output)\n output = self.decoder(output)\n if self.activation is not None:\n output = self.activation(output)\n if labels is not None:\n if isinstance(labels, NestedTensor):\n if isinstance(output, Tensor):\n output = labels.nested_like(output, strict=False)\n return HeadOutput(output, F.cross_entropy(output.concat, labels.concat))\n return HeadOutput(output, F.cross_entropy(output.view(-1, self.num_labels), labels.view(-1)))\n return HeadOutput(output)\n
"},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward(outputs)","title":"outputs
","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.generic","title":"multimolecule.module.heads.generic","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead","title":"PredictionHead","text":" Bases: Module
Head for tasks at all levels.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/generic.py
Pythonclass PredictionHead(nn.Module):\n r\"\"\"\n Head for all-level of tasks.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n num_labels: int\n requires_attention: bool = False\n\n def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n super().__init__()\n if head_config is None:\n head_config = config.head or HeadConfig(num_labels=config.num_labels)\n elif head_config.num_labels is None:\n head_config.num_labels = config.num_labels\n self.config = head_config\n if self.config.hidden_size is None:\n self.config.hidden_size = config.hidden_size\n if self.config.problem_type is None:\n self.config.problem_type = config.problem_type\n self.bos_token_id = config.bos_token_id\n self.eos_token_id = config.eos_token_id\n self.pad_token_id = config.pad_token_id\n self.num_labels = self.config.num_labels # type: ignore[assignment]\n self.dropout = nn.Dropout(self.config.dropout)\n self.transform = HeadTransformRegistryHF.build(self.config)\n self.decoder = nn.Linear(self.config.hidden_size, self.num_labels, bias=self.config.bias)\n self.activation = ACT2FN[self.config.act] if self.config.act is not None else None\n self.criterion = CriterionRegistry.build(self.config)\n\n def forward(self, embeddings: Tensor, labels: Tensor | None, **kwargs) -> HeadOutput:\n r\"\"\"\n Forward pass of the PredictionHead.\n\n Args:\n embeddings: The embeddings to be passed through the head.\n labels: The labels for the head.\n \"\"\"\n if kwargs:\n warn(\n f\"The following arguments are not applicable to {self.__class__.__name__}\"\n f\"and will be ignored: {kwargs.keys()}\"\n )\n output = self.dropout(embeddings)\n output = self.transform(output)\n output = self.decoder(output)\n if self.activation is not None:\n output = self.activation(output)\n if labels is not None:\n if isinstance(labels, NestedTensor):\n if isinstance(output, Tensor):\n output = labels.nested_like(output, strict=False)\n return HeadOutput(output, self.criterion(output.concat, labels.concat))\n return HeadOutput(output, self.criterion(output, labels))\n return HeadOutput(output)\n\n def _get_attention_mask(self, input_ids: NestedTensor | Tensor) -> Tensor:\n if isinstance(input_ids, NestedTensor):\n return input_ids.mask\n if input_ids is None:\n raise ValueError(\n f\"Either attention_mask or input_ids must be provided for {self.__class__.__name__} to work.\"\n )\n if self.pad_token_id is None:\n raise ValueError(\n f\"pad_token_id must be provided when attention_mask is not passed to {self.__class__.__name__}.\"\n )\n return input_ids.ne(self.pad_token_id)\n\n def _remove_special_tokens(\n self, output: Tensor, attention_mask: Tensor, input_ids: Tensor | None\n ) -> Tuple[Tensor, Tensor, Tensor]:\n # remove cls token embeddings\n if self.bos_token_id is not None:\n output = output[..., 1:, :]\n attention_mask = attention_mask[..., 1:]\n if input_ids is not None:\n input_ids = input_ids[..., 1:]\n # remove eos token embeddings\n if self.eos_token_id is not None:\n if input_ids is not None:\n eos_mask = input_ids.ne(self.eos_token_id).to(output)\n input_ids = input_ids[..., :-1]\n else:\n last_valid_indices = attention_mask.sum(dim=-1)\n seq_length = attention_mask.size(-1)\n eos_mask = torch.arange(seq_length, device=output.device) == last_valid_indices.unsqueeze(1)\n output = output * eos_mask[:, :, None]\n output = output[..., :-1, :]\n attention_mask = attention_mask[..., 1:]\n 
return output, attention_mask, input_ids\n
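A minimal usage sketch follows; it assumes the import paths below and an illustrative RnaFmConfig, and simply feeds pooled embeddings through the head so that the criterion selected by the configuration computes the loss.
Pythonimport torch\nfrom multimolecule.models import RnaFmConfig  # illustrative config choice (assumed available)\nfrom multimolecule.module.heads import PredictionHead  # import path assumed\n\n# num_labels and problem_type are illustrative; they are normally taken from the model config.\nconfig = RnaFmConfig(num_labels=2, problem_type='single_label_classification')\nhead = PredictionHead(config)\n\nembeddings = torch.randn(4, config.hidden_size)  # e.g. pooled sequence embeddings\nlabels = torch.randint(0, 2, (4,))\n\nout = head(embeddings, labels)  # HeadOutput(logits, loss) via the criterion built from the config\nprint(out.logits.shape, out.loss)\n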
"},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead.forward","title":"forward","text":"Pythonforward(embeddings: Tensor, labels: Tensor | None, **kwargs) -> HeadOutput\n
Forward pass of the PredictionHead.
Parameters:
Name Type Description DefaultTensor
The embeddings to be passed through the head.
requiredTensor | None
The labels for the head.
required Source code in multimolecule/module/heads/generic.py
Pythondef forward(self, embeddings: Tensor, labels: Tensor | None, **kwargs) -> HeadOutput:\n r\"\"\"\n Forward pass of the PredictionHead.\n\n Args:\n embeddings: The embeddings to be passed through the head.\n labels: The labels for the head.\n \"\"\"\n if kwargs:\n warn(\n f\"The following arguments are not applicable to {self.__class__.__name__}\"\n f\"and will be ignored: {kwargs.keys()}\"\n )\n output = self.dropout(embeddings)\n output = self.transform(output)\n output = self.decoder(output)\n if self.activation is not None:\n output = self.activation(output)\n if labels is not None:\n if isinstance(labels, NestedTensor):\n if isinstance(output, Tensor):\n output = labels.nested_like(output, strict=False)\n return HeadOutput(output, self.criterion(output.concat, labels.concat))\n return HeadOutput(output, self.criterion(output, labels))\n return HeadOutput(output)\n
"},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead.forward(embeddings)","title":"embeddings
","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.output","title":"multimolecule.module.heads.output","text":""},{"location":"module/heads/#multimolecule.module.heads.output.HeadOutput","title":"HeadOutput dataclass
","text":" Bases: ModelOutput
Output of a prediction head.
Parameters:
Name Type Description DefaultFloatTensor
The prediction logits from the head.
requiredFloatTensor | None
The loss from the head. Defaults to None.
None
Source code in multimolecule/module/heads/output.py
Python@dataclass\nclass HeadOutput(ModelOutput):\n r\"\"\"\n Output of a prediction head.\n\n Args:\n logits: The prediction logits from the head.\n loss: The loss from the head.\n Defaults to None.\n \"\"\"\n\n logits: FloatTensor\n loss: FloatTensor | None = None\n
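For clarity, the snippet below shows the two access styles a HeadOutput supports as a ModelOutput; the import path is taken from the source file named above.
Pythonimport torch\nfrom multimolecule.module.heads.output import HeadOutput  # path taken from the source listing above\n\nout = HeadOutput(logits=torch.randn(2, 3))\nprint(out.logits.shape)     # attribute access\nprint(out['logits'].shape)  # dict-style access provided by ModelOutput\nprint(out.loss)             # None when no labels were supplied to the head\n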
"},{"location":"module/heads/#multimolecule.module.heads.output.HeadOutput(logits)","title":"logits
","text":""},{"location":"module/heads/#multimolecule.module.heads.output.HeadOutput(loss)","title":"loss
","text":""},{"location":"tokenisers/","title":"tokenisers","text":"tokenisers
provide a collection of pre-defined tokenizers.
A tokenizer is a class that converts a sequence of nucleotides or amino acids into a sequence of indices. It is used to pre-process the input sequence before feeding it into a model.
Please refer to Tokenizer for more details.
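As a quick illustration of the conversion described above, the sketch below encodes a short RNA sequence into indices; RnaTokenizer is used here only as one example of the available tokenizers.
Pythonfrom multimolecule import RnaTokenizer\n\ntokenizer = RnaTokenizer()\nencoded = tokenizer('ACGUACGU')\nprint(encoded['input_ids'])       # sequence of indices, with special tokens added\nprint(encoded['attention_mask'])  # all ones for a single un-padded sequence\n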
"},{"location":"tokenisers/#available-tokenizers","title":"Available Tokenizers","text":"DnaTokenizer is smart, it tokenizes raw DNA nucleotides into tokens, no matter if the input is in uppercase or lowercase, uses T (Thymine) or U (Uracil), and with or without special tokens. It also supports tokenization into nmers and codons, so you don\u2019t have to write complex code to preprocess your data.
By default, DnaTokenizer
uses the standard alphabet. If nmers
is greater than 1
, or codon
is set to True
, it will instead use the streamline alphabet.
MultiMolecule provides a set of predefined alphabets for tokenization.
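The sketch below illustrates the three tokenization modes mentioned above; it assumes the standard Hugging Face tokenize() helper is available on the tokenizer.
Pythonfrom multimolecule import DnaTokenizer\n\nchar_tokenizer = DnaTokenizer()\nkmer_tokenizer = DnaTokenizer(nmers=3)\ncodon_tokenizer = DnaTokenizer(codon=True)\n\nsequence = 'TATAAAGTA'\nprint(char_tokenizer.tokenize(sequence))   # one token per nucleotide\nprint(kmer_tokenizer.tokenize(sequence))   # overlapping 3-mers\nprint(codon_tokenizer.tokenize(sequence))  # non-overlapping codons\n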
"},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer","title":"multimolecule.tokenisers.DnaTokenizer","text":" Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard DNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
iupac
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace U with T.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import DnaTokenizer\n>>> tokenizer = DnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHV.X*-')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = DnaTokenizer(replace_U_with_T=False)\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = DnaTokenizer(nmers=3)\n>>> tokenizer('tataaagta')[\"input_ids\"]\n[1, 84, 21, 81, 6, 8, 19, 71, 2]\n>>> tokenizer = DnaTokenizer(codon=True)\n>>> tokenizer('tataaagta')[\"input_ids\"]\n[1, 84, 6, 71, 2]\n>>> tokenizer('tataaagtaa')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
Pythonclass DnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for DNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `iupac`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_U_with_T: Whether to replace U with T.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import DnaTokenizer\n >>> tokenizer = DnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHV.X*-')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = DnaTokenizer(replace_U_with_T=False)\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = DnaTokenizer(nmers=3)\n >>> tokenizer('tataaagta')[\"input_ids\"]\n [1, 84, 21, 81, 6, 8, 19, 71, 2]\n >>> tokenizer = DnaTokenizer(codon=True)\n >>> tokenizer('tataaagta')[\"input_ids\"]\n [1, 84, 6, 71, 2]\n >>> tokenizer('tataaagtaa')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_U_with_T: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_U_with_T=replace_U_with_T,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_U_with_T = replace_U_with_T\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_U_with_T:\n text = text.replace(\"U\", \"T\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(codon)","title":"codon
","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(replace_U_with_T)","title":"replace_U_with_T
","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"tokenisers/dna/#standard-alphabet","title":"Standard Alphabet","text":"The standard alphabet is an extended version of the IUPAC alphabet. This extension includes two additional symbols to the IUPAC alphabet, X
and *
.
X
: Any base; is slightly different from N
which represents Unknown base. In automatic word embedding conversion, the X
will be initialized as the mean of A
, C
, G
, and T
, while N
will not be further processed.*
: is not used in MultiMolecule and is reserved for future use.gap
Note that we use .
to represent a gap in the sequence.
While -
exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.
IUPAC nucleotide code is a standard nucleotide code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent DNA sequences.
It consists of 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap in addition to the streamline alphabet.
Code Represents A Adenine C Cytosine G Guanine T Thymine R A or G Y C or T S C or G W A or T K G or T M A or C B C, G, or T D A, G, or T H A, C, or T V A, C, or G N A, C, G, or T . GapNote that we use .
to represent a gap in the sequence.
The streamline alphabet adds one symbol to the nucleobase alphabet, N
, to represent an unknown nucleobase.
The nucleobase alphabet is a minimal version of the DNA alphabet that includes only the four canonical nucleotides A
, C
, G
, and T
.
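The predefined alphabets described above can be selected by name when constructing the tokenizer, as sketched below; the exact vocabulary sizes printed depend on the alphabet definitions.
Pythonfrom multimolecule import DnaTokenizer\n\niupac_tokenizer = DnaTokenizer(alphabet='iupac')\nminimal_tokenizer = DnaTokenizer(alphabet='nucleobase')\n\n# The richer IUPAC alphabet yields a larger vocabulary than the minimal nucleobase alphabet.\nprint(len(iupac_tokenizer), len(minimal_tokenizer))\n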
DotBracketTokenizer provides a simple way to tokenize secondary structure in dot-bracket notation. It also supports tokenization into nmers and codons, so you don\u2019t have to write complex code to preprocess your data.
By default, DotBracketTokenizer
uses the standard alphabet. If nmers
is greater than 1
, or codon
is set to True
, it will instead use the streamline alphabet.
MultiMolecule provides a set of predefined alphabets for tokenization.
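A short sketch of tokenizing a dot-bracket string follows; it assumes the standard Hugging Face tokenize() helper in addition to the call syntax shown in the examples below.
Pythonfrom multimolecule import DotBracketTokenizer\n\ntokenizer = DotBracketTokenizer()\nstructure = '((((...))))'\nprint(tokenizer.tokenize(structure))      # one token per structure symbol\nprint(tokenizer(structure)['input_ids'])  # indices with special tokens added\n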
"},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer","title":"multimolecule.tokenisers.DotBracketTokenizer","text":" Bases: Tokenizer
Tokenizer for Secondary Structure sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard Secondary Structure alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
iupac
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
Examples:
Python Console Session>>> from multimolecule import DotBracketTokenizer\n>>> tokenizer = DotBracketTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]\n>>> tokenizer('(.)')[\"input_ids\"]\n[1, 7, 6, 8, 2]\n>>> tokenizer('+(.)')[\"input_ids\"]\n[1, 9, 7, 6, 8, 2]\n>>> tokenizer = DotBracketTokenizer(nmers=3)\n>>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n[1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]\n>>> tokenizer = DotBracketTokenizer(codon=True)\n>>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n[1, 27, 29, 6, 6, 6, 16, 48, 2]\n>>> tokenizer('(((((+...........)))))')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22\n
Source code in multimolecule/tokenisers/dot_bracket/tokenization_db.py
Pythonclass DotBracketTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for Secondary Structure sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard Secondary Structure alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `iupac`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n\n Examples:\n >>> from multimolecule import DotBracketTokenizer\n >>> tokenizer = DotBracketTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]\n >>> tokenizer('(.)')[\"input_ids\"]\n [1, 7, 6, 8, 2]\n >>> tokenizer('+(.)')[\"input_ids\"]\n [1, 9, 7, 6, 8, 2]\n >>> tokenizer = DotBracketTokenizer(nmers=3)\n >>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n [1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]\n >>> tokenizer = DotBracketTokenizer(codon=True)\n >>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n [1, 27, 29, 6, 6, 6, 16, 48, 2]\n >>> tokenizer('(((((+...........)))))')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer(nmers)","title":"nmers
","text":""},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer(codon)","title":"codon
","text":""},{"location":"tokenisers/dot_bracket/#standard-alphabet","title":"Standard Alphabet","text":"The standard alphabet is an extended version of the Extended Dot-Bracket Notation. This extension includes most symbols from the WUSS notation for better compatibility with existing tools.
Code Represents . unpaired ( internal helices of all terminal stems ) internal helices of all terminal stems + nick between strands , unpaired in multibranch loops [ internal helices that include at least one annotated () stem ] internal helices that include at least one annotated () stem { all internal helices of deeper multifurcations } all internal helices of deeper multifurcations | mostly paired < simple terminal stems > simple terminal stems - bulges and interior loops _ unpaired : single stranded in the exterior loop ~ local structural alignment left regions of target and query unaligned $ Not Used @ Not Used ^ Not Used % Not Used * Not Used"},{"location":"tokenisers/dot_bracket/#extended-alphabet","title":"Extended Alphabet","text":"Extended Dot-Bracket Notation is a more generalized version of the original Dot-Bracket notation that may use additional pairs of brackets for annotating pseudo-knots, since different pairs of brackets are not required to be nested.
Code Represents . unpaired ( internal helices of all terminal stems ) internal helices of all terminal stems + nick between strands , unpaired in multibranch loops [ internal helices that include at least one annotated () stem ] internal helices that include at least one annotated () stem { all internal helices of deeper multifurcations } all internal helices of deeper multifurcations | mostly paired < simple terminal stems > simple terminal stemsNote that we use .
to represent a gap in the sequence.
The streamline alphabet adds one symbol to the nucleobase alphabet, N
, to represent an unknown nucleobase.
ProteinTokenizer is smart: it tokenizes raw amino acids into tokens whether the input is uppercase or lowercase, and with or without special tokens.
By default, ProteinTokenizer
uses the standard alphabet.
MultiMolecule provides a set of predefined alphabets for tokenization.
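As a small sketch of the case handling described above, upper- and lowercase inputs encode to the same indices because do_upper_case defaults to True.
Pythonfrom multimolecule import ProteinTokenizer\n\ntokenizer = ProteinTokenizer()\nassert tokenizer('MANLGCWMLV')['input_ids'] == tokenizer('manlgcwmlv')['input_ids']\n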
"},{"location":"tokenisers/protein/#multimolecule.tokenisers.ProteinTokenizer","title":"multimolecule.tokenisers.ProteinTokenizer","text":" Bases: Tokenizer
Tokenizer for Protein sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard protein alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
iupac
streamline
None
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import ProteinTokenizer\n>>> tokenizer = ProteinTokenizer()\n>>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')[\"input_ids\"]\n[1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]\n>>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]\n>>> tokenizer('manlgcwmlv')[\"input_ids\"]\n[1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]\n
Source code in multimolecule/tokenisers/protein/tokenization_protein.py
Pythonclass ProteinTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for Protein sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `iupac`\n + `streamline`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import ProteinTokenizer\n >>> tokenizer = ProteinTokenizer()\n >>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')[\"input_ids\"]\n [1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]\n >>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]\n >>> tokenizer('manlgcwmlv')[\"input_ids\"]\n [1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet)\n super().__init__(\n alphabet=alphabet,\n additional_special_tokens=additional_special_tokens,\n do_upper_case=do_upper_case,\n **kwargs,\n )\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n return list(text)\n
"},{"location":"tokenisers/protein/#multimolecule.tokenisers.ProteinTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"tokenisers/protein/#multimolecule.tokenisers.ProteinTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"tokenisers/protein/#standard-alphabet","title":"Standard Alphabet","text":"The standard alphabet is an extended version of the IUPAC alphabet. This extension includes six additional symbols to the IUPAC alphabet, J
, U
, O
, .
, -
, and *
.
J
: Xle; Leucine (L) or Isoleucine (I)U
: Sec; SelenocysteineO
: Pyl; Pyrrolysine.
: is not used in MultiMolecule and is reserved for future use.-
: is not used in MultiMolecule and is reserved for future use.*
: is not used in MultiMolecule and is reserved for future use.IUPAC amino acid code is a standard amino acid code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent Protein sequences.
The IUPAC amino acid code adds three symbols to the Streamline Alphabet, B
, Z
, and X
.
The streamline alphabet is a simplified version of the standard alphabet.
Amino Acid Code Three letter Code Amino Acid A Ala Alanine C Cys Cysteine D Asp Aspartic Acid E Glu Glutamic Acid F Phe Phenylalanine G Gly Glycine H His Histidine I Ile Isoleucine K Lys Lysine L Leu Leucine M Met Methionine N Asn Asparagine P Pro Proline Q Gln Glutamine R Arg Arginine S Ser Serine T Thr Threonine V Val Valine W Trp Tryptophan Y Tyr Tyrosine X Xaa Any amino acid"},{"location":"tokenisers/rna/","title":"RnaTokenizer","text":"RnaTokenizer is smart: it tokenizes raw RNA nucleotides into tokens whether the input is uppercase or lowercase, uses U (Uracil) or T (Thymine), and comes with or without special tokens. It also supports tokenization into nmers and codons, so you don\u2019t have to write complex preprocessing code.
By default, RnaTokenizer
uses the standard alphabet. If nmers
is greater than 1
, or codon
is set to True
, it will instead use the streamline alphabet.
MultiMolecule provides a set of predefined alphabets for tokenization.
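A small sketch of the T/U handling described above: DNA-style input maps to the same indices as RNA input unless the replacement is disabled.
Pythonfrom multimolecule import RnaTokenizer\n\ntokenizer = RnaTokenizer()\nassert tokenizer('ACGT')['input_ids'] == tokenizer('ACGU')['input_ids']\n\n# Disable the replacement to keep T as-is (it then falls back to the unknown token).\nstrict_tokenizer = RnaTokenizer(replace_T_with_U=False)\nprint(strict_tokenizer('ACGT')['input_ids'])\n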
"},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer","title":"multimolecule.tokenisers.RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"tokenisers/rna/#standard-alphabet","title":"Standard Alphabet","text":"The standard alphabet is an extended version of the IUPAC alphabet. This extension includes three additional symbols to the IUPAC alphabet, I
, X
and *
.
I
: Inosine; is a post-transcriptional modification that is not a standard RNA base. Inosine is the result of a deamination reaction of adenines that is catalyzed by adenosine deaminases acting on tRNAs (ADATs)X
: Any base; is slightly different from N
which represents Unknown base. In automatic word embedding conversion, the X
will be initialized as the mean of A
, C
, G
, and U
, while N
will not be further processed.*
: is not used in MultiMolecule and is reserved for future use.gap
Note that we use .
to represent a gap in the sequence.
While -
exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.
IUPAC nucleotide code is a standard nucleotide code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent RNA sequences.
It consists of 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap in addition to the streamline alphabet.
Code Represents A Adenine C Cytosine G Guanine U Uracil R A or G Y C or U S G or C W A or U K G or U M A or C B C, G, or U D A, G, or U H A, C, or U V A, C, or G N A, C, G, or U . GapNote that we use .
to represent a gap in the sequence.
The streamline alphabet adds one symbol to the nucleobase alphabet, N
, to represent an unknown nucleobase.
The nucleobase alphabet is a minimal version of the RNA alphabet that includes only the four canonical nucleotides A
, C
, G
, and U
.
Accelerate Molecular Biology Research with Machine Learning
"},{"location":"zh/#_1","title":"\u4ecb\u7ecd","text":"
\u200b\u6b22\u8fce\u200b\u6765\u5230\u200b MultiMolecule (\u200b\u6d66\u539f\u200b)\uff0c\u200b\u8fd9\u662f\u200b\u4e00\u6b3e\u200b\u57fa\u7840\u200b\u5e93\u200b\uff0c\u200b\u65e8\u5728\u200b\u901a\u8fc7\u200b\u673a\u5668\u200b\u5b66\u4e60\u200b\u52a0\u901f\u200b\u5206\u5b50\u751f\u7269\u5b66\u200b\u7684\u200b\u79d1\u7814\u200b\u8fdb\u5c55\u200b\u3002 MultiMolecule \u200b\u63d0\u4f9b\u200b\u4e86\u200b\u4e00\u5957\u200b\u5168\u9762\u200b\u4e14\u200b\u7075\u6d3b\u200b\u7684\u200b\u5de5\u5177\u200b\uff0c\u200b\u5e2e\u52a9\u200b\u7814\u7a76\u200b\u4eba\u5458\u200b\u8f7b\u677e\u200b\u5229\u7528\u200b AI\uff0c\u200b\u4e3b\u8981\u200b\u805a\u7126\u200b\u4e8e\u200b\u751f\u7269\u200b\u5206\u5b50\u200b\u6570\u636e\u200b\uff08RNA\u3001DNA \u200b\u548c\u200b\u86cb\u767d\u8d28\u200b\uff09\u3002
"},{"location":"zh/#_2","title":"\u6982\u89c8","text":"MultiMolecule \u200b\u4ee5\u200b\u7075\u6d3b\u6027\u200b\u548c\u200b\u6613\u7528\u6027\u200b\u4e3a\u200b\u8bbe\u8ba1\u200b\u6838\u5fc3\u200b\u3002 \u200b\u5176\u200b\u6a21\u5757\u5316\u200b\u8bbe\u8ba1\u200b\u5141\u8bb8\u200b\u60a8\u200b\u6839\u636e\u200b\u9700\u8981\u200b\u4ec5\u200b\u4f7f\u7528\u200b\u6240\u200b\u9700\u200b\u7684\u200b\u7ec4\u4ef6\u200b\uff0c\u200b\u5e76\u200b\u80fd\u200b\u65e0\u7f1d\u200b\u96c6\u6210\u200b\u5230\u200b\u73b0\u6709\u200b\u7684\u200b\u5de5\u4f5c\u200b\u6d41\u7a0b\u200b\u4e2d\u200b\uff0c\u200b\u800c\u200b\u4e0d\u4f1a\u200b\u589e\u52a0\u200b\u4e0d\u5fc5\u8981\u200b\u7684\u200b\u590d\u6742\u6027\u200b\u3002
data
\uff1a\u200b\u667a\u80fd\u200b\u7684\u200b Dataset
\uff0c\u200b\u80fd\u591f\u200b\u81ea\u52a8\u200b\u63a8\u65ad\u200b\u4efb\u52a1\u200b\uff0c\u200b\u5305\u62ec\u200b\u4efb\u52a1\u200b\u7684\u200b\u5c42\u7ea7\u200b\uff08\u200b\u5e8f\u5217\u200b\u3001\u200b\u4ee4\u724c\u200b\u3001\u200b\u63a5\u89e6\u200b\uff09\u200b\u548c\u200b\u7c7b\u578b\u200b\uff08\u200b\u5206\u7c7b\u200b\u3001\u200b\u56de\u5f52\u200b\uff09\u3002\u200b\u8fd8\u200b\u63d0\u4f9b\u200b\u591a\u4efb\u52a1\u200b\u6570\u636e\u200b\u96c6\u200b\u548c\u200b\u91c7\u6837\u5668\u200b\uff0c\u200b\u7b80\u5316\u200b\u591a\u4efb\u52a1\u200b\u5b66\u4e60\u200b\uff0c\u200b\u65e0\u9700\u200b\u989d\u5916\u200b\u914d\u7f6e\u200b\u3002datasets
\uff1a\u200b\u5e7f\u6cdb\u200b\u4f7f\u7528\u200b\u7684\u200b\u751f\u7269\u200b\u5206\u5b50\u200b\u6570\u636e\u200b\u96c6\u200b\u96c6\u5408\u200b\u3002module
\uff1a\u200b\u6a21\u5757\u5316\u200b\u795e\u7ecf\u7f51\u7edc\u200b\u6784\u5efa\u200b\u5757\u200b\uff0c\u200b\u5305\u62ec\u200b\u5d4c\u5165\u200b\u5c42\u200b\u3001\u200b\u9884\u6d4b\u200b\u5934\u200b\u548c\u200b\u635f\u5931\u200b\u51fd\u6570\u200b\uff0c\u200b\u7528\u4e8e\u200b\u6784\u5efa\u200b\u81ea\u5b9a\u4e49\u200b\u6a21\u578b\u200b\u3002models
\uff1a\u200b\u5206\u5b50\u751f\u7269\u5b66\u200b\u9886\u57df\u200b\u7684\u200b\u6700\u200b\u5148\u8fdb\u200b\u9884\u200b\u8bad\u7ec3\u200b\u6a21\u578b\u200b\u5b9e\u73b0\u200b\u3002tokenisers
\uff1a\u200b\u7528\u4e8e\u200b\u5c06\u200b DNA\u3001RNA\u3001\u200b\u86cb\u767d\u8d28\u200b\u53ca\u5176\u200b\u4ed6\u200b\u5e8f\u5217\u200b\u8f6c\u6362\u200b\u4e3a\u200b\u72ec\u70ed\u200b\u7f16\u7801\u200b\u7684\u200b\u5206\u8bcd\u5668\u200b\u3002\u200b\u4ece\u200b PyPI \u200b\u5b89\u88c5\u200b\u6700\u65b0\u200b\u7684\u200b\u7a33\u5b9a\u200b\u7248\u672c\u200b\uff1a
Bashpip install multimolecule\n
\u200b\u4ece\u200b\u6e90\u4ee3\u7801\u200b\u5b89\u88c5\u200b\u6700\u65b0\u200b\u7248\u672c\u200b\uff1a
Bashpip install git+https://github.com/DLS5-Omics/MultiMolecule\n
"},{"location":"zh/#_4","title":"\u5f15\u7528","text":"\u200b\u5982\u679c\u200b\u60a8\u200b\u5728\u200b\u7814\u7a76\u200b\u4e2d\u200b\u4f7f\u7528\u200b MultiMolecule\uff0c\u200b\u8bf7\u200b\u6309\u7167\u200b\u4ee5\u4e0b\u200b\u65b9\u5f0f\u200b\u5f15\u7528\u200b\u6211\u4eec\u200b\uff1a
BibTeX@software{chen_2024_12638419,\n author = {Chen, Zhiyuan and Zhu, Sophia Y.},\n title = {MultiMolecule},\n doi = {10.5281/zenodo.12638419},\n publisher = {Zenodo},\n url = {https://doi.org/10.5281/zenodo.12638419},\n year = 2024,\n month = may,\n day = 4\n}\n
"},{"location":"zh/#_5","title":"\u8bb8\u53ef\u8bc1","text":"\u200b\u6211\u4eec\u200b\u76f8\u4fe1\u200b\u5f00\u653e\u200b\u662f\u200b\u7814\u7a76\u200b\u7684\u200b\u57fa\u7840\u200b\u3002
MultiMolecule \u200b\u5728\u200bGNU Affero \u200b\u901a\u7528\u200b\u516c\u5171\u200b\u8bb8\u53ef\u8bc1\u200b\u4e0b\u200b\u6388\u6743\u200b\u3002
\u200b\u8bf7\u200b\u52a0\u5165\u200b\u6211\u4eec\u200b\uff0c\u200b\u5171\u540c\u200b\u5efa\u7acb\u200b\u4e00\u4e2a\u200b\u5f00\u653e\u200b\u7684\u200b\u7814\u7a76\u200b\u793e\u533a\u200b\u3002
SPDX-License-Identifier: AGPL-3.0-or-later
\u200b\u7531\u4e39\u7075\u200b\u5728\u200b\u5730\u7403\u200b\u5f00\u53d1\u200b
\u200b\u6211\u4eec\u200b\u662f\u200b\u4e00\u4e2a\u200b\u7531\u200b\u5f00\u53d1\u8005\u200b\u3001\u200b\u8bbe\u8ba1\u200b\u4eba\u5458\u200b\u548c\u200b\u5176\u4ed6\u200b\u4eba\u5458\u200b\u7ec4\u6210\u200b\u7684\u200b\u793e\u533a\u200b\uff0c\u200b\u81f4\u529b\u4e8e\u200b\u8ba9\u200b\u6df1\u5ea6\u200b\u5b66\u4e60\u200b\u6280\u672f\u200b\u66f4\u52a0\u200b\u5f00\u653e\u200b\u3002
\u200b\u6211\u4eec\u200b\u662f\u200b\u4e00\u4e2a\u200b\u7531\u200b\u4e2a\u4f53\u200b\u7ec4\u6210\u200b\u7684\u200b\u793e\u533a\u200b\uff0c\u200b\u81f4\u529b\u4e8e\u200b\u63a8\u52a8\u200b\u6df1\u5ea6\u200b\u5b66\u4e60\u200b\u7684\u200b\u53ef\u80fd\u6027\u200b\u8fb9\u754c\u200b\u3002
\u200b\u6211\u4eec\u200b\u5bf9\u200b\u6df1\u5ea6\u200b\u5b66\u4e60\u200b\u53ca\u5176\u200b\u7528\u6237\u200b\u5145\u6ee1\u200b\u6fc0\u60c5\u200b\u3002
\u200b\u6211\u4eec\u200b\u662f\u200b\u4e39\u7075\u200b\u3002
"},{"location":"zh/about/license-faq/","title":"License FAQ","text":"\u200b\u7ffb\u8bd1\u200b
\u200b\u672c\u6587\u200b\u5185\u5bb9\u200b\u4e3a\u200b\u7ffb\u8bd1\u200b\u7248\u672c\u200b\uff0c\u200b\u65e8\u5728\u200b\u4e3a\u200b\u7528\u6237\u200b\u63d0\u4f9b\u65b9\u4fbf\u200b\u3002 \u200b\u6211\u4eec\u200b\u5df2\u7ecf\u200b\u5c3d\u529b\u200b\u786e\u4fdd\u200b\u7ffb\u8bd1\u200b\u7684\u200b\u51c6\u786e\u6027\u200b\u3002 \u200b\u4f46\u200b\u8bf7\u200b\u6ce8\u610f\u200b\uff0c\u200b\u7ffb\u8bd1\u200b\u5185\u5bb9\u200b\u53ef\u80fd\u200b\u5305\u542b\u200b\u9519\u8bef\u200b\uff0c\u200b\u4ec5\u4f9b\u53c2\u8003\u200b\u3002 \u200b\u8bf7\u4ee5\u200b\u82f1\u6587\u200b\u539f\u6587\u200b\u4e3a\u51c6\u200b\u3002
\u200b\u4e3a\u200b\u6ee1\u8db3\u200b\u5408\u89c4\u6027\u200b\u4e0e\u200b\u6267\u6cd5\u200b\u8981\u6c42\u200b\uff0c\u200b\u7ffb\u8bd1\u200b\u6587\u6863\u200b\u4e2d\u200b\u7684\u200b\u4efb\u4f55\u200b\u4e0d\u200b\u51c6\u786e\u200b\u6216\u200b\u6b67\u4e49\u200b\u4e4b\u5904\u200b\u5747\u200b\u4e0d\u200b\u5177\u6709\u200b\u7ea6\u675f\u529b\u200b\uff0c\u200b\u4e5f\u200b\u4e0d\u200b\u5177\u5907\u200b\u6cd5\u5f8b\u6548\u529b\u200b\u3002
"},{"location":"zh/about/license-faq/#_1","title":"\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54","text":"\u200b\u672c\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\u89e3\u91ca\u200b\u4e86\u200b\u60a8\u200b\u53ef\u4ee5\u200b\u5728\u200b\u4f55\u79cd\u200b\u6761\u4ef6\u200b\u4e0b\u200b\u4f7f\u7528\u200b\u7531\u4e39\u7075\u200b\u56e2\u961f\u200b\uff08\u200b\u4e5f\u200b\u79f0\u4e3a\u200b\u4e39\u7075\u200b\uff09\uff08\u201c\u200b\u6211\u4eec\u200b\u201d\u200b\u6216\u200b\u201c\u200b\u6211\u4eec\u200b\u7684\u200b\u201d\uff09\u200b\u63d0\u4f9b\u200b\u7684\u200b\u6570\u636e\u200b\u3001\u200b\u6a21\u578b\u200b\u3001\u200b\u4ee3\u7801\u200b\u3001\u200b\u914d\u7f6e\u200b\u3001\u200b\u6587\u6863\u200b\u548c\u200b\u6743\u91cd\u200b\u3002 \u200b\u5b83\u200b\u4f5c\u4e3a\u200b\u6211\u4eec\u200b\u7684\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7684\u200b\u9644\u52a0\u6587\u4ef6\u200b\u3002
"},{"location":"zh/about/license-faq/#0","title":"0. \u200b\u5173\u952e\u70b9\u200b\u603b\u7ed3","text":"\u200b\u672c\u200b\u603b\u7ed3\u200b\u63d0\u4f9b\u200b\u4e86\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\u7684\u200b\u5173\u952e\u70b9\u200b\uff0c\u200b\u4f46\u200b\u60a8\u200b\u53ef\u4ee5\u200b\u901a\u8fc7\u200b\u70b9\u51fb\u200b\u6bcf\u4e2a\u200b\u5173\u952e\u70b9\u200b\u540e\u200b\u7684\u200b\u94fe\u63a5\u200b\u6216\u200b\u4f7f\u7528\u200b\u76ee\u5f55\u200b\u6765\u200b\u627e\u5230\u200b\u60a8\u200b\u6240\u200b\u67e5\u627e\u200b\u7684\u200b\u90e8\u5206\u200b\u4ee5\u200b\u4e86\u89e3\u200b\u66f4\u200b\u591a\u200b\u8be6\u60c5\u200b\u3002
\u200b\u5728\u200b MultiMolecule \u200b\u4e2d\u200b\uff0c\u200b\u4ec0\u4e48\u200b\u6784\u6210\u200b\u4e86\u200b\u201c\u200b\u6e90\u4ee3\u7801\u200b\u201d\uff1f
\u200b\u6211\u4eec\u200b\u8ba4\u4e3a\u200b\u6211\u4eec\u200b\u5b58\u50a8\u200b\u5e93\u4e2d\u200b\u7684\u200b\u6240\u6709\u200b\u5185\u5bb9\u200b\u90fd\u200b\u662f\u200b\u6e90\u4ee3\u7801\u200b\uff0c\u200b\u5305\u62ec\u200b\u6570\u636e\u200b\u3001\u200b\u6a21\u578b\u200b\u3001\u200b\u4ee3\u7801\u200b\u3001\u200b\u914d\u7f6e\u200b\u548c\u200b\u6587\u6863\u200b\u3002
\u200b\u5728\u200bMultiMolecule\u200b\u4e2d\u200b\uff0c\u200b\u4ec0\u4e48\u200b\u6784\u6210\u200b\u4e86\u200b\u201c\u200b\u6e90\u4ee3\u7801\u200b\u201d\uff1f
\u200b\u6211\u200b\u53ef\u4ee5\u200b\u4f7f\u7528\u200b MultiMolecule \u200b\u53d1\u8868\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u5417\u200b\uff1f
\u200b\u89c6\u200b\u60c5\u51b5\u200b\u800c\u5b9a\u200b\u3002
\u200b\u60a8\u200b\u53ef\u4ee5\u200b\u6309\u7167\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7684\u200b\u6761\u6b3e\u200b\u5728\u200b\u5b8c\u5168\u200b\u5f00\u653e\u200b\u83b7\u53d6\u200b\u7684\u200b\u671f\u520a\u200b\u548c\u200b\u4f1a\u8bae\u200b\u6216\u9884\u200b\u5370\u672c\u200b\u670d\u52a1\u5668\u200b\u4e0a\u200b\u53d1\u8868\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u3002
\u200b\u8981\u200b\u5728\u200b\u5c01\u95ed\u200b\u83b7\u53d6\u200b\u7684\u200b\u671f\u520a\u200b\u548c\u200b\u4f1a\u8bae\u200b\u4e0a\u200b\u53d1\u8868\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\uff0c\u200b\u60a8\u200b\u5fc5\u987b\u200b\u4ece\u200b\u6211\u4eec\u200b\u8fd9\u91cc\u200b\u83b7\u5f97\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u3002
\u200b\u6211\u200b\u53ef\u4ee5\u200b\u4f7f\u7528\u200bMultiMolecule\u200b\u53d1\u8868\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u5417\u200b\uff1f
\u200b\u6211\u200b\u53ef\u4ee5\u200b\u5c06\u200b MultiMolecule \u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\u5417\u200b\uff1f
\u200b\u662f\u200b\u7684\u200b\uff0c\u200b\u60a8\u200b\u53ef\u4ee5\u200b\u6839\u636e\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7684\u200b\u6761\u6b3e\u200b\u5c06\u200bMultiMolecule\u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\u3002
\u200b\u6211\u200b\u53ef\u4ee5\u200b\u5c06\u200bMultiMolecule\u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\u5417\u200b\uff1f
\u200b\u4e0e\u200b\u67d0\u4e9b\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\u7684\u200b\u4eba\u200b\u662f\u5426\u200b\u6709\u200b\u7279\u5b9a\u200b\u7684\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\uff1f
\u200b\u662f\u200b\u7684\u200b\uff0c\u200b\u4e0e\u200b\u67d0\u4e9b\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\u7684\u200b\u4eba\u200b\u6709\u200b\u7279\u5b9a\u200b\u7684\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\u3002
\u200b\u4e0e\u200b\u67d0\u4e9b\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\u7684\u200b\u4eba\u200b\u662f\u5426\u200b\u6709\u200b\u7279\u5b9a\u200b\u7684\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\uff1f
"},{"location":"zh/about/license-faq/#1-multimolecule","title":"1. \u200b\u5728\u200b MultiMolecule \u200b\u4e2d\u200b\uff0c\u200b\u4ec0\u4e48\u200b\u6784\u6210\u200b\u4e86\u200b\u201c\u200b\u6e90\u4ee3\u7801\u200b\u201d\uff1f","text":"\u200b\u6211\u4eec\u200b\u8ba4\u4e3a\u200b\u6211\u4eec\u200b\u5b58\u50a8\u200b\u5e93\u4e2d\u200b\u7684\u200b\u6240\u6709\u200b\u5185\u5bb9\u200b\u90fd\u200b\u662f\u200b\u6e90\u4ee3\u7801\u200b\u3002
\u200b\u673a\u5668\u200b\u5b66\u4e60\u200b\u6a21\u578b\u200b\u7684\u200b\u8bad\u7ec3\u200b\u8fc7\u7a0b\u200b\u88ab\u200b\u89c6\u4f5c\u200b\u7c7b\u4f3c\u200b\u4e8e\u200b\u4f20\u7edf\u200b\u8f6f\u4ef6\u200b\u7684\u200b\u7f16\u8bd1\u200b\u8fc7\u7a0b\u200b\u3002\u200b\u56e0\u6b64\u200b\uff0c\u200b\u6a21\u578b\u200b\u3001\u200b\u4ee3\u7801\u200b\u3001\u200b\u914d\u7f6e\u200b\u3001\u200b\u6587\u6863\u200b\u548c\u200b\u7528\u4e8e\u200b\u8bad\u7ec3\u200b\u7684\u200b\u6570\u636e\u200b\u90fd\u200b\u88ab\u200b\u89c6\u4e3a\u200b\u6e90\u4ee3\u7801\u200b\u7684\u200b\u4e00\u90e8\u5206\u200b\uff0c\u200b\u800c\u200b\u8bad\u7ec3\u200b\u51fa\u200b\u7684\u200b\u6a21\u578b\u200b\u6743\u91cd\u200b\u5219\u200b\u88ab\u200b\u89c6\u4e3a\u200b\u76ee\u6807\u200b\u4ee3\u7801\u200b\u7684\u200b\u4e00\u90e8\u5206\u200b\u3002
\u200b\u6211\u4eec\u200b\u8fd8\u200b\u5c06\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u548c\u200b\u624b\u7a3f\u200b\u89c6\u4e3a\u200b\u4e00\u79cd\u200b\u7279\u6b8a\u200b\u7684\u200b\u6587\u6863\u200b\u5f62\u5f0f\u200b\uff0c\u200b\u5b83\u4eec\u200b\u4e5f\u200b\u662f\u200b\u6e90\u4ee3\u7801\u200b\u7684\u200b\u4e00\u90e8\u5206\u200b\u3002
"},{"location":"zh/about/license-faq/#2-multimolecule","title":"2 \u200b\u6211\u200b\u53ef\u4ee5\u200b\u4f7f\u7528\u200b MultiMolecule \u200b\u53d1\u8868\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u5417\u200b\uff1f","text":"\u200b\u7531\u4e8e\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u88ab\u200b\u89c6\u4e3a\u200b\u6e90\u4ee3\u7801\u200b\u7684\u200b\u4e00\u79cd\u200b\u5f62\u5f0f\u200b\uff0c\u200b\u5982\u679c\u200b\u53d1\u8868\u200b\u4f7f\u7528\u200b MultiMolecule \u200b\u7684\u200b\u8bba\u6587\u200b\uff0c\u200b\u51fa\u7248\u5546\u200b\u5fc5\u987b\u200b\u5f00\u6e90\u200b\u5176\u200b\u670d\u52a1\u5668\u200b\u4e0a\u200b\u7684\u200b\u6240\u6709\u200b\u6750\u6599\u200b\uff0c\u200b\u4ee5\u200b\u7b26\u5408\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7684\u200b\u8981\u6c42\u200b\u3002\u200b\u5bf9\u4e8e\u200b\u5927\u591a\u6570\u200b\u51fa\u7248\u5546\u200b\u6765\u8bf4\u200b\uff0c\u200b\u8fd9\u662f\u200b\u4e0d\u5207\u5b9e\u9645\u200b\u7684\u200b\u3002
\u200b\u4f5c\u4e3a\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7b2c\u200b 7 \u200b\u6761\u200b\u7684\u200b\u7279\u522b\u200b\u8c41\u514d\u200b\uff0c\u200b\u6211\u4eec\u200b\u5141\u8bb8\u200b\u5728\u200b\u4e0d\u200b\u5411\u200b\u4f5c\u8005\u200b\u6536\u53d6\u200b\u4efb\u4f55\u200b\u8d39\u7528\u200b\u7684\u200b\u5b8c\u5168\u200b\u5f00\u653e\u200b\u83b7\u53d6\u200b\u7684\u200b\u671f\u520a\u200b\u3001\u200b\u4f1a\u8bae\u200b\u6216\u9884\u200b\u5370\u672c\u200b\u670d\u52a1\u5668\u200b\u4e0a\u200b\u53d1\u8868\u200b\u4f7f\u7528\u200b MultiMolecule \u200b\u7684\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\uff0c\u200b\u524d\u63d0\u200b\u662f\u200b\u6240\u6709\u200b\u53d1\u8868\u200b\u7684\u200b\u624b\u7a3f\u200b\u90fd\u200b\u5e94\u200b\u6309\u7167\u200b\u5141\u8bb8\u200b\u5171\u4eab\u200b\u624b\u7a3f\u200b\u7684\u200bGNU \u200b\u81ea\u7531\u200b\u6587\u6863\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\uff08GFDL\uff09\u200b\u6216\u200b\u77e5\u8bc6\u200b\u5171\u4eab\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u6216\u200bOSI \u200b\u6279\u51c6\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u63d0\u4f9b\u200b\u3002
\u200b\u4f5c\u4e3a\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7b2c\u200b 7 \u200b\u6761\u200b\u7684\u200b\u7279\u522b\u200b\u8c41\u514d\u200b\uff0c\u200b\u6211\u4eec\u200b\u5141\u8bb8\u200b\u5728\u200b\u90e8\u5206\u200b\u975e\u76c8\u5229\u6027\u200b\u7684\u200b\u6742\u5fd7\u200b\u3001\u200b\u4f1a\u8bae\u200b\u6216\u9884\u200b\u5370\u672c\u200b\u670d\u52a1\u5668\u200b\u4e0a\u200b\u53d1\u8868\u200b\u4f7f\u7528\u200b MultiMolecule \u200b\u7684\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u3002\u200b\u76ee\u524d\u200b\uff0c\u200b\u6211\u4eec\u200b\u5141\u8bb8\u200b\u7684\u200b\u975e\u76c8\u5229\u6027\u200b\u6742\u5fd7\u200b\u3001\u200b\u4f1a\u8bae\u200b\u6216\u9884\u200b\u5370\u672c\u200b\u670d\u52a1\u5668\u200b\u5305\u62ec\u200b\uff1a
\u200b\u8981\u200b\u5728\u200b\u5c01\u95ed\u200b\u83b7\u53d6\u200b\u7684\u200b\u671f\u520a\u200b\u6216\u200b\u4f1a\u8bae\u200b\u4e0a\u200b\u53d1\u8868\u200b\u8bba\u6587\u200b\uff0c\u200b\u60a8\u200b\u5fc5\u987b\u200b\u4ece\u200b\u6211\u4eec\u200b\u8fd9\u91cc\u200b\u83b7\u5f97\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u3002\u200b\u8fd9\u200b\u901a\u5e38\u200b\u5305\u62ec\u200b\u5171\u540c\u200b\u7f72\u540d\u200b\u3001\u200b\u652f\u6301\u200b\u9879\u76ee\u200b\u7684\u200b\u8d39\u7528\u200b\u6216\u200b\u4e24\u8005\u200b\u517c\u800c\u6709\u4e4b\u200b\u3002\u200b\u8bf7\u200b\u901a\u8fc7\u200b multimolecule@zyc.ai \u200b\u4e0e\u200b\u6211\u4eec\u200b\u8054\u7cfb\u200b\u4ee5\u200b\u83b7\u53d6\u200b\u66f4\u200b\u591a\u200b\u4fe1\u606f\u200b\u3002
\u200b\u867d\u7136\u200b\u4e0d\u662f\u200b\u5f3a\u5236\u6027\u200b\u7684\u200b\uff0c\u200b\u4f46\u200b\u6211\u4eec\u200b\u5efa\u8bae\u200b\u5728\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u4e2d\u200b\u5f15\u7528\u200b MultiMolecule \u200b\u9879\u76ee\u200b\u3002
"},{"location":"zh/about/license-faq/#3-multimolecule","title":"3. \u200b\u6211\u200b\u53ef\u4ee5\u200b\u5c06\u200b MultiMolecule \u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\u5417\u200b\uff1f","text":"\u200b\u662f\u200b\u7684\u200b\uff0c\u200b\u60a8\u200b\u53ef\u4ee5\u200b\u6839\u636e\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u5c06\u200b MultiMolecule \u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\u3002\u200b\u4f46\u662f\u200b\uff0c\u200b\u60a8\u200b\u5fc5\u987b\u200b\u5f00\u6e90\u200b\u5bf9\u200b\u6e90\u4ee3\u7801\u200b\u7684\u200b\u4efb\u4f55\u200b\u4fee\u6539\u200b\uff0c\u200b\u5e76\u200b\u4f7f\u200b\u5176\u200b\u5728\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u4e0b\u200b\u53ef\u7528\u200b\u3002
\u200b\u5982\u679c\u200b\u60a8\u200b\u5e0c\u671b\u200b\u5728\u200b\u4e0d\u200b\u5f00\u6e90\u200b\u4fee\u6539\u200b\u5185\u5bb9\u200b\u7684\u200b\u60c5\u51b5\u200b\u4e0b\u200b\u5c06\u200b MultiMolecule \u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\uff0c\u200b\u5219\u200b\u5fc5\u987b\u200b\u4ece\u200b\u6211\u4eec\u200b\u8fd9\u91cc\u200b\u83b7\u5f97\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u3002\u200b\u8fd9\u200b\u901a\u5e38\u200b\u6d89\u53ca\u200b\u652f\u6301\u200b\u9879\u76ee\u200b\u7684\u200b\u8d39\u7528\u200b\u3002\u200b\u8bf7\u200b\u901a\u8fc7\u200b multimolecule@zyc.ai \u200b\u4e0e\u200b\u6211\u4eec\u200b\u8054\u7cfb\u200b\u4ee5\u200b\u83b7\u53d6\u200b\u66f4\u200b\u591a\u200b\u8be6\u7ec6\u4fe1\u606f\u200b\u3002
"},{"location":"zh/about/license-faq/#4","title":"4. \u200b\u4e0e\u200b\u67d0\u4e9b\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\u7684\u200b\u4eba\u200b\u662f\u5426\u200b\u6709\u200b\u7279\u5b9a\u200b\u7684\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\uff1f","text":"\u200b\u662f\u200b\u7684\u200b\uff01
\u200b\u5982\u679c\u200b\u60a8\u200b\u4e0e\u200b\u4e00\u4e2a\u200b\u4e0e\u200b\u6211\u4eec\u200b\u6709\u200b\u5355\u72ec\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u7684\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\uff0c\u200b\u60a8\u200b\u53ef\u80fd\u200b\u4f1a\u200b\u53d7\u5230\u200b\u4e0d\u540c\u200b\u7684\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\u7684\u200b\u7ea6\u675f\u200b\u3002\u200b\u8bf7\u200b\u54a8\u8be2\u200b\u60a8\u200b\u7ec4\u7ec7\u200b\u7684\u200b\u6cd5\u5f8b\u200b\u90e8\u95e8\u200b\uff0c\u200b\u4ee5\u200b\u786e\u5b9a\u200b\u60a8\u200b\u662f\u5426\u200b\u53d7\u5236\u4e8e\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u3002
\u200b\u4ee5\u4e0b\u200b\u7ec4\u7ec7\u200b\u7684\u200b\u6210\u5458\u200b\u81ea\u52a8\u200b\u83b7\u5f97\u200b\u4e00\u4e2a\u200b\u4e0d\u53ef\u200b\u8f6c\u8ba9\u200b\u3001\u200b\u4e0d\u53ef\u200b\u518d\u200b\u8bb8\u53ef\u200b\u3001\u200b\u4e0d\u53ef\u200b\u5206\u53d1\u200b\u7684\u200b MIT \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u6765\u200b\u4f7f\u7528\u200b MultiMolecule\uff1a
\u200b\u6b64\u200b\u7279\u522b\u200b\u8bb8\u53ef\u200b\u88ab\u200b\u89c6\u4e3a\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7b2c\u200b 7 \u200b\u6761\u4e2d\u200b\u7684\u200b\u9644\u52a0\u200b\u6761\u6b3e\u200b\u3002 \u200b\u5b83\u200b\u4e0d\u53ef\u200b\u518d\u200b\u5206\u53d1\u200b\uff0c\u200b\u5e76\u4e14\u200b\u60a8\u200b\u88ab\u200b\u7981\u6b62\u200b\u521b\u5efa\u200b\u4efb\u4f55\u200b\u72ec\u7acb\u200b\u7684\u200b\u884d\u751f\u200b\u4f5c\u54c1\u200b\u3002 \u200b\u57fa\u4e8e\u200b\u6b64\u200b\u8bb8\u53ef\u200b\u7684\u200b\u4efb\u4f55\u200b\u4fee\u6539\u200b\u6216\u200b\u884d\u751f\u200b\u4f5c\u54c1\u200b\u5c06\u200b\u81ea\u52a8\u200b\u88ab\u200b\u89c6\u4e3a\u200b MultiMolecule \u200b\u7684\u200b\u884d\u751f\u200b\u4f5c\u54c1\u200b\uff0c\u200b\u5fc5\u987b\u200b\u9075\u5b88\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7684\u200b\u6240\u6709\u200b\u6761\u6b3e\u200b\u3002 \u200b\u8fd9\u200b\u786e\u4fdd\u200b\u4e86\u200b\u7b2c\u4e09\u65b9\u200b\u65e0\u6cd5\u200b\u7ed5\u8fc7\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\u6216\u200b\u4ece\u200b\u884d\u751f\u200b\u4f5c\u54c1\u200b\u4e2d\u200b\u521b\u5efa\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u3002
"},{"location":"zh/about/license-faq/#5-agpl-multimolecule","title":"5. \u200b\u5982\u679c\u200b\u6211\u200b\u7684\u200b\u7ec4\u7ec7\u200b\u7981\u6b62\u200b\u4f7f\u7528\u200b AGPL \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u4e0b\u200b\u7684\u200b\u4ee3\u7801\u200b\uff0c\u200b\u6211\u8be5\u200b\u5982\u4f55\u200b\u4f7f\u7528\u200b MultiMolecule\uff1f","text":"\u200b\u4e00\u4e9b\u200b\u7ec4\u7ec7\u200b\uff08\u200b\u5982\u200bGoogle\uff09\u200b\u6709\u200b\u7981\u6b62\u200b\u4f7f\u7528\u200b AGPL \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u4e0b\u200b\u4ee3\u7801\u200b\u7684\u200b\u653f\u7b56\u200b\u3002
\u200b\u5982\u679c\u200b\u60a8\u200b\u4e0e\u200b\u7981\u6b62\u200b\u4f7f\u7528\u200b AGPL \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u4ee3\u7801\u200b\u7684\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\uff0c\u200b\u60a8\u200b\u5fc5\u987b\u200b\u4ece\u200b\u6211\u4eec\u200b\u8fd9\u91cc\u200b\u83b7\u5f97\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u3002\u200b\u8bf7\u200b\u901a\u8fc7\u200b multimolecule@zyc.ai \u200b\u4e0e\u200b\u6211\u4eec\u200b\u8054\u7cfb\u200b\u4ee5\u200b\u83b7\u53d6\u200b\u66f4\u200b\u591a\u200b\u8be6\u7ec6\u4fe1\u606f\u200b\u3002
"},{"location":"zh/about/license-faq/#6-multimolecule","title":"6. \u200b\u5982\u679c\u200b\u6211\u200b\u662f\u200b\u7f8e\u56fd\u8054\u90a6\u653f\u5e9c\u200b\u7684\u200b\u96c7\u5458\u200b\uff0c\u200b\u6211\u200b\u53ef\u4ee5\u200b\u4f7f\u7528\u200b MultiMolecule \u200b\u5417\u200b\uff1f","text":"\u200b\u4e0d\u80fd\u200b\u3002
\u200b\u6839\u636e\u200b17 U.S. Code \u00a7 105\uff0c\u200b\u7f8e\u56fd\u8054\u90a6\u653f\u5e9c\u200b\u96c7\u5458\u200b\u64b0\u5199\u200b\u7684\u200b\u4ee3\u7801\u200b\u4e0d\u200b\u53d7\u200b\u7248\u6743\u4fdd\u62a4\u200b\u3002
\u200b\u56e0\u6b64\u200b\uff0c\u200b\u7f8e\u56fd\u8054\u90a6\u653f\u5e9c\u200b\u96c7\u5458\u200b\u65e0\u6cd5\u200b\u9075\u5b88\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7684\u200b\u6761\u6b3e\u200b\u3002
"},{"location":"zh/about/license-faq/#7","title":"7. \u200b\u6211\u4eec\u200b\u4f1a\u200b\u66f4\u65b0\u200b\u6b64\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\u5417\u200b\uff1f","text":"\u200b\u7b80\u800c\u8a00\u4e4b\u200b
\u200b\u662f\u200b\u7684\u200b\uff0c\u200b\u6211\u4eec\u200b\u5c06\u200b\u6839\u636e\u200b\u9700\u8981\u200b\u66f4\u65b0\u200b\u6b64\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\u4ee5\u200b\u4fdd\u6301\u200b\u4e0e\u200b\u76f8\u5173\u200b\u6cd5\u5f8b\u200b\u7684\u200b\u4e00\u81f4\u200b\u3002
\u200b\u6211\u4eec\u200b\u53ef\u80fd\u200b\u4f1a\u200b\u4e0d\u65f6\u200b\u66f4\u65b0\u200b\u6b64\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\u3002 \u200b\u66f4\u65b0\u200b\u540e\u200b\u7684\u200b\u7248\u672c\u200b\u5c06\u200b\u901a\u8fc7\u200b\u66f4\u65b0\u200b\u672c\u200b\u9875\u9762\u200b\u5e95\u90e8\u200b\u7684\u200b\u201c\u200b\u6700\u540e\u200b\u4fee\u8ba2\u200b\u65f6\u95f4\u200b\u201d\u200b\u6765\u200b\u8868\u793a\u200b\u3002 \u200b\u5982\u679c\u200b\u6211\u4eec\u200b\u8fdb\u884c\u200b\u4efb\u4f55\u200b\u91cd\u5927\u200b\u66f4\u6539\u200b\uff0c\u200b\u6211\u4eec\u200b\u5c06\u200b\u901a\u8fc7\u200b\u5728\u200b\u672c\u9875\u200b\u53d1\u5e03\u200b\u65b0\u200b\u7684\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\u6765\u200b\u901a\u77e5\u200b\u60a8\u200b\u3002 \u200b\u7531\u4e8e\u200b\u6211\u4eec\u200b\u4e0d\u200b\u6536\u96c6\u200b\u60a8\u200b\u7684\u200b\u4efb\u4f55\u200b\u8054\u7cfb\u200b\u4fe1\u606f\u200b\uff0c\u200b\u6211\u4eec\u200b\u65e0\u6cd5\u200b\u76f4\u63a5\u200b\u901a\u77e5\u200b\u60a8\u200b\u3002 \u200b\u6211\u4eec\u200b\u9f13\u52b1\u200b\u60a8\u200b\u7ecf\u5e38\u200b\u67e5\u770b\u200b\u672c\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\uff0c\u200b\u4ee5\u200b\u4e86\u89e3\u200b\u60a8\u200b\u53ef\u4ee5\u200b\u5982\u4f55\u200b\u4f7f\u7528\u200b\u6211\u4eec\u200b\u7684\u200b\u6570\u636e\u200b\u3001\u200b\u6a21\u578b\u200b\u3001\u200b\u4ee3\u7801\u200b\u3001\u200b\u914d\u7f6e\u200b\u3001\u200b\u6587\u6863\u200b\u548c\u200b\u6743\u91cd\u200b\u3002
"},{"location":"zh/data/","title":"data","text":"data
provides a collection of data processing utilities for handling data.
While datasets
is a powerful library for managing datasets, it is a general-purpose tool that may not cover all the specific functionalities of scientific applications.
The data
package is designed to complement datasets
by offering additional data processing utilities that are commonly used in scientific tasks.
from multimolecule.data import Dataset\n\ndata = Dataset(\"data/rna/5utr.csv\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"zh/data/#datasets","title":"\u4ece\u200b datasets
\u200b\u52a0\u8f7d","text":"Pythonfrom multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"zh/datasets/","title":"datasets","text":"datasets
provides a collection of widely-used datasets.
from multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"zh/models/","title":"models","text":"models
provides a collection of pre-trained models.
In the transformers
library, the names of model classes can sometimes be misleading. While these classes support both regression and classification tasks, their names typically contain xxxForSequenceClassification
, which may imply that they can only be used for classification.
To avoid this ambiguity, MultiMolecule provides a series of model classes with clear, intuitive names that reflect their intended use:
multimolecule.AutoModelForSequencePrediction
: Sequence Prediction multimolecule.AutoModelForTokenPrediction
: Token Prediction multimolecule.AutoModelForContactPrediction
: Contact Prediction Each model supports both regression and classification tasks, providing flexibility and precision for a wide range of applications.
"},{"location":"zh/models/#_2","title":"\u63a5\u89e6\u200b\u9884\u6d4b","text":"\u200b\u63a5\u89e6\u200b\u9884\u6d4b\u200b\u4e3a\u200b\u5e8f\u5217\u200b\u4e2d\u200b\u7684\u200b\u6bcf\u200b\u4e00\u5bf9\u200b\u4ee4\u724c\u200b\u5206\u914d\u200b\u4e00\u4e2a\u200b\u6807\u7b7e\u200b\u3002 \u200b\u6700\u200b\u5e38\u89c1\u200b\u7684\u200b\u63a5\u89e6\u200b\u9884\u6d4b\u200b\u4efb\u52a1\u200b\u4e4b\u4e00\u200b\u662f\u200b\u86cb\u767d\u8d28\u200b\u8ddd\u79bb\u200b\u56fe\u200b\u9884\u6d4b\u200b\u3002 \u200b\u86cb\u767d\u8d28\u200b\u8ddd\u79bb\u200b\u56fe\u200b\u9884\u6d4b\u200b\u8bd5\u56fe\u200b\u627e\u5230\u200b\u4e09\u7ef4\u200b\u86cb\u767d\u8d28\u200b\u7ed3\u6784\u200b\u4e2d\u200b\u6240\u6709\u200b\u53ef\u80fd\u200b\u7684\u200b\u6c28\u57fa\u9178\u200b\u6b8b\u57fa\u200b\u5bf9\u200b\u4e4b\u95f4\u200b\u7684\u200b\u8ddd\u79bb\u200b
"},{"location":"zh/models/#_3","title":"\u6838\u82f7\u9178\u200b\u9884\u6d4b","text":"\u200b\u4e0e\u200b Token Classification \u200b\u7c7b\u4f3c\u200b\uff0c\u200b\u4f46\u200b\u5982\u679c\u200b\u6a21\u578b\u200b\u914d\u7f6e\u200b\u4e2d\u200b\u5b9a\u4e49\u200b\u4e86\u200b <bos>
\u200b\u6216\u200b <eos>
\u200b\u4ee4\u724c\u200b\uff0c\u200b\u5219\u200b\u5c06\u200b\u5176\u200b\u79fb\u9664\u200b\u3002
<bos>
\u200b\u548c\u200b <eos>
\u200b\u4ee4\u724c\u200b
\u200b\u5728\u200b MultiMolecule \u200b\u63d0\u4f9b\u200b\u7684\u200b\u5206\u8bcd\u5668\u200b\u4e2d\u200b\uff0c<bos>
\u200b\u4ee4\u724c\u200b\u6307\u5411\u200b <cls>
\u200b\u4ee4\u724c\u200b\uff0c<sep>
\u200b\u4ee4\u724c\u200b\u6307\u5411\u200b <eos>
\u200b\u4ee4\u724c\u200b\u3002
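As a quick sanity check of the note above (this sketch is not part of the original page), you can inspect the special-token attributes of an RnaTokenizer; the attributes used are the standard Hugging Face tokenizer fields:
Pythonfrom multimolecule.models import RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\n# <bos> is expected to alias <cls>, and <sep> to alias <eos>\nprint(tokenizer.bos_token, tokenizer.cls_token)\nprint(tokenizer.sep_token, tokenizer.eos_token)\n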
Build with multimolecule.AutoModel
","text":"Pythonfrom transformers import AutoTokenizer\n\nfrom multimolecule import AutoModelForSequencePrediction\n\nmodel = AutoModelForSequencePrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"zh/models/#_5","title":"\u76f4\u63a5\u200b\u8bbf\u95ee","text":"\u200b\u6240\u6709\u200b\u6a21\u578b\u200b\u53ef\u4ee5\u200b\u901a\u8fc7\u200b from_pretrained
\u200b\u65b9\u6cd5\u200b\u76f4\u63a5\u200b\u52a0\u8f7d\u200b\u3002
from multimolecule.models import RnaFmForTokenPrediction, RnaTokenizer\n\nmodel = RnaFmForTokenPrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"zh/models/#transformersautomodel","title":"\u4f7f\u7528\u200b transformers.AutoModel
\u200b\u6784\u5efa","text":"\u200b\u867d\u7136\u200b\u6211\u4eec\u200b\u4e3a\u200b\u6a21\u578b\u200b\u7c7b\u200b\u4f7f\u7528\u200b\u4e86\u200b\u4e0d\u540c\u200b\u7684\u200b\u547d\u540d\u200b\u7ea6\u5b9a\u200b\uff0c\u200b\u4f46\u200b\u6a21\u578b\u200b\u4ecd\u7136\u200b\u6ce8\u518c\u200b\u5230\u200b\u76f8\u5e94\u200b\u7684\u200b transformers.AutoModel
\u200b\u4e2d\u200b\u3002
from transformers import AutoModelForSequenceClassification, AutoTokenizer\n\nimport multimolecule # noqa: F401\n\nmodel = AutoModelForSequenceClassification.from_pretrained(\"multimolecule/mrnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/mrnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
import multimolecule before use
Please note that before building models with transformers.AutoModel
, you must first import multimolecule
. The model registration is done in the multimolecule
package; the models are not available in the transformers
package.
If you use transformers.AutoModel
without first running import multimolecule
, the following error will be raised:
ValueError: The checkpoint you are trying to load has model type `rnafm` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.\n
"},{"location":"zh/models/#_6","title":"\u521d\u59cb\u5316\u200b\u4e00\u4e2a\u200b\u9999\u8349\u200b\u6a21\u578b","text":"\u200b\u4f60\u200b\u4e5f\u200b\u53ef\u4ee5\u200b\u4f7f\u7528\u200b\u6a21\u578b\u200b\u7c7b\u200b\u521d\u59cb\u5316\u200b\u4e00\u4e2a\u200b\u57fa\u7840\u200b\u6a21\u578b\u200b\u3002
Pythonfrom multimolecule.models import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n\nconfig = RnaFmConfig()\nmodel = RnaFmForTokenPrediction(config)\ntokenizer = RnaTokenizer()\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"zh/models/#_7","title":"\u53ef\u7528\u200b\u6a21\u578b","text":""},{"location":"zh/models/#dna","title":"\u8131\u6c27\u6838\u7cd6\u6838\u9178\u200b\uff08DNA\uff09","text":"module
\u200b\u63d0\u4f9b\u200b\u4e86\u200b\u4e00\u7cfb\u5217\u200b\u9884\u5b9a\u200b\u4e49\u200b\u6a21\u5757\u200b\uff0c\u200b\u4f9b\u200b\u7528\u6237\u200b\u5b9e\u73b0\u200b\u81ea\u5df1\u200b\u7684\u200b\u67b6\u6784\u200b\u3002
MultiMolecule \u200b\u5efa\u7acb\u200b\u5728\u200b \u200b\u751f\u6001\u7cfb\u7edf\u200b\u4e4b\u4e0a\u200b\uff0c\u200b\u62e5\u62b1\u200b\u7c7b\u4f3c\u200b\u7684\u200b\u8bbe\u8ba1\u200b\u7406\u5ff5\u200b\uff1a\u200b\u4e0d\u8981\u200b \u200b\u91cd\u590d\u200b\u81ea\u5df1\u200b\u3002 \u200b\u6211\u4eec\u200b\u9075\u5faa\u200b \u200b\u5355\u4e00\u200b\u6a21\u578b\u200b\u6587\u4ef6\u200b\u7b56\u7565\u200b
\uff0c\u200b\u5176\u4e2d\u200b models
\u200b\u5305\u4e2d\u200b\u7684\u200b\u6bcf\u4e2a\u200b\u6a21\u578b\u200b\u90fd\u200b\u5305\u542b\u200b\u4e00\u4e2a\u200b\u4e14\u200b\u4ec5\u200b\u6709\u200b\u4e00\u4e2a\u200b\u63cf\u8ff0\u200b\u7f51\u7edc\u200b\u8bbe\u8ba1\u200b\u7684\u200b modeling.py
\u200b\u6587\u4ef6\u200b\u3002
module
package aims to provide simple, reusable modules that are kept consistent across multiple models. This approach minimizes code duplication and promotes clean, maintainable code.
module
includes components that are commonly used across different models, such as SequencePredictionHead
. This reduces redundancy and simplifies the development process. module
package focuses on simpler components, leaving complex, model-specific variations to be defined in each model's modeling.py
. SequencePredictionHead
, TokenPredictionHead
, and ContactPredictionHead
. SinusoidalEmbedding
and RotaryEmbedding
. embeddings
provides a collection of predefined positional encodings.
heads
provides a collection of model prediction heads for different tasks.
heads
accept ModelOutput
, dict
, or tuple
as input. They automatically locate the model output required for the prediction and process it accordingly.
Some prediction heads may require additional information, such as the attention_mask
or the input_ids
, for example ContactPredictionHead
. These additional arguments can be passed in as positional or keyword arguments.
Please note that heads
uses the same ModelOutput
conventions as Transformers. If the model output is a tuple
, we consider the first element to be the pooler_output
, the second element to be the last_hidden_state
, and the last element to be the attention_map
. It is the user's responsibility to ensure that the model output is in the correct format.
If the model output is a ModelOutput
or a dict
, heads
will look up HeadConfig.output_name
from the model output. In the HeadConfig
, you can specify output_name
so that heads
can correctly locate the required tensor.
tokenisers
provides a collection of predefined tokenizers.
A tokenizer is a class that converts a sequence of nucleotides or amino acids into a sequence of indices. It is used to preprocess the input sequences before feeding them to a model.
Please refer to Tokenizer for more details.
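The snippet below is a small sketch that is not part of the original page; it shows this conversion with the RnaTokenizer used elsewhere in the documentation (the exact indices printed depend on the checkpoint's vocabulary):
Pythonfrom multimolecule.models import RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n\n# Convert a nucleotide sequence into a sequence of vocabulary indices\nencoding = tokenizer(\"UAGCGUAUCAGACUGAUGUUG\")\nprint(encoding[\"input_ids\"])\nprint(tokenizer.convert_ids_to_tokens(encoding[\"input_ids\"]))\n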
"},{"location":"zh/tokenisers/#_1","title":"\u53ef\u7528\u200b\u4ee4\u724c\u200b\u5668","text":"Accelerate Molecular Biology Research with Machine Learning
"},{"location":"#introduction","title":"Introduction","text":"
Welcome to MultiMolecule (\u200b\u6d66\u539f\u200b), a foundational library designed to accelerate scientific research in molecular biology through machine learning. MultiMolecule provides a comprehensive yet flexible set of tools for researchers aiming to leverage AI with ease, focusing on biomolecular data (RNA, DNA, and protein).
"},{"location":"#overview","title":"Overview","text":"MultiMolecule is built with flexibility and ease of use in mind. Its modular design allows you to utilize only the components you need, integrating seamlessly into your existing workflows without adding unnecessary complexity.
data
: Smart Dataset
that automatically infer tasks\u2014including their level (sequence, token, contact) and type (classification, regression). Provides multi-task datasets and samplers to facilitate multitask learning without additional configuration.datasets
: A collection of widely-used biomolecular datasets.module
: Modular neural network building blocks, including embeddings, heads, and criterions for constructing custom models.models
: Implementation of state-of-the-art pre-trained models in molecular biology.tokenisers
: Tokenizers to convert DNA, RNA, protein and other sequences to one-hot encodings.Install the most recent stable version on PyPI:
Bashpip install multimolecule\n
Install the latest version from the source:
Bashpip install git+https://github.com/DLS5-Omics/MultiMolecule\n
"},{"location":"#citation","title":"Citation","text":"If you use MultiMolecule in your research, please cite us as follows:
BibTeX@software{chen_2024_12638419,\n author = {Chen, Zhiyuan and Zhu, Sophia Y.},\n title = {MultiMolecule},\n doi = {10.5281/zenodo.12638419},\n publisher = {Zenodo},\n url = {https://doi.org/10.5281/zenodo.12638419},\n year = 2024,\n month = may,\n day = 4\n}\n
"},{"location":"#license","title":"License","text":"We believe openness is the Foundation of Research.
MultiMolecule is licensed under the GNU Affero General Public License.
Please join us in building an open research community.
SPDX-License-Identifier: AGPL-3.0-or-later
Developed by DanLing on Earth
We are a community of developers, designers, and others from around the world who are working together to make deep learning more accessible.
We are a community of individuals who seek to push the boundaries of what is possible with deep learning.
We are passionate about Deep Learning and the people who use it.
We are DanLing.
"},{"location":"about/license-faq/","title":"License FAQ","text":"This License FAQ explains the terms and conditions under which you may use the data, models, code, configuration, documentation, and weights provided by the DanLing Team (also known as DanLing) (\u2018we\u2019, \u2018us\u2019, or \u2018our\u2019). It serves as an addendum to our License.
"},{"location":"about/license-faq/#0-summary-of-key-points","title":"0. Summary of Key Points","text":"This summary provides key points from our license, but you can find out more details about any of these topics by clicking the link following each key point and by reading the full license.
What constitutes the \u2018source code\u2019 in MultiMolecule?
We consider everything in our repositories to be source code, including data, models, code, configuration, and documentation.
What constitutes the \u2018source code\u2019 in MultiMolecule?
Can I publish research papers using MultiMolecule?
It depends.
You can publish research papers on fully open access journals and conferences or preprint servers following the terms of the License.
You must obtain a separate license from us to publish research papers on closed access journals and conferences.
Can I publish research papers using MultiMolecule?
Can I use MultiMolecule for commercial purposes?
Yes, you can use MultiMolecule for commercial purposes under the terms of the License.
Can I use MultiMolecule for commercial purposes?
Do people affiliated with certain organizations have specific license terms?
Yes, people affiliated with certain organizations have specific license terms.
Do people affiliated with certain organizations have specific license terms?
"},{"location":"about/license-faq/#1-what-constitutes-the-source-code-in-multimolecule","title":"1. What constitutes the \u201csource code\u201d in MultiMolecule?","text":"We consider everything in our repositories to be source code.
The training process of machine learning models is viewed similarly to the compilation process of traditional software. As such, the model, code, configuration, documentation, and data used for training are all part of the source code, while the trained model weights are part of the object code.
We also consider research papers and manuscripts a special form of documentation, which are also part of the source code.
"},{"location":"about/license-faq/#2-can-i-publish-research-papers-using-multimolecule","title":"2. Can I publish research papers using MultiMolecule?","text":"Since research papers are considered a form of source code, publishers are legally required to open-source all materials on their server to comply with the License if they publish papers using MultiMolecule. This is generally impractical for most publishers.
As a special exemption under section 7 of the License, we grant permission to publish research papers using MultiMolecule in fully open access journals, conferences, or preprint servers that do not charge any fee from authors, provided all published manuscripts are made available under the GNU Free Documentation License (GFDL), or a Creative Commons license, or an OSI-approved license that permits the sharing of manuscripts.
As a special exemption under section 7 of the License, we grant permission to publish research papers using MultiMolecule in certain non-profit journals, conferences, or preprint servers. Currently, the non-profit journals, conferences, or preprint servers we allow include:
For publishing in closed access journals or conferences, you must obtain a separate license from us. This typically involves co-authorship, a fee to support the project, or both. Contact us at multimolecule@zyc.ai for more information.
While not mandatory, we recommend citing the MultiMolecule project in your research papers.
"},{"location":"about/license-faq/#3-can-i-use-multimolecule-for-commercial-purposes","title":"3. Can I use MultiMolecule for commercial purposes?","text":"Yes, MultiMolecule can be used for commercial purposes under the License. However, you must open-source any modifications to the source code and make them available under the License.
If you prefer to use MultiMolecule for commercial purposes without open-sourcing your modifications, you must obtain a separate license from us. This typically involves a fee to support the project. Contact us at multimolecule@zyc.ai for further details.
"},{"location":"about/license-faq/#4-do-people-affiliated-with-certain-organizations-have-specific-license-terms","title":"4. Do people affiliated with certain organizations have specific license terms?","text":"YES!
If you are affiliated with an organization that has a separate license agreement with us, you may be subject to different license terms. Please consult your organization\u2019s legal department to determine if you are subject to a separate license agreement.
Members of the following organizations automatically receive a non-transferable, non-sublicensable, and non-distributable MIT License to use MultiMolecule:
This special license is considered an additional term under section 7 of the License. It is not redistributable, and you are prohibited from creating any independent derivative works. Any modifications or derivative works based on this license are automatically considered derivative works of MultiMolecule and must comply with all the terms of the License. This ensures that third parties cannot bypass the license terms or create separate licenses from derivative works.
"},{"location":"about/license-faq/#5-how-can-i-use-multimolecule-if-my-organization-forbids-the-use-of-code-under-the-agpl-license","title":"5. How can I use MultiMolecule if my organization forbids the use of code under the AGPL License?","text":"Some organizations, such as Google, have policies that prohibit the use of code under the AGPL License.
If you are affiliated with an organization that forbids the use of AGPL-licensed code, you must obtain a separate license from us. Contact us at multimolecule@zyc.ai for more information.
"},{"location":"about/license-faq/#6-can-i-use-multimolecule-if-i-am-a-federal-employee-of-the-united-states-government","title":"6. Can I use MultiMolecule if I am a federal employee of the United States Government?","text":"No.
Code written by federal employees of the United States Government is not protected by copyright under 17 U.S. Code \u00a7 105.
As a result, federal employees of the United States Government cannot comply with the terms of the License.
"},{"location":"about/license-faq/#7-do-we-make-updates-to-this-faq","title":"7. Do we make updates to this FAQ?","text":"In Short
Yes, we will update this FAQ as necessary to stay compliant with relevant laws.
We may update this license FAQ from time to time. The updated version will be indicated by an updated \u2018Last Revised Time\u2019 at the bottom of this license FAQ. If we make any material changes, we will notify you by posting the new license FAQ on this page. We are unable to notify you directly as we do not collect any contact information from you. We encourage you to review this license FAQ frequently to stay informed of how you can use our data, models, code, configuration, documentation, and weights.
"},{"location":"about/license/","title":"GNU AFFERO GENERAL PUBLIC LICENSE","text":"Version 3, 19 November 2007
Copyright (C) 2007 Free Software Foundation, Inc. https://fsf.org/
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
"},{"location":"about/license/#preamble","title":"Preamble","text":"The GNU Affero General Public License is a free, copyleft license for software and other kinds of works, specifically designed to ensure cooperation with the community in the case of network server software.
The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, our General Public Licenses are intended to guarantee your freedom to share and change all versions of a program\u2013to make sure it remains free software for all its users.
When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.
Developers that use our General Public Licenses protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License which gives you legal permission to copy, distribute and/or modify the software.
A secondary benefit of defending all users\u2019 freedom is that improvements made in alternate versions of the program, if they receive widespread use, become available for other developers to incorporate. Many developers of free software are heartened and encouraged by the resulting cooperation. However, in the case of software used on network servers, this result may fail to come about. The GNU General Public License permits making a modified version and letting the public access it on a server without ever releasing its source code to the public.
The GNU Affero General Public License is designed specifically to ensure that, in such cases, the modified source code becomes available to the community. It requires the operator of a network server to provide the source code of the modified version running there to the users of that server. Therefore, public use of a modified version, on a publicly accessible server, gives the public access to the source code of the modified version.
An older license, called the Affero General Public License and published by Affero, was designed to accomplish similar goals. This is a different license, not a version of the Affero GPL, but Affero has released a new version of the Affero GPL which permits relicensing under this license.
The precise terms and conditions for copying, distribution and modification follow.
"},{"location":"about/license/#terms-and-conditions","title":"TERMS AND CONDITIONS","text":""},{"location":"about/license/#0-definitions","title":"0. Definitions.","text":"\u201cThis License\u201d refers to version 3 of the GNU Affero General Public License.
\u201cCopyright\u201d also means copyright-like laws that apply to other kinds of works, such as semiconductor masks.
\u201cThe Program\u201d refers to any copyrightable work licensed under this License. Each licensee is addressed as \u201cyou\u201d. \u201cLicensees\u201d and \u201crecipients\u201d may be individuals or organizations.
To \u201cmodify\u201d a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a \u201cmodified version\u201d of the earlier work or a work \u201cbased on\u201d the earlier work.
A \u201ccovered work\u201d means either the unmodified Program or a work based on the Program.
To \u201cpropagate\u201d a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well.
To \u201cconvey\u201d a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays \u201cAppropriate Legal Notices\u201d to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion.
"},{"location":"about/license/#1-source-code","title":"1. Source Code.","text":"The \u201csource code\u201d for a work means the preferred form of the work for making modifications to it. \u201cObject code\u201d means any non-source form of a work.
A \u201cStandard Interface\u201d means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language.
The \u201cSystem Libraries\u201d of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A \u201cMajor Component\u201d, in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it.
The \u201cCorresponding Source\u201d for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work\u2019s System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work.
The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source.
The Corresponding Source for a work in source code form is that same work.
"},{"location":"about/license/#2-basic-permissions","title":"2. Basic Permissions.","text":"All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary.
"},{"location":"about/license/#3-protecting-users-legal-rights-from-anti-circumvention-law","title":"3. Protecting Users\u2019 Legal Rights From Anti-Circumvention Law.","text":"No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures.
When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work\u2019s users, your or third parties\u2019 legal rights to forbid circumvention of technological measures.
"},{"location":"about/license/#4-conveying-verbatim-copies","title":"4. Conveying Verbatim Copies.","text":"You may convey verbatim copies of the Program\u2019s source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee.
"},{"location":"about/license/#5-conveying-modified-source-versions","title":"5. Conveying Modified Source Versions.","text":"You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:
A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an \u201caggregate\u201d if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation\u2019s users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate.
"},{"location":"about/license/#6-conveying-non-source-forms","title":"6. Conveying Non-Source Forms.","text":"You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways:
A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work.
A \u201cUser Product\u201d is either (1) a \u201cconsumer product\u201d, which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, \u201cnormally used\u201d refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product.
\u201cInstallation Information\u201d for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made.
If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM).
The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying.
"},{"location":"about/license/#7-additional-terms","title":"7. Additional Terms.","text":"\u201cAdditional permissions\u201d are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms:
All other non-permissive additional terms are considered \u201cfurther restrictions\u201d within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way.
"},{"location":"about/license/#8-termination","title":"8. Termination.","text":"You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11).
However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice.
Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10.
"},{"location":"about/license/#9-acceptance-not-required-for-having-copies","title":"9. Acceptance Not Required for Having Copies.","text":"You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so.
"},{"location":"about/license/#10-automatic-licensing-of-downstream-recipients","title":"10. Automatic Licensing of Downstream Recipients.","text":"Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License.
An \u201centity transaction\u201d is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party\u2019s predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it.
"},{"location":"about/license/#11-patents","title":"11. Patents.","text":"A \u201ccontributor\u201d is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor\u2019s \u201ccontributor version\u201d.
A contributor\u2019s \u201cessential patent claims\u201d are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, \u201ccontrol\u201d includes the right to grant patent sublicenses in a manner consistent with the requirements of this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor\u2019s essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version.
In the following three paragraphs, a \u201cpatent license\u201d is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To \u201cgrant\u201d such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party.
If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. \u201cKnowingly relying\u201d means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient\u2019s use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it.
A patent license is \u201cdiscriminatory\u201d if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law.
"},{"location":"about/license/#12-no-surrender-of-others-freedom","title":"12. No Surrender of Others\u2019 Freedom.","text":"If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program.
"},{"location":"about/license/#13-remote-network-interaction-use-with-the-gnu-general-public-license","title":"13. Remote Network Interaction; Use with the GNU General Public License.","text":"Notwithstanding any other provision of this License, if you modify the Program, your modified version must prominently offer all users interacting with it remotely through a computer network (if your version supports such interaction) an opportunity to receive the Corresponding Source of your version by providing access to the Corresponding Source from a network server at no charge, through some standard or customary means of facilitating copying of software. This Corresponding Source shall include the Corresponding Source for any work covered by version 3 of the GNU General Public License that is incorporated pursuant to the following paragraph.
Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the work with which it is combined will remain governed by version 3 of the GNU General Public License.
"},{"location":"about/license/#14-revised-versions-of-this-license","title":"14. Revised Versions of this License.","text":"The Free Software Foundation may publish revised and/or new versions of the GNU Affero General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.
Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU Affero General Public License \u201cor any later version\u201d applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU Affero General Public License, you may choose any version ever published by the Free Software Foundation.
If the Program specifies that a proxy can decide which future versions of the GNU Affero General Public License can be used, that proxy\u2019s public statement of acceptance of a version permanently authorizes you to choose that version for the Program.
Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version.
"},{"location":"about/license/#15-disclaimer-of-warranty","title":"15. Disclaimer of Warranty.","text":"THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM \u201cAS IS\u201d WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
"},{"location":"about/license/#16-limitation-of-liability","title":"16. Limitation of Liability.","text":"IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
"},{"location":"about/license/#17-interpretation-of-sections-15-and-16","title":"17. Interpretation of Sections 15 and 16.","text":"If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
"},{"location":"about/license/#how-to-apply-these-terms-to-your-new-programs","title":"How to Apply These Terms to Your New Programs","text":"If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the \u201ccopyright\u201d line and a pointer to where the full notice is found.
Text Only <one line to give the program's name and a brief idea of what it does.>\n Copyright (C) <year> <name of author>\n\n This program is free software: you can redistribute it and/or modify\n it under the terms of the GNU Affero General Public License as\n published by the Free Software Foundation, either version 3 of the\n License, or (at your option) any later version.\n\n This program is distributed in the hope that it will be useful,\n but WITHOUT ANY WARRANTY; without even the implied warranty of\n MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\n GNU Affero General Public License for more details.\n\n You should have received a copy of the GNU Affero General Public License\n along with this program. If not, see <https://www.gnu.org/licenses/>.\n
Also add information on how to contact you by electronic and paper mail.
If your software can interact with users remotely through a computer network, you should also make sure that it provides a way for users to get its source. For example, if your program is a web application, its interface could display a \u201cSource\u201d link that leads users to an archive of the code. There are many ways you could offer source, and different solutions will be better for different programs; see section 13 for the specific requirements.
You should also get your employer (if you work as a programmer) or school, if any, to sign a \u201ccopyright disclaimer\u201d for the program, if necessary. For more information on this, and how to apply and follow the GNU AGPL, see https://www.gnu.org/licenses/.
"},{"location":"data/","title":"data","text":"data
provides a collection of data processing utilities.
While datasets
is a powerful library for managing datasets, it is a general-purpose tool that may not cover all of the specialized functionality needed in scientific applications.
The data
package is designed to complement datasets
by offering additional data processing utilities that are commonly used in scientific tasks.
from multimolecule.data import Dataset\n\ndata = Dataset(\"data/rna/5utr.csv\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"data/#load-from-datasets","title":"Load from datasets
","text":"Pythonfrom multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"data/dataset/","title":"Dataset","text":""},{"location":"data/dataset/#multimolecule.data.Dataset","title":"multimolecule.data.Dataset","text":" Bases: Dataset
The base class for all datasets.
Dataset is a subclass of datasets.Dataset
that provides additional functionality for handling structured data. It has three main features: column identification, which finds the special columns (sequence and structure columns) in the dataset; tokenization, which tokenizes the sequence columns using a pretrained tokenizer; and task inference, which infers the task type and level of each label column.
Attributes:
Name Type Descriptiontasks
NestedDict
A nested dictionary of the inferred tasks for each label column in the dataset.
tokenizer
PreTrainedTokenizerBase
The pretrained tokenizer to use for tokenization.
truncation
bool
Whether to truncate sequences that exceed the maximum length of the tokenizer.
max_seq_length
int
The maximum length of the input sequences.
data_cols
List
The names of all columns in the dataset.
feature_cols
List
The names of the feature columns in the dataset.
label_cols
List
The names of the label columns in the dataset.
sequence_cols
List
The names of the sequence columns in the dataset.
column_names_map
Mapping[str, str] | None
A mapping of column names to new column names.
preprocess
bool
Whether to preprocess the dataset.
Parameters:
Name Type Description DefaultTable | DataFrame | dict | list | str
The dataset. This can be a path to a file, a tag on the Hugging Face Hub, a pyarrow.Table, a dict, a list, or a pandas.DataFrame.
requiredNamedSplit
The split of the dataset.
requiredPreTrainedTokenizerBase | None
A pretrained tokenizer to use for tokenization. Either tokenizer
or pretrained
must be specified.
None
str | None
The name of a pretrained tokenizer to use for tokenization. Either tokenizer
or pretrained
must be specified.
None
List | None
The names of the feature columns in the dataset. Will be inferred automatically if not specified.
None
List | None
The names of the label columns in the dataset. Will be inferred automatically if not specified.
None
List | None
The names of the ID columns in the dataset. Will be inferred automatically if not specified.
None
bool | None
Whether to preprocess the dataset. Preprocessing involves pre-tokenizing the sequences using the tokenizer. Defaults to True
.
None
bool | None
Whether to automatically rename sequence columns to the standard name. Only works when there is exactly one sequence column. You can control the naming through multimolecule.defaults.SEQUENCE_COL_NAME
. For more refined control, use column_names_map
.
None
Whether to automatically rename the label column to the standard name. Only works when there is exactly one label column. You can control the naming through multimolecule.defaults.LABEL_COL_NAME
. For more refined control, use column_names_map
.
Mapping[str, str] | None
A mapping of column names to new column names. This is useful for renaming columns to inputs that are expected by a model. Defaults to None
.
None
bool | None
Whether to truncate sequences that exceed the maximum length of the tokenizer. Defaults to False
.
None
int | None
The maximum length of the input sequences. Defaults to the model_max_length
of the tokenizer.
None
Mapping[str, Task] | None
A mapping of column names to tasks. Will be inferred automatically if not specified.
None
Mapping[str, int] | None
A mapping of column names to discrete mappings. This is useful for mapping the raw value to nominal value in classification tasks. Will be inferred automatically if not specified.
None
str
How to handle NaN and inf values in the dataset. Can be \u201cignore\u201d, \u201cerror\u201d, \u201cdrop\u201d, or \u201cfill\u201d. Defaults to \u201cignore\u201d.
'ignore'
str | int | float
The value to fill NaN and inf values with. Defaults to 0.
0
DatasetInfo | None
The dataset info.
None
Table | None
The indices table.
None
str | None
The fingerprint of the dataset.
None
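For illustration, the following is a minimal sketch combining several of these arguments; the CSV path and column names are hypothetical placeholders rather than files shipped with MultiMolecule.
Pythonfrom multimolecule.data import Dataset\n\n# hypothetical CSV with columns \"id\", \"sequence\" and \"label\"\ndata = Dataset(\n    \"data/rna/example.csv\",          # local file; a Hugging Face Hub tag also works\n    split=\"train\",\n    pretrained=\"multimolecule/rna\",  # the tokenizer is loaded from this name\n    feature_cols=[\"sequence\"],       # inferred automatically if omitted\n    label_cols=[\"label\"],            # inferred automatically if omitted\n    truncation=True,                 # truncate sequences that exceed max_seq_length\n    max_seq_length=512,              # overrides the tokenizer's model_max_length\n    nan_process=\"drop\",              # drop rows containing NaN / inf values\n)\n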
Source code in multimolecule/data/dataset.py
Pythonclass Dataset(datasets.Dataset):\n r\"\"\"\n The base class for all datasets.\n\n Dataset is a subclass of [`datasets.Dataset`][] that provides additional functionality for handling structured data.\n It has three main features:\n\n - column identification: identify the special columns (sequence and structure columns) in the dataset.\n - tokenization: tokenize the sequence columns in the dataset using a pretrained tokenizer.\n - task inference: infer the task type and level of each label column in the dataset.\n\n Attributes:\n tasks: A nested dictionary of the inferred tasks for each label column in the dataset.\n tokenizer: The pretrained tokenizer to use for tokenization.\n truncation: Whether to truncate sequences that exceed the maximum length of the tokenizer.\n max_seq_length: The maximum length of the input sequences.\n data_cols: The names of all columns in the dataset.\n feature_cols: The names of the feature columns in the dataset.\n label_cols: The names of the label columns in the dataset.\n sequence_cols: The names of the sequence columns in the dataset.\n column_names_map: A mapping of column names to new column names.\n preprocess: Whether to preprocess the dataset.\n\n Args:\n data: The dataset. This can be a path to a file, a tag on the Hugging Face Hub, a pyarrow.Table,\n a [dict][], a [list][], or a [pandas.DataFrame][].\n split: The split of the dataset.\n tokenizer: A pretrained tokenizer to use for tokenization.\n Either `tokenizer` or `pretrained` must be specified.\n pretrained: The name of a pretrained tokenizer to use for tokenization.\n Either `tokenizer` or `pretrained` must be specified.\n feature_cols: The names of the feature columns in the dataset.\n Will be inferred automatically if not specified.\n label_cols: The names of the label columns in the dataset.\n Will be inferred automatically if not specified.\n id_cols: The names of the ID columns in the dataset.\n Will be inferred automatically if not specified.\n preprocess: Whether to preprocess the dataset.\n Preprocessing involves pre-tokenizing the sequences using the tokenizer.\n Defaults to `True`.\n auto_rename_sequence_col: Whether to automatically rename sequence columns to standard name.\n Only works when there is exactly one sequence column\n You can control the naming through `multimolecule.defaults.SEQUENCE_COL_NAME`.\n For more refined control, use `column_names_map`.\n auto_rename_label_cols: Whether to automatically rename label column to standard name.\n Only works when there is exactly one label column.\n You can control the naming through `multimolecule.defaults.LABEL_COL_NAME`.\n For more refined control, use `column_names_map`.\n column_names_map: A mapping of column names to new column names.\n This is useful for renaming columns to inputs that are expected by a model.\n Defaults to `None`.\n truncation: Whether to truncate sequences that exceed the maximum length of the tokenizer.\n Defaults to `False`.\n max_seq_length: The maximum length of the input sequences.\n Defaults to the `model_max_length` of the tokenizer.\n tasks: A mapping of column names to tasks.\n Will be inferred automatically if not specified.\n discrete_map: A mapping of column names to discrete mappings.\n This is useful for mapping the raw value to nominal value in classification tasks.\n Will be inferred automatically if not specified.\n nan_process: How to handle NaN and inf values in the dataset.\n Can be \"ignore\", \"error\", \"drop\", or \"fill\". 
Defaults to \"ignore\".\n fill_value: The value to fill NaN and inf values with.\n Defaults to 0.\n info: The dataset info.\n indices_table: The indices table.\n fingerprint: The fingerprint of the dataset.\n \"\"\"\n\n tokenizer: PreTrainedTokenizerBase\n truncation: bool = False\n max_seq_length: int\n seq_length_offset: int = 0\n\n _id_cols: List\n _feature_cols: List\n _label_cols: List\n\n _sequence_cols: List\n _secondary_structure_cols: List\n\n _tasks: NestedDict[str, Task]\n _discrete_map: Mapping\n\n preprocess: bool = True\n auto_rename_sequence_col: bool = True\n auto_rename_label_col: bool = False\n column_names_map: Mapping[str, str] | None = None\n ignored_cols: List[str] = []\n\n def __init__(\n self,\n data: Table | DataFrame | dict | list | str,\n split: datasets.NamedSplit,\n tokenizer: PreTrainedTokenizerBase | None = None,\n pretrained: str | None = None,\n feature_cols: List | None = None,\n label_cols: List | None = None,\n id_cols: List | None = None,\n preprocess: bool | None = None,\n auto_rename_sequence_col: bool | None = None,\n auto_rename_label_col: bool | None = None,\n column_names_map: Mapping[str, str] | None = None,\n truncation: bool | None = None,\n max_seq_length: int | None = None,\n tasks: Mapping[str, Task] | None = None,\n discrete_map: Mapping[str, int] | None = None,\n nan_process: str = \"ignore\",\n fill_value: str | int | float = 0,\n info: datasets.DatasetInfo | None = None,\n indices_table: Table | None = None,\n fingerprint: str | None = None,\n ignored_cols: List[str] | None = None,\n ):\n self._tasks = NestedDict()\n if tasks is not None:\n self.tasks = tasks\n if discrete_map is not None:\n self._discrete_map = discrete_map\n arrow_table = self.build_table(\n data, split, feature_cols, label_cols, nan_process=nan_process, fill_value=fill_value\n )\n super().__init__(\n arrow_table=arrow_table, split=split, info=info, indices_table=indices_table, fingerprint=fingerprint\n )\n self.identify_special_cols(feature_cols=feature_cols, label_cols=label_cols, id_cols=id_cols)\n self.post(\n tokenizer=tokenizer,\n pretrained=pretrained,\n preprocess=preprocess,\n truncation=truncation,\n max_seq_length=max_seq_length,\n auto_rename_sequence_col=auto_rename_sequence_col,\n auto_rename_label_col=auto_rename_label_col,\n column_names_map=column_names_map,\n )\n self.ignored_cols = ignored_cols or self.id_cols\n self.train = split == datasets.Split.TRAIN\n\n def build_table(\n self,\n data: Table | DataFrame | dict | str,\n split: datasets.NamedSplit,\n feature_cols: List | None = None,\n label_cols: List | None = None,\n nan_process: str | None = \"ignore\",\n fill_value: str | int | float = 0,\n ) -> datasets.table.Table:\n if isinstance(data, str):\n try:\n data = datasets.load_dataset(data, split=split).data\n except FileNotFoundError:\n data = dl.load_pandas(data)\n if isinstance(data, DataFrame):\n data = data.loc[:, ~data.columns.str.contains(\"^Unnamed\")]\n data = pa.Table.from_pandas(data, preserve_index=False)\n elif isinstance(data, dict):\n data = pa.Table.from_pydict(data)\n elif isinstance(data, list):\n data = pa.Table.from_pylist(data)\n elif isinstance(data, DataFrame):\n data = pa.Table.from_pandas(data, preserve_index=False)\n if feature_cols is not None and label_cols is not None:\n data = data.select(feature_cols + label_cols)\n data = self.process_nan(data, nan_process=nan_process, fill_value=fill_value)\n return data\n\n def post(\n self,\n tokenizer: PreTrainedTokenizerBase | None = None,\n pretrained: str | None = None,\n 
max_seq_length: int | None = None,\n truncation: bool | None = None,\n preprocess: bool | None = None,\n auto_rename_sequence_col: bool | None = None,\n auto_rename_label_col: bool | None = None,\n column_names_map: Mapping[str, str] | None = None,\n ) -> None:\n r\"\"\"\n Perform pre-processing steps after initialization.\n\n It first identifies the special columns (sequence and structure columns) in the dataset.\n Then it sets the feature and label columns based on the input arguments.\n If `auto_rename_sequence_col` is `True`, it will automatically rename the sequence column.\n If `auto_rename_label_col` is `True`, it will automatically rename the label column.\n Finally, it sets the [`transform`][datasets.Dataset.set_transform] function based on the `preprocess` flag.\n \"\"\"\n if tokenizer is None:\n if pretrained is None:\n raise ValueError(\"tokenizer and pretrained can not be both None.\")\n tokenizer = AutoTokenizer.from_pretrained(pretrained)\n if max_seq_length is None:\n max_seq_length = tokenizer.model_max_length\n else:\n tokenizer.model_max_length = max_seq_length\n self.tokenizer = tokenizer\n self.max_seq_length = max_seq_length\n if truncation is not None:\n self.truncation = truncation\n if self.tokenizer.cls_token is not None:\n self.seq_length_offset += 1\n if self.tokenizer.sep_token is not None and self.tokenizer.sep_token != self.tokenizer.eos_token:\n self.seq_length_offset += 1\n if self.tokenizer.eos_token is not None:\n self.seq_length_offset += 1\n if preprocess is not None:\n self.preprocess = preprocess\n if auto_rename_sequence_col is not None:\n self.auto_rename_sequence_col = auto_rename_sequence_col\n if auto_rename_label_col is not None:\n self.auto_rename_label_col = auto_rename_label_col\n if column_names_map is None:\n column_names_map = {}\n if self.auto_rename_sequence_col:\n if len(self.sequence_cols) != 1:\n raise ValueError(\"auto_rename_sequence_col can only be used when there is exactly one sequence column.\")\n column_names_map[self.sequence_cols[0]] = defaults.SEQUENCE_COL_NAME # type: ignore[index]\n if self.auto_rename_label_col:\n if len(self.label_cols) != 1:\n raise ValueError(\"auto_rename_label_col can only be used when there is exactly one label column.\")\n column_names_map[self.label_cols[0]] = defaults.LABEL_COL_NAME # type: ignore[index]\n self.column_names_map = column_names_map\n if self.column_names_map:\n self.rename_columns(self.column_names_map)\n self.infer_tasks()\n\n if self.preprocess:\n self.update(self.map(self.tokenization))\n if self.secondary_structure_cols:\n self.update(self.map(self.convert_secondary_structure))\n if self.discrete_map:\n self.update(self.map(self.map_discrete))\n fn_kwargs = {\n \"columns\": [name for name, task in self.tasks.items() if task.level in [\"token\", \"contact\"]],\n \"max_seq_length\": self.max_seq_length - self.seq_length_offset,\n }\n if self.truncation and 0 < self.max_seq_length < 2**32:\n self.update(self.map(self.truncate, fn_kwargs=fn_kwargs))\n self.set_transform(self.transform)\n\n def transform(self, batch: Mapping) -> Mapping:\n r\"\"\"\n Default [`transform`][datasets.Dataset.set_transform].\n\n See Also:\n [`collate`][multimolecule.Dataset.collate]\n \"\"\"\n return {k: self.collate(k, v) for k, v in batch.items() if k not in self.ignored_cols}\n\n def collate(self, col: str, data: Any) -> Tensor | NestedTensor | None:\n r\"\"\"\n Collate the data for a column.\n\n If the column is a sequence column, it will tokenize the data if `tokenize` is `True`.\n Otherwise, it 
will return a tensor or nested tensor.\n \"\"\"\n if col in self.sequence_cols:\n if isinstance(data[0], str):\n data = self.tokenize(data)\n return NestedTensor(data)\n if not self.preprocess:\n if col in self.discrete_map:\n data = map_value(data, self.discrete_map[col])\n if col in self.tasks:\n data = truncate_value(data, self.max_seq_length - self.seq_length_offset, self.tasks[col].level)\n if isinstance(data[0], str):\n return data\n try:\n return torch.tensor(data)\n except ValueError:\n return NestedTensor(data)\n\n def infer_tasks(self, sequence_col: str | None = None) -> NestedDict:\n for col in self.label_cols:\n if col in self.tasks:\n continue\n if col in self.secondary_structure_cols:\n task = Task(TaskType.Binary, level=TaskLevel.Contact, num_labels=1)\n self.tasks[col] = task # type: ignore[index]\n warn(\n f\"Secondary structure columns are assumed to be {task}. \"\n \"Please explicitly specify the task if this is not the case.\"\n )\n else:\n try:\n self.tasks[col] = self.infer_task(col, sequence_col) # type: ignore[index]\n except ValueError:\n raise ValueError(f\"Unable to infer task for column {col}.\")\n return self.tasks\n\n def infer_task(self, label_col: str, sequence_col: str | None = None) -> Task:\n if sequence_col is None:\n if len(self.sequence_cols) != 1:\n raise ValueError(\"sequence_col must be specified if there are multiple sequence columns.\")\n sequence_col = self.sequence_cols[0]\n sequence = self._data.column(sequence_col)\n column = self._data.column(label_col)\n return infer_task(\n sequence,\n column,\n truncation=self.truncation,\n max_seq_length=self.max_seq_length,\n seq_length_offset=self.seq_length_offset,\n )\n\n def infer_discrete_map(self, discrete_map: Mapping | None = None):\n self._discrete_map = discrete_map or NestedDict()\n ignored_cols = set(self.discrete_map.keys()) | set(self.sequence_cols) | set(self.secondary_structure_cols)\n data_cols = [i for i in self.data_cols if i not in ignored_cols]\n for col in data_cols:\n discrete_map = infer_discrete_map(self._data.column(col))\n if discrete_map:\n self._discrete_map[col] = discrete_map # type: ignore[index]\n return self._discrete_map\n\n def __getitems__(self, keys: int | slice | Iterable[int]) -> Any:\n return self.__getitem__(keys)\n\n def identify_special_cols(\n self, feature_cols: List | None = None, label_cols: List | None = None, id_cols: List | None = None\n ) -> Sequence:\n all_cols = self.data.column_names\n self._id_cols = id_cols or [i for i in all_cols if i in defaults.ID_COL_NAMES]\n\n string_cols: list[str] = [k for k, v in self.features.items() if k not in self.id_cols and v.dtype == \"string\"]\n self._sequence_cols = [i for i in string_cols if i.lower() in defaults.SEQUENCE_COL_NAMES]\n self._secondary_structure_cols = [i for i in string_cols if i in defaults.SECONDARY_STRUCTURE_COL_NAMES]\n\n data_cols = [i for i in all_cols if i not in self.id_cols]\n if label_cols is None:\n if feature_cols is None:\n feature_cols = [i for i in data_cols if i in defaults.SEQUENCE_COL_NAMES]\n label_cols = [i for i in data_cols if i not in feature_cols]\n self._label_cols = label_cols\n if feature_cols is None:\n feature_cols = [i for i in data_cols if i not in self.label_cols]\n self._feature_cols = feature_cols\n missing_feature_cols = set(self.feature_cols).difference(data_cols)\n if missing_feature_cols:\n raise ValueError(f\"{missing_feature_cols} are specified in feature_cols, but not found in dataset.\")\n missing_label_cols = set(self.label_cols).difference(data_cols)\n 
if missing_label_cols:\n raise ValueError(f\"{missing_label_cols} are specified in label_cols, but not found in dataset.\")\n return string_cols\n\n def tokenize(self, string: str) -> Tensor:\n return self.tokenizer(string, return_attention_mask=False, truncation=self.truncation)[\"input_ids\"]\n\n def tokenization(self, data: Mapping[str, str]) -> Mapping[str, Tensor]:\n return {col: self.tokenize(data[col]) for col in self.sequence_cols}\n\n def convert_secondary_structure(self, data: Mapping) -> Mapping:\n return {col: dot_bracket_to_contact_map(data[col]) for col in self.secondary_structure_cols}\n\n def map_discrete(self, data: Mapping) -> Mapping:\n return {name: map_value(data[name], mapping) for name, mapping in self.discrete_map.items()}\n\n def truncate(self, data: Mapping, columns: List[str], max_seq_length: int) -> Mapping:\n return {name: truncate_value(data[name], max_seq_length, self.tasks[name].level) for name in columns}\n\n def update(self, dataset: datasets.Dataset):\n r\"\"\"\n Perform an in-place update of the dataset.\n\n This method is used to update the dataset after changes have been made to the underlying data.\n It updates the format columns, data, info, and fingerprint of the dataset.\n \"\"\"\n # pylint: disable=W0212\n # Why datasets won't support in-place changes?\n # It's just impossible to extend.\n self._format_columns = dataset._format_columns\n self._data = dataset._data\n self._info = dataset._info\n self._fingerprint = dataset._fingerprint\n\n def rename_columns(self, column_mapping: Mapping[str, str], new_fingerprint: str | None = None) -> datasets.Dataset:\n self.update(super().rename_columns(column_mapping, new_fingerprint=new_fingerprint))\n self._id_cols = [column_mapping.get(i, i) for i in self.id_cols]\n self._feature_cols = [column_mapping.get(i, i) for i in self.feature_cols]\n self._label_cols = [column_mapping.get(i, i) for i in self.label_cols]\n self._sequence_cols = [column_mapping.get(i, i) for i in self.sequence_cols]\n self._secondary_structure_cols = [column_mapping.get(i, i) for i in self.secondary_structure_cols]\n self.tasks = {column_mapping.get(k, k): v for k, v in self.tasks.items()}\n return self\n\n def rename_column(\n self, original_column_name: str, new_column_name: str, new_fingerprint: str | None = None\n ) -> datasets.Dataset:\n self.update(super().rename_column(original_column_name, new_column_name, new_fingerprint))\n self._id_cols = [new_column_name if i == original_column_name else i for i in self.id_cols]\n self._feature_cols = [new_column_name if i == original_column_name else i for i in self.feature_cols]\n self._label_cols = [new_column_name if i == original_column_name else i for i in self.label_cols]\n self._sequence_cols = [new_column_name if i == original_column_name else i for i in self.sequence_cols]\n self._secondary_structure_cols = [\n new_column_name if i == original_column_name else i for i in self.secondary_structure_cols\n ]\n self.tasks = {new_column_name if k == original_column_name else k: v for k, v in self.tasks.items()}\n return self\n\n def process_nan(self, data: Table, nan_process: str | None, fill_value: str | int | float = 0) -> Table:\n if nan_process == \"ignore\":\n return data\n data = data.to_pandas()\n data = data.replace([float(\"inf\"), -float(\"inf\")], float(\"nan\"))\n if data.isnull().values.any():\n if nan_process is None or nan_process == \"error\":\n raise ValueError(\"NaN / inf values have been found in the dataset.\")\n warn(\n \"NaN / inf values have been found in the 
dataset.\\n\"\n \"While we can handle them, the data type of the corresponding column may be set to float, \"\n \"which can and very likely will disrupt the auto task recognition.\\n\"\n \"It is recommended to address these values before loading the dataset.\"\n )\n if nan_process == \"drop\":\n data = data.dropna()\n elif nan_process == \"fill\":\n data = data.fillna(fill_value)\n else:\n raise ValueError(f\"Invalid nan_process: {nan_process}\")\n return pa.Table.from_pandas(data, preserve_index=False)\n\n @property\n def id_cols(self) -> List:\n return self._id_cols\n\n @property\n def data_cols(self) -> List:\n return self.feature_cols + self.label_cols\n\n @property\n def feature_cols(self) -> List:\n return self._feature_cols\n\n @property\n def label_cols(self) -> List:\n return self._label_cols\n\n @property\n def sequence_cols(self) -> List:\n return self._sequence_cols\n\n @property\n def secondary_structure_cols(self) -> List:\n return self._secondary_structure_cols\n\n @property\n def tasks(self) -> NestedDict:\n if not hasattr(self, \"_tasks\"):\n self._tasks = NestedDict()\n return self.infer_tasks()\n return self._tasks\n\n @tasks.setter\n def tasks(self, tasks: Mapping):\n self._tasks = NestedDict()\n for name, task in tasks.items():\n if not isinstance(task, Task):\n task = Task(**task)\n self._tasks[name] = task\n\n @property\n def discrete_map(self) -> Mapping:\n if not hasattr(self, \"_discrete_map\"):\n return self.infer_discrete_map()\n return self._discrete_map\n
"},{"location":"data/dataset/#multimolecule.data.Dataset(data)","title":"data
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(split)","title":"split
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(tokenizer)","title":"tokenizer
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(pretrained)","title":"pretrained
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(feature_cols)","title":"feature_cols
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(label_cols)","title":"label_cols
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(id_cols)","title":"id_cols
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(preprocess)","title":"preprocess
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(auto_rename_sequence_col)","title":"auto_rename_sequence_col
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(auto_rename_label_cols)","title":"auto_rename_label_cols
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(column_names_map)","title":"column_names_map
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(truncation)","title":"truncation
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(max_seq_length)","title":"max_seq_length
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(tasks)","title":"tasks
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(discrete_map)","title":"discrete_map
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(nan_process)","title":"nan_process
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(fill_value)","title":"fill_value
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(info)","title":"info
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(indices_table)","title":"indices_table
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset(fingerprint)","title":"fingerprint
","text":""},{"location":"data/dataset/#multimolecule.data.Dataset.post","title":"post","text":"Pythonpost(tokenizer: PreTrainedTokenizerBase | None = None, pretrained: str | None = None, max_seq_length: int | None = None, truncation: bool | None = None, preprocess: bool | None = None, auto_rename_sequence_col: bool | None = None, auto_rename_label_col: bool | None = None, column_names_map: Mapping[str, str] | None = None) -> None\n
Perform pre-processing steps after initialization.
It first identifies the special columns (sequence and structure columns) in the dataset. Then it sets the feature and label columns based on the input arguments. If auto_rename_sequence_col
is True
, it will automatically rename the sequence column. If auto_rename_label_col
is True
, it will automatically rename the label column. Finally, it sets the transform
function based on the preprocess
flag.
multimolecule/data/dataset.py
Pythondef post(\n self,\n tokenizer: PreTrainedTokenizerBase | None = None,\n pretrained: str | None = None,\n max_seq_length: int | None = None,\n truncation: bool | None = None,\n preprocess: bool | None = None,\n auto_rename_sequence_col: bool | None = None,\n auto_rename_label_col: bool | None = None,\n column_names_map: Mapping[str, str] | None = None,\n) -> None:\n r\"\"\"\n Perform pre-processing steps after initialization.\n\n It first identifies the special columns (sequence and structure columns) in the dataset.\n Then it sets the feature and label columns based on the input arguments.\n If `auto_rename_sequence_col` is `True`, it will automatically rename the sequence column.\n If `auto_rename_label_col` is `True`, it will automatically rename the label column.\n Finally, it sets the [`transform`][datasets.Dataset.set_transform] function based on the `preprocess` flag.\n \"\"\"\n if tokenizer is None:\n if pretrained is None:\n raise ValueError(\"tokenizer and pretrained can not be both None.\")\n tokenizer = AutoTokenizer.from_pretrained(pretrained)\n if max_seq_length is None:\n max_seq_length = tokenizer.model_max_length\n else:\n tokenizer.model_max_length = max_seq_length\n self.tokenizer = tokenizer\n self.max_seq_length = max_seq_length\n if truncation is not None:\n self.truncation = truncation\n if self.tokenizer.cls_token is not None:\n self.seq_length_offset += 1\n if self.tokenizer.sep_token is not None and self.tokenizer.sep_token != self.tokenizer.eos_token:\n self.seq_length_offset += 1\n if self.tokenizer.eos_token is not None:\n self.seq_length_offset += 1\n if preprocess is not None:\n self.preprocess = preprocess\n if auto_rename_sequence_col is not None:\n self.auto_rename_sequence_col = auto_rename_sequence_col\n if auto_rename_label_col is not None:\n self.auto_rename_label_col = auto_rename_label_col\n if column_names_map is None:\n column_names_map = {}\n if self.auto_rename_sequence_col:\n if len(self.sequence_cols) != 1:\n raise ValueError(\"auto_rename_sequence_col can only be used when there is exactly one sequence column.\")\n column_names_map[self.sequence_cols[0]] = defaults.SEQUENCE_COL_NAME # type: ignore[index]\n if self.auto_rename_label_col:\n if len(self.label_cols) != 1:\n raise ValueError(\"auto_rename_label_col can only be used when there is exactly one label column.\")\n column_names_map[self.label_cols[0]] = defaults.LABEL_COL_NAME # type: ignore[index]\n self.column_names_map = column_names_map\n if self.column_names_map:\n self.rename_columns(self.column_names_map)\n self.infer_tasks()\n\n if self.preprocess:\n self.update(self.map(self.tokenization))\n if self.secondary_structure_cols:\n self.update(self.map(self.convert_secondary_structure))\n if self.discrete_map:\n self.update(self.map(self.map_discrete))\n fn_kwargs = {\n \"columns\": [name for name, task in self.tasks.items() if task.level in [\"token\", \"contact\"]],\n \"max_seq_length\": self.max_seq_length - self.seq_length_offset,\n }\n if self.truncation and 0 < self.max_seq_length < 2**32:\n self.update(self.map(self.truncate, fn_kwargs=fn_kwargs))\n self.set_transform(self.transform)\n
"},{"location":"data/dataset/#multimolecule.data.Dataset.transform","title":"transform","text":"Pythontransform(batch: Mapping) -> Mapping\n
Default transform
.
collate
multimolecule/data/dataset.py
Pythondef transform(self, batch: Mapping) -> Mapping:\n r\"\"\"\n Default [`transform`][datasets.Dataset.set_transform].\n\n See Also:\n [`collate`][multimolecule.Dataset.collate]\n \"\"\"\n return {k: self.collate(k, v) for k, v in batch.items() if k not in self.ignored_cols}\n
"},{"location":"data/dataset/#multimolecule.data.Dataset.collate","title":"collate","text":"Pythoncollate(col: str, data: Any) -> Tensor | NestedTensor | None\n
Collate the data for a column.
If the column is a sequence column, it will tokenize the data if tokenize
is True
. Otherwise, it will return a tensor or nested tensor.
multimolecule/data/dataset.py
Pythondef collate(self, col: str, data: Any) -> Tensor | NestedTensor | None:\n r\"\"\"\n Collate the data for a column.\n\n If the column is a sequence column, it will tokenize the data if `tokenize` is `True`.\n Otherwise, it will return a tensor or nested tensor.\n \"\"\"\n if col in self.sequence_cols:\n if isinstance(data[0], str):\n data = self.tokenize(data)\n return NestedTensor(data)\n if not self.preprocess:\n if col in self.discrete_map:\n data = map_value(data, self.discrete_map[col])\n if col in self.tasks:\n data = truncate_value(data, self.max_seq_length - self.seq_length_offset, self.tasks[col].level)\n if isinstance(data[0], str):\n return data\n try:\n return torch.tensor(data)\n except ValueError:\n return NestedTensor(data)\n
"},{"location":"data/dataset/#multimolecule.data.Dataset.update","title":"update","text":"Pythonupdate(dataset: Dataset)\n
Perform an in-place update of the dataset.
This method is used to update the dataset after changes have been made to the underlying data. It updates the format columns, data, info, and fingerprint of the dataset.
Source code inmultimolecule/data/dataset.py
Pythondef update(self, dataset: datasets.Dataset):\n r\"\"\"\n Perform an in-place update of the dataset.\n\n This method is used to update the dataset after changes have been made to the underlying data.\n It updates the format columns, data, info, and fingerprint of the dataset.\n \"\"\"\n # pylint: disable=W0212\n # Why datasets won't support in-place changes?\n # It's just impossible to extend.\n self._format_columns = dataset._format_columns\n self._data = dataset._data\n self._info = dataset._info\n self._fingerprint = dataset._fingerprint\n
"},{"location":"datasets/","title":"datasets","text":"datasets
provides a collection of widely used datasets.
from multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"datasets/archiveii/","title":"ArchiveII","text":"ArchiveII is a dataset of RNA sequences and their secondary structures, widely used in RNA secondary structure prediction benchmarks.
ArchiveII contains 2975 RNA samples across 10 RNA families, with sequence lengths ranging from 28 to 2968 nucleotides. This dataset is frequently used to evaluate RNA secondary structure prediction methods, including those that handle both pseudoknotted and non-pseudoknotted structures.
It is considered complementary to the RNAStrAlign dataset.
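As a sketch, assuming the converted ArchiveII data is published under the multimolecule namespace on the Hugging Face Hub (mirroring the other datasets in this collection), it could be loaded with the Dataset class:
Pythonfrom multimolecule.data import Dataset\n\n# \"multimolecule/archiveii\" is an assumed Hub tag, following the naming of the other MultiMolecule datasets\ndata = Dataset(\"multimolecule/archiveii\", split=\"train\", pretrained=\"multimolecule/rna\")\n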
"},{"location":"datasets/archiveii/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the ArchiveII by Mehdi Saman Booy, et al.
The team releasing ArchiveII did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/archiveii/#dataset-description","title":"Dataset Description","text":"id: A unique identifier for each RNA entry. This ID is derived from the family and the original .sta
file name, and serves as a reference to the specific RNA structure within the dataset.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA\u2019s standard:
.
): Represent unpaired nucleotides.(
and )
): Represent base pairs in standard stems (page 1).family: The RNA family to which the sequence belongs, such as 16S rRNA, 5S rRNA, etc.
This dataset is available in two additional variants:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/archiveii/#citation","title":"Citation","text":"BibTeX@article{samanbooy2022rna,\n author = {Saman Booy, Mehdi and Ilin, Alexander and Orponen, Pekka},\n journal = {BMC Bioinformatics},\n keywords = {Deep learning; Pseudoknotted structures; RNA structure prediction},\n month = feb,\n number = 1,\n pages = {58},\n publisher = {Springer Science and Business Media LLC},\n title = {{RNA} secondary structure prediction with convolutional neural networks},\n volume = 23,\n year = 2022\n}\n
"},{"location":"datasets/bprna-new/","title":"bpRNA-1m","text":"bpRNA-new is a database of single molecule secondary structures annotated using bpRNA.
bpRNA-new is a dataset of RNA families from Rfam 14.2, designed for cross-family validation to assess generalization capability. It focuses on families distinct from those in bpRNA-1m, providing a robust benchmark for evaluating model performance on unseen RNA families.
"},{"location":"datasets/bprna-new/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the bpRNA-new by Kengo Sato, et al.
The team releasing bpRNA-new did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/bprna-new/#dataset-description","title":"Dataset Description","text":"This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/bprna-new/#citation","title":"Citation","text":"BibTeX@article{sato2021rna,\n author = {Sato, Kengo and Akiyama, Manato and Sakakibara, Yasubumi},\n journal = {Nature Communications},\n month = feb,\n number = 1,\n pages = {941},\n publisher = {Springer Science and Business Media LLC},\n title = {{RNA} secondary structure prediction using deep learning with thermodynamic integration},\n volume = 12,\n year = 2021\n}\n
"},{"location":"datasets/bprna-spot/","title":"bpRNA-1m","text":"bpRNA-spot is a database of single molecule secondary structures annotated using bpRNA.
bpRNA-spot is a subset of bpRNA-1m. It applies CD-HIT (CD-HIT-EST) to remove sequences with more than 80% sequence similarity from bpRNA-1m. It further randomly splits the remaining sequences into training, validation, and test sets with a ratio of apprxiately 8:1:1.
"},{"location":"datasets/bprna-spot/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the bpRNA-spot by Jaswinder Singh, et al.
The team releasing bpRNA-spot did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/bprna-spot/#dataset-description","title":"Dataset Description","text":"This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/bprna-spot/#citation","title":"Citation","text":"BibTeX@article{singh2019rna,\n author = {Singh, Jaswinder and Hanson, Jack and Paliwal, Kuldip and Zhou, Yaoqi},\n journal = {Nature Communications},\n month = nov,\n number = 1,\n pages = {5407},\n publisher = {Springer Science and Business Media LLC},\n title = {{RNA} secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning},\n volume = 10,\n year = 2019\n}\n\n@article{darty2009varna,\n author = {Darty, K{\\'e}vin and Denise, Alain and Ponty, Yann},\n journal = {Bioinformatics},\n month = aug,\n number = 15,\n pages = {1974--1975},\n publisher = {Oxford University Press (OUP)},\n title = {{VARNA}: Interactive drawing and editing of the {RNA} secondary structure},\n volume = 25,\n year = 2009\n}\n\n@article{berman2000protein,\n author = {Berman, H M and Westbrook, J and Feng, Z and Gilliland, G and Bhat, T N and Weissig, H and Shindyalov, I N and Bourne, P E},\n journal = {Nucleic Acids Research},\n month = jan,\n number = 1,\n pages = {235--242},\n publisher = {Oxford University Press (OUP)},\n title = {The Protein Data Bank},\n volume = 28,\n year = 2000\n}\n
"},{"location":"datasets/bprna/","title":"bpRNA-1m","text":"bpRNA-1m is a database of single molecule secondary structures annotated using bpRNA.
"},{"location":"datasets/bprna/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the bpRNA-1m by Center for Quantitative Life Sciences of the Oregon State University.
The team releasing bpRNA-1m did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/bprna/#dataset-description","title":"Dataset Description","text":"The converted dataset consists of the following columns, each providing specific information about the RNA secondary structures, consistent with the bpRNA standard:
id: A unique identifier for each RNA entry. This ID is derived from the original .sta
file name and serves as a reference to the specific RNA structure within the dataset.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA\u2019s standard:
.
): Represent unpaired nucleotides.(
and )
): Represent base pairs in standard stems (page 1).[
and ]
): Represent base pairs in pseudoknots (page 2).{
and }
): Represent base pairs in additional pseudoknots (page 3).structural_annotation: Structural annotations categorizing different regions of the RNA based on their roles within the secondary structure, consistent with bpRNA standards:
functional_annotation: Functional annotations indicating specific functional elements or regions within the RNA sequence, as defined by bpRNA:
This dataset is available in two variants:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/bprna/#citation","title":"Citation","text":"BibTeX@article{danaee2018bprna,\n author = {Danaee, Padideh and Rouches, Mason and Wiley, Michelle and Deng, Dezhong and Huang, Liang and Hendrix, David},\n journal = {Nucleic Acids Research},\n month = jun,\n number = 11,\n pages = {5381--5394},\n title = {{bpRNA}: large-scale automated annotation and analysis of {RNA} secondary structure},\n volume = 46,\n year = 2018\n}\n\n@article{cannone2002comparative,\n author = {Cannone, Jamie J and Subramanian, Sankar and Schnare, Murray N and Collett, James R and D'Souza, Lisa M and Du, Yushi and Feng, Brian and Lin, Nan and Madabusi, Lakshmi V and M{\\\"u}ller, Kirsten M and Pande, Nupur and Shang, Zhidi and Yu, Nan and Gutell, Robin R},\n copyright = {https://www.springernature.com/gp/researchers/text-and-data-mining},\n journal = {BMC Bioinformatics},\n month = jan,\n number = 1,\n pages = {2},\n publisher = {Springer Science and Business Media LLC},\n title = {The comparative {RNA} web ({CRW}) site: an online database of comparative sequence and structure information for ribosomal, intron, and other {RNAs}},\n volume = 3,\n year = 2002\n}\n\n@article{zwieb2003tmrdb,\n author = {Zwieb, Christian and Gorodkin, Jan and Knudsen, Bjarne and Burks, Jody and Wower, Jacek},\n journal = {Nucleic Acids Research},\n month = jan,\n number = 1,\n pages = {446--447},\n publisher = {Oxford University Press (OUP)},\n title = {{tmRDB} ({tmRNA} database)},\n volume = 31,\n year = 2003\n}\n\n@article{rosenblad2003srpdb,\n author = {Rosenblad, Magnus Alm and Gorodkin, Jan and Knudsen, Bjarne and Zwieb, Christian and Samuelsson, Tore},\n journal = {Nucleic Acids Research},\n month = jan,\n number = 1,\n pages = {363--364},\n publisher = {Oxford University Press (OUP)},\n title = {{SRPDB}: Signal Recognition Particle Database},\n volume = 31,\n year = 2003\n}\n\n@article{sprinzl2005compilation,\n author = {Sprinzl, Mathias and Vassilenko, Konstantin S},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {Database issue},\n pages = {D139--40},\n publisher = {Oxford University Press (OUP)},\n title = {Compilation of {tRNA} sequences and sequences of {tRNA} genes},\n volume = 33,\n year = 2005\n}\n\n@article{brown1994ribonuclease,\n author = {Brown, J W and Haas, E S and Gilbert, D G and Pace, N R},\n journal = {Nucleic Acids Research},\n month = sep,\n number = 17,\n pages = {3660--3662},\n publisher = {Oxford University Press (OUP)},\n title = {The Ribonuclease {P} database},\n volume = 22,\n year = 1994\n}\n\n@article{griffiths2003rfam,\n author = {Griffiths-Jones, Sam and Bateman, Alex and Marshall, Mhairi and Khanna, Ajay and Eddy, Sean R},\n journal = {Nucleic Acids Research},\n month = jan,\n number = 1,\n pages = {439--441},\n publisher = {Oxford University Press (OUP)},\n title = {Rfam: an {RNA} family database},\n volume = 31,\n year = 2003\n}\n\n@article{berman2000protein,\n author = {Berman, H M and Westbrook, J and Feng, Z and Gilliland, G and Bhat, T N and Weissig, H and Shindyalov, I N and Bourne, P E},\n journal = {Nucleic Acids Research},\n month = jan,\n number = 1,\n pages = {235--242},\n publisher = {Oxford University Press (OUP)},\n title = {The Protein Data Bank},\n volume = 28,\n year = 2000\n}\n
"},{"location":"datasets/eternabench-cm/","title":"EternaBench-CM","text":"EternaBench-CM is a synthetic RNA dataset comprising 12,711 RNA constructs that have been chemically mapped using SHAPE and MAP-seq methods. These RNA sequences are probed to obtain experimental data on their nucleotide reactivity, which indicates whether specific regions of the RNA are flexible or structured. The dataset provides high-resolution, large-scale data that can be used for studying RNA folding and stability.
"},{"location":"datasets/eternabench-cm/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the EternaBench-CM by Hannah K. Wayment-Steele, et al.
The team releasing EternaBench-CM did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/eternabench-cm/#dataset-description","title":"Dataset Description","text":"The dataset includes a large set of synthetic RNA sequences with experimental chemical mapping data, which provides a quantitative readout of RNA nucleotide reactivity. These data are ensemble-averaged and serve as a critical benchmark for evaluating secondary structure prediction algorithms in their ability to model RNA folding dynamics.
"},{"location":"datasets/eternabench-cm/#example-entry","title":"Example Entry","text":"index design sequence secondary_structure reactivity errors signal_to_noise 769337-1 d+m plots weaker again GGAAAAAAAAAAA\u2026 ................ [0.642,1.4853,0.1629, \u2026] [0.3181,0.4221,0.1823, \u2026] 3.227"},{"location":"datasets/eternabench-cm/#column-description","title":"Column Description","text":"id: A unique identifier for each RNA sequence entry.
design: The name given to each RNA design by contributors, used for easy reference.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA\u2019s standard:
.
): Represent unpaired nucleotides.(
and )
): Represent base pairs in standard stems (page 1).[
and ]
): Represent base pairs in pseudoknots (page 2).{
and }
): Represent base pairs in additional pseudoknots (page 3).reactivity: A list of normalized reactivity values for each nucleotide, representing the likelihood that a nucleotide is unpaired. High reactivity indicates high flexibility (unpaired regions), and low reactivity corresponds to paired or structured regions.
errors: Arrays of floating-point numbers indicating the experimental errors corresponding to the measurements in the reactivity. These values help quantify the uncertainty in the degradation rates and reactivity measurements.
signal_to_noise: The signal-to-noise ratio calculated from the reactivity and error values, providing a measure of data quality.
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/eternabench-cm/#citation","title":"Citation","text":"BibTeX@article{waymentsteele2022rna,\n author = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},\n journal = {Nature Methods},\n month = oct,\n number = 10,\n pages = {1234--1242},\n publisher = {Springer Science and Business Media LLC},\n title = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},\n volume = 19,\n year = 2022\n}\n
"},{"location":"datasets/eternabench-external/","title":"EternaBench-External","text":"EternaBench-External consists of 31 independent RNA datasets from various biological sources, including viral genomes, mRNAs, and synthetic RNAs. These sequences were probed using techniques such as SHAPE-CE, SHAPE-MaP, and DMS-MaP-seq to understand RNA secondary structures under different experimental and biological conditions. This dataset serves as a benchmark for evaluating RNA structure prediction models, with a particular focus on generalization to natural RNA molecules.
"},{"location":"datasets/eternabench-external/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the EternaBench-External by Hannah K. Wayment-Steele, et al.
The team releasing EternaBench-External did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/eternabench-external/#dataset-description","title":"Dataset Description","text":"This dataset includes RNA sequences from various biological origins, including viral genomes and mRNAs, and covers a wide range of probing methods like SHAPE-CE and icSHAPE. Each dataset entry provides sequence information, reactivity profiles, and RNA secondary structure data. This dataset can be used to examine how RNA structures vary under different conditions and to validate structural predictions for diverse RNA types.
"},{"location":"datasets/eternabench-external/#example-entry","title":"Example Entry","text":"name sequence reactivity seqpos class dataset Dadonaite,2019 Influenza genome SHAPE(1M7) SSII-Mn(2+) Mut. TTTACCCACAGCTGTGAATT\u2026 [0.639309,0.813297,0.622869,\u2026] [7425,7426,7427,\u2026] viral_gRNA Dadonaite,2019"},{"location":"datasets/eternabench-external/#column-description","title":"Column Description","text":"name: The name of the dataset entry, typically including the experimental setup and biological source.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
reactivity: A list of normalized reactivity values for each nucleotide, representing the likelihood that a nucleotide is unpaired. High reactivity indicates high flexibility (unpaired regions), and low reactivity corresponds to paired or structured regions.
seqpos: A list of sequence positions corresponding to each nucleotide in the sequence.
class: The type of RNA sequence, can be one of the following:
dataset: The source or reference for the dataset entry, indicating its origin.
This dataset is available in four variants:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/eternabench-external/#citation","title":"Citation","text":"BibTeX@article{waymentsteele2022rna,\n author = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},\n journal = {Nature Methods},\n month = oct,\n number = 10,\n pages = {1234--1242},\n publisher = {Springer Science and Business Media LLC},\n title = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},\n volume = 19,\n year = 2022\n}\n
"},{"location":"datasets/eternabench-switch/","title":"EternaBench-Switch","text":"EternaBench-Switch is a synthetic RNA dataset consisting of 7,228 riboswitch constructs, designed to explore the structural behavior of RNA molecules that change conformation upon binding to ligands such as FMN, theophylline, or tryptophan. These riboswitches exhibit different structural states in the presence or absence of their ligands, and the dataset includes detailed measurements of binding affinities (dissociation constants), activation ratios, and RNA folding properties.
"},{"location":"datasets/eternabench-switch/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the EternaBench-Switch by Hannah K. Wayment-Steele, et al.
The team releasing EternaBench-Switch did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/eternabench-switch/#dataset-description","title":"Dataset Description","text":"The dataset includes synthetic RNA sequences designed to act as riboswitches. These molecules can adopt different structural states in response to ligand binding, and the dataset provides detailed information on the binding affinities for various ligands, along with metrics on the RNA\u2019s ability to switch between conformations. With over 7,000 entries, this dataset is highly useful for studying RNA folding, ligand interaction, and RNA structural dynamics.
"},{"location":"datasets/eternabench-switch/#example-entry","title":"Example Entry","text":"id design sequence activation_ratio ligand switch kd_off kd_on kd_fmn kd_no_fmn min_kd_val ms2_aptamer lig_aptamer ms2_lig_aptamer log_kd_nolig log_kd_lig log_kd_nolig_scaled log_kd_lig_scaled log_AR folding_subscore num_clusters 286 null AGGAAACAUGAGGAU\u2026 0.8824621522 FMN OFF 13.3115 15.084 null null 3.0082 .....(((((x((xxxx)))))))..... .................. .....(((((x((xx\u2026 2.7137 2.5886 1.6123 1.4873 -0.125 null null"},{"location":"datasets/eternabench-switch/#column-description","title":"Column Description","text":"id: A unique identifier for each RNA sequence entry.
design: The name given to each RNA design by contributors, used for easy reference.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases:
activation_ratio: The ratio reflecting the RNA molecule\u2019s structural change between two states (e.g., ON and OFF) upon ligand binding.
ligand: The small molecule ligand (e.g., FMN, theophylline) that the RNA is designed to bind to, inducing the switch.
switch: A binary or categorical value indicating whether the RNA demonstrates switching behavior.
kd_off: The dissociation constant (KD) when the RNA is in the \u201cOFF\u201d state (without ligand), representing its binding affinity.
kd_on: The dissociation constant (KD) when the RNA is in the \u201cON\u201d state (with ligand), indicating its affinity after activation.
kd_fmn: The dissociation constant for the RNA binding to the FMN ligand.
kd_no_fmn: The dissociation constant when no FMN ligand is present, indicating the RNA\u2019s behavior in a ligand-free state.
min_kd_val: The minimum KD value observed across different ligand-binding conditions.
ms2_aptamer: Indicates whether the RNA contains an MS2 aptamer, a motif that binds the MS2 viral coat protein.
lig_aptamer: A flag showing the presence of an aptamer that binds the ligand (e.g., FMN), demonstrating ligand-specific binding capability.
ms2_lig_aptamer: Indicates if the RNA contains both an MS2 aptamer and a ligand-binding aptamer, potentially allowing for multifaceted binding behavior.
log_kd_nolig: The logarithmic value of the dissociation constant without the ligand.
log_kd_lig: The logarithmic value of the dissociation constant with the ligand present.
log_kd_nolig_scaled: A normalized and scaled version of log_kd_nolig for easier comparison across conditions.
log_kd_lig_scaled: A normalized and scaled version of log_kd_lig for consistency in data comparisons.
log_AR: The logarithmic scale of the activation ratio, offering a standardized measure of activation strength.
folding_subscore: A numerical score indicating how well the RNA molecule folds into the predicted structure.
num_clusters: The number of distinct structural clusters or conformations predicted for the RNA, reflecting the complexity of the folding landscape.
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/eternabench-switch/#citation","title":"Citation","text":"BibTeX@article{waymentsteele2022rna,\n author = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Strom, Alexandra I and Lee, Jeehyung and Treuille, Adrien and Becka, Alex and {Eterna Participants} and Das, Rhiju},\n journal = {Nature Methods},\n month = oct,\n number = 10,\n pages = {1234--1242},\n publisher = {Springer Science and Business Media LLC},\n title = {{RNA} secondary structure packages evaluated and improved by high-throughput experiments},\n volume = 19,\n year = 2022\n}\n
"},{"location":"datasets/gencode/","title":"GENCODE","text":"GENCODE is a comprehensive annotation project that aims to provide high-quality annotations of the human and mouse genomes. The project is part of the ENCODE (ENCyclopedia Of DNA Elements) scale-up project, which seeks to identify all functional elements in the human genome.
"},{"location":"datasets/gencode/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the GENCODE by Paul Flicek, Roderic Guigo, Manolis Kellis, Mark Gerstein, Benedict Paten, Michael Tress, Jyoti Choudhary, et al.
The team releasing GENCODE did not write this dataset card for this dataset so this dataset card has been written by the MultiMolecule team.
"},{"location":"datasets/gencode/#dataset-description","title":"Dataset Description","text":"This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/gencode/#datasets","title":"Datasets","text":"The GENCODE dataset is available in Human and Mouse:
@article{frankish2023gencode,\n author = {Frankish, Adam and Carbonell-Sala, S{\\'\\i}lvia and Diekhans, Mark and Jungreis, Irwin and Loveland, Jane E and Mudge, Jonathan M and Sisu, Cristina and Wright, James C and Arnan, Carme and Barnes, If and Banerjee, Abhimanyu and Bennett, Ruth and Berry, Andrew and Bignell, Alexandra and Boix, Carles and Calvet, Ferriol and Cerd{\\'a}n-V{\\'e}lez, Daniel and Cunningham, Fiona and Davidson, Claire and Donaldson, Sarah and Dursun, Cagatay and Fatima, Reham and Giorgetti, Stefano and Giron, Carlos Garc{\\i}a and Gonzalez, Jose Manuel and Hardy, Matthew and Harrison, Peter W and Hourlier, Thibaut and Hollis, Zoe and Hunt, Toby and James, Benjamin and Jiang, Yunzhe and Johnson, Rory and Kay, Mike and Lagarde, Julien and Martin, Fergal J and G{\\'o}mez, Laura Mart{\\'\\i}nez and Nair, Surag and Ni, Pengyu and Pozo, Fernando and Ramalingam, Vivek and Ruffier, Magali and Schmitt, Bianca M and Schreiber, Jacob M and Steed, Emily and Suner, Marie-Marthe and Sumathipala, Dulika and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and Wass, Elizabeth and Yang, Yucheng T and Yates, Andrew and Zafrulla, Zahoor and Choudhary, Jyoti S and Gerstein, Mark and Guigo, Roderic and Hubbard, Tim J P and Kellis, Manolis and Kundaje, Anshul and Paten, Benedict and Tress, Michael L and Flicek, Paul},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D942--D949},\n publisher = {Oxford University Press (OUP)},\n title = {{GENCODE}: reference annotation for the human and mouse genomes in 2023},\n volume = 51,\n year = 2023\n}\n\n@article{frankish2021gencode,\n author = {Frankish, Adam and Diekhans, Mark and Jungreis, Irwin and Lagarde, Julien and Loveland, Jane E and Mudge, Jonathan M and Sisu, Cristina and Wright, James C and Armstrong, Joel and Barnes, If and Berry, Andrew and Bignell, Alexandra and Boix, Carles and Carbonell Sala, Silvia and Cunningham, Fiona and Di Domenico, Tom{\\'a}s and Donaldson, Sarah and Fiddes, Ian T and Garc{\\'\\i}a Gir{\\'o}n, Carlos and Gonzalez, Jose Manuel and Grego, Tiago and Hardy, Matthew and Hourlier, Thibaut and Howe, Kevin L and Hunt, Toby and Izuogu, Osagie G and Johnson, Rory and Martin, Fergal J and Mart{\\'\\i}nez, Laura and Mohanan, Shamika and Muir, Paul and Navarro, Fabio C P and Parker, Anne and Pei, Baikang and Pozo, Fernando and Riera, Ferriol Calvet and Ruffier, Magali and Schmitt, Bianca M and Stapleton, Eloise and Suner, Marie-Marthe and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and Wolf, Maxim Y and Xu, Jinuri and Yang, Yucheng T and Yates, Andrew and Zerbino, Daniel and Zhang, Yan and Choudhary, Jyoti S and Gerstein, Mark and Guig{\\'o}, Roderic and Hubbard, Tim J P and Kellis, Manolis and Paten, Benedict and Tress, Michael L and Flicek, Paul},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D916--D923},\n publisher = {Oxford University Press (OUP)},\n title = {{GENCODE} 2021},\n volume = 49,\n year = 2021\n}\n\n@article{frankish2019gencode,\n author = {Frankish, Adam and Diekhans, Mark and Ferreira, Anne-Maud and Johnson, Rory and Jungreis, Irwin and Loveland, Jane and Mudge, Jonathan M and Sisu, Cristina and Wright, James and Armstrong, Joel and Barnes, If and Berry, Andrew and Bignell, Alexandra and Carbonell Sala, Silvia and Chrast, Jacqueline and Cunningham, Fiona and Di Domenico, Tom{\\'a}s and Donaldson, Sarah and Fiddes, Ian T and Garc{\\'\\i}a Gir{\\'o}n, Carlos and Gonzalez, Jose Manuel and Grego, Tiago and Hardy, Matthew and Hourlier, Thibaut 
and Hunt, Toby and Izuogu, Osagie G and Lagarde, Julien and Martin, Fergal J and Mart{\\'\\i}nez, Laura and Mohanan, Shamika and Muir, Paul and Navarro, Fabio C P and Parker, Anne and Pei, Baikang and Pozo, Fernando and Ruffier, Magali and Schmitt, Bianca M and Stapleton, Eloise and Suner, Marie-Marthe and Sycheva, Irina and Uszczynska-Ratajczak, Barbara and Xu, Jinuri and Yates, Andrew and Zerbino, Daniel and Zhang, Yan and Aken, Bronwen and Choudhary, Jyoti S and Gerstein, Mark and Guig{\\'o}, Roderic and Hubbard, Tim J P and Kellis, Manolis and Paten, Benedict and Reymond, Alexandre and Tress, Michael L and Flicek, Paul},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D766--D773},\n publisher = {Oxford University Press (OUP)},\n title = {{GENCODE} reference annotation for the human and mouse genomes},\n volume = 47,\n year = 2019\n}\n\n@article{mudge2015creating,\n author = {Mudge, Jonathan M and Harrow, Jennifer},\n copyright = {https://creativecommons.org/licenses/by/4.0},\n journal = {Mamm. Genome},\n language = {en},\n month = oct,\n number = {9-10},\n pages = {366--378},\n publisher = {Springer Science and Business Media LLC},\n title = {Creating reference gene annotation for the mouse {C57BL6/J} genome assembly},\n volume = 26,\n year = 2015\n}\n\n@article{harrow2012gencode,\n author = {Harrow, Jennifer and Frankish, Adam and Gonzalez, Jose M and Tapanari, Electra and Diekhans, Mark and Kokocinski, Felix and Aken, Bronwen L and Barrell, Daniel and Zadissa, Amonida and Searle, Stephen and Barnes, If and Bignell, Alexandra and Boychenko, Veronika and Hunt, Toby and Kay, Mike and Mukherjee, Gaurab and Rajan, Jeena and Despacio-Reyes, Gloria and Saunders, Gary and Steward, Charles and Harte, Rachel and Lin, Michael and Howald, C{\\'e}dric and Tanzer, Andrea and Derrien, Thomas and Chrast, Jacqueline and Walters, Nathalie and Balasubramanian, Suganthi and Pei, Baikang and Tress, Michael and Rodriguez, Jose Manuel and Ezkurdia, Iakes and van Baren, Jeltje and Brent, Michael and Haussler, David and Kellis, Manolis and Valencia, Alfonso and Reymond, Alexandre and Gerstein, Mark and Guig{\\'o}, Roderic and Hubbard, Tim J},\n journal = {Genome Research},\n month = sep,\n number = 9,\n pages = {1760--1774},\n title = {{GENCODE}: the reference human genome annotation for The {ENCODE} Project},\n volume = 22,\n year = 2012\n}\n\n@article{harrow2006gencode,\n author = {Harrow, Jennifer and Denoeud, France and Frankish, Adam and Reymond, Alexandre and Chen, Chao-Kung and Chrast, Jacqueline and Lagarde, Julien and Gilbert, James G R and Storey, Roy and Swarbreck, David and Rossier, Colette and Ucla, Catherine and Hubbard, Tim and Antonarakis, Stylianos E and Guigo, Roderic},\n journal = {Genome Biology},\n month = aug,\n number = {Suppl 1},\n pages = {S4.1--9},\n publisher = {Springer Nature},\n title = {{GENCODE}: producing a reference annotation for {ENCODE}},\n volume = {7 Suppl 1},\n year = 2006\n}\n
"},{"location":"datasets/rfam/","title":"Rfam","text":"Rfam is a database of structure-annotated multiple sequence alignments, covariance models and family annotation for a number of non-coding RNA, cis-regulatory and self-splicing intron families.
The seed alignments are hand curated and aligned using available sequence and structure data, and covariance models are built from these alignments using the INFERNAL v1.1.4 software suite.
The full regions list is created by searching the RFAMSEQ database using the covariance model, and then listing all hits above a family-specific threshold to the model.
Rfam is maintained by a consortium of researchers at the European Bioinformatics Institute, Sean Eddy\u2019s laboratory and Eric Nawrocki.
"},{"location":"datasets/rfam/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the Rfam by Ioanna Kalvari, Eric P. Nawrocki, Sarah W. Burge, Paul P Gardner, Sam Griffiths-Jones, et al.
The team releasing Rfam did not write this dataset card, so it has been written by the MultiMolecule team.
"},{"location":"datasets/rfam/#dataset-description","title":"Dataset Description","text":"This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
Tip
The original Rfam dataset is licensed under the CC0 1.0 Universal license and is available at Rfam.
"},{"location":"datasets/rfam/#citation","title":"Citation","text":"BibTeX@article{kalvari2021rfam,\n author = {Kalvari, Ioanna and Nawrocki, Eric P and Ontiveros-Palacios, Nancy and Argasinska, Joanna and Lamkiewicz, Kevin and Marz, Manja and Griffiths-Jones, Sam and Toffano-Nioche, Claire and Gautheret, Daniel and Weinberg, Zasha and Rivas, Elena and Eddy, Sean R and Finn, Robert D and Bateman, Alex and Petrov, Anton I},\n copyright = {http://creativecommons.org/licenses/by/4.0/},\n journal = {Nucleic Acids Research},\n language = {en},\n month = jan,\n number = {D1},\n pages = {D192--D200},\n publisher = {Oxford University Press (OUP)},\n title = {Rfam 14: expanded coverage of metagenomic, viral and {microRNA} families},\n volume = 49,\n year = 2021\n}\n\n@article{hufsky2021computational,\n author = {Hufsky, Franziska and Lamkiewicz, Kevin and Almeida, Alexandre and Aouacheria, Abdel and Arighi, Cecilia and Bateman, Alex and Baumbach, Jan and Beerenwinkel, Niko and Brandt, Christian and Cacciabue, Marco and Chuguransky, Sara and Drechsel, Oliver and Finn, Robert D and Fritz, Adrian and Fuchs, Stephan and Hattab, Georges and Hauschild, Anne-Christin and Heider, Dominik and Hoffmann, Marie and H{\\\"o}lzer, Martin and Hoops, Stefan and Kaderali, Lars and Kalvari, Ioanna and von Kleist, Max and Kmiecinski, Ren{\\'o} and K{\\\"u}hnert, Denise and Lasso, Gorka and Libin, Pieter and List, Markus and L{\\\"o}chel, Hannah F and Martin, Maria J and Martin, Roman and Matschinske, Julian and McHardy, Alice C and Mendes, Pedro and Mistry, Jaina and Navratil, Vincent and Nawrocki, Eric P and O'Toole, {\\'A}ine Niamh and Ontiveros-Palacios, Nancy and Petrov, Anton I and Rangel-Pineros, Guillermo and Redaschi, Nicole and Reimering, Susanne and Reinert, Knut and Reyes, Alejandro and Richardson, Lorna and Robertson, David L and Sadegh, Sepideh and Singer, Joshua B and Theys, Kristof and Upton, Chris and Welzel, Marius and Williams, Lowri and Marz, Manja},\n copyright = {http://creativecommons.org/licenses/by/4.0/},\n journal = {Briefings in Bioinformatics},\n month = mar,\n number = 2,\n pages = {642--663},\n publisher = {Oxford University Press (OUP)},\n title = {Computational strategies to combat {COVID-19}: useful tools to accelerate {SARS-CoV-2} and coronavirus research},\n volume = 22,\n year = 2021\n}\n\n@article{kalvari2018noncoding,\n author = {Kalvari, Ioanna and Nawrocki, Eric P and Argasinska, Joanna and Quinones-Olvera, Natalia and Finn, Robert D and Bateman, Alex and Petrov, Anton I},\n journal = {Current Protocols in Bioinformatics},\n month = jun,\n number = 1,\n pages = {e51},\n title = {Non-coding {RNA} analysis using the rfam database},\n volume = 62,\n year = 2018\n}\n\n@article{kalvari2018rfam,\n author = {Kalvari, Ioanna and Argasinska, Joanna and Quinones-Olvera,\n Natalia and Nawrocki, Eric P and Rivas, Elena and Eddy, Sean R\n and Bateman, Alex and Finn, Robert D and Petrov, Anton I},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D335--D342},\n title = {Rfam 13.0: shifting to a genome-centric resource for non-coding {RNA} families},\n volume = 46,\n year = 2018\n}\n\n@article{nawrocki2015rfam,\n author = {Nawrocki, Eric P and Burge, Sarah W and Bateman, Alex and Daub, Jennifer and Eberhardt, Ruth Y and Eddy, Sean R and Floden, Evan W and Gardner, Paul P and Jones, Thomas A and Tate, John and Finn, Robert D},\n copyright = {http://creativecommons.org/licenses/by/4.0/},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {Database 
issue},\n pages = {D130--7},\n publisher = {Oxford University Press (OUP)},\n title = {Rfam 12.0: updates to the {RNA} families database},\n volume = 43,\n year = 2015\n}\n\n@article{burge2013rfam,\n author = {Burge, Sarah W and Daub, Jennifer and Eberhardt, Ruth and Tate, John and Barquist, Lars and Nawrocki, Eric P and Eddy, Sean R and Gardner, Paul P and Bateman, Alex},\n copyright = {http://creativecommons.org/licenses/by-nc/3.0/},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {Database issue},\n pages = {D226--32},\n publisher = {Oxford University Press (OUP)},\n title = {Rfam 11.0: 10 years of {RNA} families},\n volume = 41,\n year = 2013\n}\n\n@article{gardner2011rfam,\n author = {Gardner, Paul P and Daub, Jennifer and Tate, John and Moore, Benjamin L and Osuch, Isabelle H and Griffiths-Jones, Sam and Finn, Robert D and Nawrocki, Eric P and Kolbe, Diana L and Eddy, Sean R and Bateman, Alex},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {Database issue},\n pages = {D141--5},\n title = {Rfam: Wikipedia, clans and the ``decimal'' release},\n volume = 39,\n year = 2011\n}\n\n@article{gardner2009rfam,\n author = {Gardner, Paul P and Daub, Jennifer and Tate, John G and Nawrocki, Eric P and Kolbe, Diana L and Lindgreen, Stinus and Wilkinson, Adam C and Finn, Robert D and Griffiths-Jones, Sam and Eddy, Sean R and Bateman, Alex},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {Database issue},\n pages = {D136--40},\n title = {Rfam: updates to the {RNA} families database},\n volume = 37,\n year = 2009\n}\n\n@article{daub2008rna,\n author = {Daub, Jennifer and Gardner, Paul P and Tate, John and Ramsk{\\\"o}ld, Daniel and Manske, Magnus and Scott, William G and Weinberg, Zasha and Griffiths-Jones, Sam and Bateman, Alex},\n journal = {RNA},\n month = dec,\n number = 12,\n pages = {2462--2464},\n title = {The {RNA} {WikiProject}: community annotation of {RNA} families},\n volume = 14,\n year = 2008\n}\n\n@article{griffiths2005rfam,\n author = {Griffiths-Jones, Sam and Moxon, Simon and Marshall, Mhairi and Khanna, Ajay and Eddy, Sean R. and Bateman, Alex},\n doi = {10.1093/nar/gki081},\n eprint = {https://academic.oup.com/nar/article-pdf/33/suppl\\_1/D121/7622063/gki081.pdf},\n issn = {0305-1048},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {suppl_1},\n pages = {D121-D124},\n title = {{Rfam: annotating non-coding RNAs in complete genomes}},\n url = {https://doi.org/10.1093/nar/gki081},\n volume = {33},\n year = {2005}\n}\n\n@article{griffiths2003rfam,\n author = {Griffiths-Jones, Sam and Bateman, Alex and Marshall, Mhairi and Khanna, Ajay and Eddy, Sean R.},\n doi = {10.1093/nar/gkg006},\n eprint = {https://academic.oup.com/nar/article-pdf/31/1/439/7125749/gkg006.pdf},\n issn = {0305-1048},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {1},\n pages = {439-441},\n title = {{Rfam: an RNA family database}},\n url = {https://doi.org/10.1093/nar/gkg006},\n volume = {31},\n year = {2003}\n}\n
"},{"location":"datasets/rivas/","title":"RIVAS","text":"The RIVAS dataset is a curated collection of RNA sequences and their secondary structures, designed for training and evaluating RNA secondary structure prediction methods. The dataset combines sequences from published studies and databases like Rfam, covering diverse RNA families such as tRNA, SRP RNA, and ribozymes. The secondary structure data is obtained from experimentally verified structures and consensus structures from Rfam alignments, ensuring high-quality annotations for model training and evaluation.
"},{"location":"datasets/rivas/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the RIVAS dataset by Elena Rivas, et al.
The team releasing RIVAS did not write this dataset card, so it has been written by the MultiMolecule team.
"},{"location":"datasets/rivas/#dataset-description","title":"Dataset Description","text":"The converted dataset consists of the following columns, each providing specific information about the RNA secondary structures, consistent with the bpRNA standard:
id: A unique identifier for each RNA entry. This ID is derived from the original .sta file name and serves as a reference to the specific RNA structure within the dataset.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases: A, C, G, and U.
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA\u2019s standard:
Dots (.): Represent unpaired nucleotides.
Parentheses (( and )): Represent base pairs in standard stems (page 1).
Square brackets ([ and ]): Represent base pairs in pseudoknots (page 2).
Curly braces ({ and }): Represent base pairs in additional pseudoknots (page 3).
This dataset is available in three variants:
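The dot-bracket strings described above can be converted into explicit base-pair indices with one stack per bracket type; this is a minimal illustrative sketch, not part of the dataset tooling:
Pythondef dot_bracket_pairs(structure):\n    # Return (i, j) index pairs for a dot-bracket string using (), [] and {} brackets.\n    stacks = {\"(\": [], \"[\": [], \"{\": []}\n    closing = {\")\": \"(\", \"]\": \"[\", \"}\": \"{\"}\n    pairs = []\n    for i, symbol in enumerate(structure):\n        if symbol in stacks:\n            stacks[symbol].append(i)\n        elif symbol in closing:\n            pairs.append((stacks[closing[symbol]].pop(), i))\n    return sorted(pairs)\n\nprint(dot_bracket_pairs(\"((..[[..))..]]\"))  # [(0, 9), (1, 8), (4, 13), (5, 12)]\n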
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/rivas/#citation","title":"Citation","text":"BibTeX@article{rivas2012a,\n author = {Rivas, Elena and Lang, Raymond and Eddy, Sean R},\n journal = {RNA},\n month = feb,\n number = 2,\n pages = {193--212},\n publisher = {Cold Spring Harbor Laboratory},\n title = {A range of complex probabilistic models for {RNA} secondary structure prediction that includes the nearest-neighbor model and more},\n volume = 18,\n year = 2012\n}\n
"},{"location":"datasets/rnacentral/","title":"RNAcentral","text":"RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.
The development of RNAcentral is coordinated by the European Bioinformatics Institute and is supported by Wellcome. Initial funding was provided by BBSRC.
"},{"location":"datasets/rnacentral/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the RNAcentral by the RNAcentral Consortium.
The team releasing RNAcentral did not write this dataset card, so it has been written by the MultiMolecule team.
"},{"location":"datasets/rnacentral/#dataset-description","title":"Dataset Description","text":"This dataset is available in five additional variants:
In addition to the main RNAcentral dataset, we also provide the following derived datasets:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
Tip
The original RNAcentral dataset is licensed under the CC0 1.0 Universal license and is available at RNAcentral.
"},{"location":"datasets/rnacentral/#citation","title":"Citation","text":"BibTeX@article{rnacentral2021,\n author = {{RNAcentral Consortium}},\n doi = {https://doi.org/10.1093/nar/gkaa921},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D212--D220},\n publisher = {Oxford University Press (OUP)},\n title = {{RNAcentral} 2021: secondary structure integration, improved sequence search and new member databases},\n url = {https://academic.oup.com/nar/article/49/D1/D212/5940500},\n volume = 49,\n year = 2021\n}\n\n@article{sweeney2020exploring,\n author = {Sweeney, Blake A. and Tagmazian, Arina A. and Ribas, Carlos E. and Finn, Robert D. and Bateman, Alex and Petrov, Anton I.},\n doi = {https://doi.org/10.1002/cpbi.104},\n eprint = {https://currentprotocols.onlinelibrary.wiley.com/doi/pdf/10.1002/cpbi.104},\n journal = {Current Protocols in Bioinformatics},\n keywords = {Galaxy, ncRNA, non-coding RNA, RNAcentral, RNA-seq},\n number = {1},\n pages = {e104},\n title = {Exploring Non-Coding RNAs in RNAcentral},\n url = {https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.104},\n volume = 71,\n year = 2020\n}\n\n@article{rnacentral2019,\n author = {{The RNAcentral Consortium}},\n doi = {https://doi.org/10.1093/nar/gky1034},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D221--D229},\n publisher = {Oxford University Press (OUP)},\n title = {{RNAcentral}: a hub of information for non-coding {RNA} sequences},\n url = {https://academic.oup.com/nar/article/47/D1/D221/5160993},\n volume = 47,\n year = 2019\n}\n\n@article{rnacentral2017,\n author = {{The RNAcentral Consortium} and Petrov, Anton I and Kay, Simon J E and Kalvari, Ioanna and Howe, Kevin L and Gray, Kristian A and Bruford, Elspeth A and Kersey, Paul J and Cochrane, Guy and Finn, Robert D and Bateman, Alex and Kozomara, Ana and Griffiths-Jones, Sam and Frankish, Adam and Zwieb, Christian W and Lau, Britney Y and Williams, Kelly P and Chan, Patricia Pand Lowe, Todd M and Cannone, Jamie J and Gutell, Robin and Machnicka, Magdalena A and Bujnicki, Janusz M and Yoshihama, Maki and Kenmochi, Naoya and Chai, Benli and Cole, James R and Szymanski, Maciej and Karlowski, Wojciech M and Wood, Valerie and Huala, Eva and Berardini, Tanya Z and Zhao, Yi and Chen, Runsheng and Zhu, Weimin and Paraskevopoulou, Maria D and Vlachos, Ioannis S and Hatzigeorgiou, Artemis G and Ma, Lina and Zhang, Zhang and Puetz, Joern and Stadler, Peter F and McDonald, Daniel and Basu, Siddhartha and Fey, Petra and Engel, Stacia R and Cherry, J Michael and Volders, Pieter-Jan and Mestdagh, Pieter and Wower, Jacek and Clark, Michael B and Quek, Xiu Cheng and Dinger, Marcel E},\n doi = {https://doi.org/10.1093/nar/gkw1008},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {D1},\n pages = {D128--D134},\n publisher = {Oxford University Press (OUP)},\n title = {{RNAcentral}: a comprehensive database of non-coding {RNA} sequences},\n url = {https://academic.oup.com/nar/article/45/D1/D128/2333921},\n volume = 45,\n year = 2017\n}\n\n@article{rnacentral2015,\n author = {{RNAcentral Consortium} and Petrov, Anton I and Kay, Simon J E and Gibson, Richard and Kulesha, Eugene and Staines, Dan and Bruford, Elspeth A and Wright, Mathew W and Burge, Sarah and Finn, Robert D and Kersey, Paul J and Cochrane, Guy and Bateman, Alex and Griffiths-Jones, Sam and Harrow, Jennifer and Chan, Patricia P and Lowe, Todd M and Zwieb, Christian W and Wower, Jacek and Williams, Kelly P and Hudson, Corey M and 
Gutell, Robin and Clark, Michael B and Dinger, Marcel and Quek, Xiu Cheng and Bujnicki, Janusz M and Chua, Nam-Hai and Liu, Jun and Wang, Huan and Skogerb{\\o}, Geir and Zhao, Yi and Chen, Runsheng and Zhu, Weimin and Cole, James R and Chai, Benli and Huang, Hsien-Da and Huang, His-Yuan and Cherry, J Michael and Hatzigeorgiou, Artemis and Pruitt, Kim D},\n doi = {https://doi.org/10.1093/nar/gku991},\n journal = {Nucleic Acids Research},\n month = jan,\n number = {Database issue},\n pages = {D123--D129},\n title = {{RNAcentral}: an international database of {ncRNA} sequences},\n url = {https://academic.oup.com/nar/article/43/D1/D123/2439941},\n volume = 43,\n year = 2015\n}\n\n@article{bateman2011rnacentral,\n author = {Bateman, Alex and Agrawal, Shipra and Birney, Ewan and Bruford, Elspeth A and Bujnicki, Janusz M and Cochrane, Guy and Cole, James R and Dinger, Marcel E and Enright, Anton J and Gardner, Paul P and Gautheret, Daniel and Griffiths-Jones, Sam and Harrow, Jen and Herrero, Javier and Holmes, Ian H and Huang, Hsien-Da and Kelly, Krystyna A and Kersey, Paul and Kozomara, Ana and Lowe, Todd M and Marz, Manja and Moxon, Simon andPruitt, Kim D and Samuelsson, Tore and Stadler, Peter F and Vilella, Albert J and Vogel, Jan-Hinnerk and Williams, Kelly P and Wright, Mathew W and Zwieb, Christian},\n doi = {https://doi.org/10.1261/rna.2750811},\n journal = {RNA},\n month = nov,\n number = 11,\n pages = {1941--1946},\n publisher = {Cold Spring Harbor Laboratory},\n title = {{RNAcentral}: A vision for an international database of {RNA} sequences},\n url = {https://rnajournal.cshlp.org/content/17/11/1941.long},\n volume = 17,\n year = 2011\n}\n
"},{"location":"datasets/rnastralign/","title":"RNAStrAlign","text":"RNAStrAlign is a comprehensive dataset of RNA sequences and their secondary structures.
RNAStrAlign aggregates data from multiple established RNA structure repositories, covering diverse RNA families such as 5S ribosomal RNA, tRNA, and group I introns.
It is considered complementary to the ArchiveII dataset.
"},{"location":"datasets/rnastralign/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the RNAStrAlign by Zhen Tan, et al.
The team releasing RNAStrAlign did not write this dataset card, so it has been written by the MultiMolecule team.
"},{"location":"datasets/rnastralign/#dataset-description","title":"Dataset Description","text":"id: A unique identifier for each RNA entry. This ID is derived from the family and the original .sta
file name, and serves as a reference to the specific RNA structure within the dataset.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases: A, C, G, and U.
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA\u2019s standard:
Dots (.): Represent unpaired nucleotides.
Parentheses (( and )): Represent base pairs in standard stems (page 1).
family: The RNA family to which the sequence belongs, such as 16S rRNA, 5S rRNA, etc.
subfamily: A more specific subfamily within the family, such as Actinobacteria for 16S rRNA.
Not all families have subfamilies, in which case this field will be None.
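A quick way to explore the family and subfamily columns is the Hugging Face datasets library; this is a minimal sketch, and the repository id multimolecule/rnastralign and the train split name are assumptions for illustration only:
Pythonfrom collections import Counter\n\nfrom datasets import load_dataset\n\n# NOTE: the repository id and split name below are assumptions for illustration only.\nds = load_dataset(\"multimolecule/rnastralign\", split=\"train\")\nprint(Counter(ds[\"family\"]).most_common(5))  # most frequent RNA families\nprint(ds[0][\"id\"], len(ds[0][\"sequence\"]))\n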
This dataset is available in two additional variants:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/rnastralign/#citation","title":"Citation","text":"BibTeX@article{ran2017turbofold,\n author = {Tan, Zhen and Fu, Yinghan and Sharma, Gaurav and Mathews, David H},\n journal = {Nucleic Acids Research},\n month = nov,\n number = 20,\n pages = {11570--11581},\n title = {{TurboFold} {II}: {RNA} structural alignment and secondary structure prediction informed by multiple homologs},\n volume = 45,\n year = 2017\n}\n
"},{"location":"datasets/ryos/","title":"RYOS","text":"RYOS is a database of RNA backbone stability in aqueous solution.
RYOS focuses on exploring the stability of mRNA molecules for vaccine applications. This dataset is part of a broader effort to address one of the key challenges of mRNA vaccines: degradation during shipping and storage.
"},{"location":"datasets/ryos/#statement","title":"Statement","text":"Deep learning models for predicting RNA degradation via dual crowdsourcing is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.
Machine learning has been at the forefront of the movement for free and open access to research.
We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.
The MultiMolecule team is committed to the principles of open access and open science.
We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.
Please consider signing the Statement on Nature Machine Intelligence.
"},{"location":"datasets/ryos/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL release of the RYOS by Hannah K. Wayment-Steele, et al.
The team releasing RYOS did not write this dataset card, so it has been written by the MultiMolecule team.
"},{"location":"datasets/ryos/#dataset-description","title":"Dataset Description","text":"id: A unique identifier for each RNA sequence entry.
design: The name given to each RNA design by contributors, used for easy reference.
sequence: The nucleotide sequence of the RNA molecule, represented using the standard RNA bases: A, C, G, and U.
secondary_structure: The secondary structure of the RNA represented in dot-bracket notation, using up to three types of symbols to indicate base pairing and unpaired regions, as per bpRNA\u2019s standard:
Dots (.): Represent unpaired nucleotides.
Parentheses (( and )): Represent base pairs in standard stems (page 1).
Square brackets ([ and ]): Represent base pairs in pseudoknots (page 2).
Curly braces ({ and }): Represent base pairs in additional pseudoknots (page 3).
reactivity: A list of floating-point values that provide an estimate of the likelihood of the RNA backbone being cut at each nucleotide position. These values help determine the stability of the RNA structure under various experimental conditions.
deg_pH10 and deg_Mg_pH10: Arrays of degradation rates observed under two conditions: incubation at pH 10 without and with magnesium, respectively. These values provide insight into how different conditions affect the stability of RNA molecules.
deg_50C and deg_Mg_50C: Arrays of degradation rates after incubation at 50\u00b0C, without and with magnesium. These values capture how RNA sequences respond to elevated temperatures, which is relevant for storage and transportation conditions.
*_error_* Columns: Arrays of floating-point numbers indicating the experimental errors corresponding to the measurements in the reactivity and deg_ columns. These values help quantify the uncertainty in the degradation rates and reactivity measurements.
SN_filter: A filter applied to the dataset based on the signal-to-noise ratio, indicating whether a specific sequence meets the dataset\u2019s quality criteria.
If the SN_filter is True, the sequence meets the quality criteria; otherwise, it does not.
Note that due to technical limitations, the ground truth measurements are not available for the final bases of each RNA sequence, resulting in a shorter length for the provided labels compared to the full sequence.
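Because the labels cover only a prefix of each sequence, per-nucleotide model outputs are typically truncated to the label length before a loss is computed; the tensor names and shapes below are hypothetical and purely illustrative:
Pythonimport torch\n\n# Hypothetical shapes: predictions cover the full sequence, labels only the scored prefix.\npredictions = torch.randn(107)  # one predicted degradation value per nucleotide\nlabels = torch.randn(68)  # measurements available only for the first 68 positions\nloss = torch.nn.functional.mse_loss(predictions[: labels.size(0)], labels)\nprint(loss.item())\n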
"},{"location":"datasets/ryos/#variations","title":"Variations","text":"This dataset is available in two subsets:
This dataset is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"datasets/ryos/#citation","title":"Citation","text":"BibTeX@article{waymentsteele2021deep,\n author = {Wayment-Steele, Hannah K and Kladwang, Wipapat and Watkins, Andrew M and Kim, Do Soon and Tunguz, Bojan and Reade, Walter and Demkin, Maggie and Romano, Jonathan and Wellington-Oguri, Roger and Nicol, John J and Gao, Jiayang and Onodera, Kazuki and Fujikawa, Kazuki and Mao, Hanfei and Vandewiele, Gilles and Tinti, Michele and Steenwinckel, Bram and Ito, Takuya and Noumi, Taiga and He, Shujun and Ishi, Keiichiro and Lee, Youhan and {\\\"O}zt{\\\"u}rk, Fatih and Chiu, Anthony and {\\\"O}zt{\\\"u}rk, Emin and Amer, Karim and Fares, Mohamed and Participants, Eterna and Das, Rhiju},\n journal = {ArXiv},\n month = oct,\n title = {Deep learning models for predicting {RNA} degradation via dual crowdsourcing},\n year = 2021\n}\n
"},{"location":"models/","title":"models","text":"models
provide a collection of pre-trained models.
In the transformers
library, the names of model classes can sometimes be misleading. While these classes support both regression and classification tasks, their names often include xxxForSequenceClassification
, which may imply they are only for classification.
To avoid this ambiguity, MultiMolecule provides a set of model classes with clear, intuitive names that reflect their intended use:
multimolecule.AutoModelForSequencePrediction: Sequence Prediction
multimolecule.AutoModelForTokenPrediction: Token Prediction
multimolecule.AutoModelForContactPrediction: Contact Prediction
Each of these models supports both regression and classification tasks, offering flexibility and precision for a wide range of applications.
"},{"location":"models/#contact-prediction","title":"Contact Prediction","text":"Contact prediction assign a label to each pair of token in a sentence. One of the most common contact prediction tasks is protein distance map prediction. Protein distance map prediction attempts to find the distance between all possible amino acid residue pairs of a three-dimensional protein structure
"},{"location":"models/#nucleotide-prediction","title":"Nucleotide Prediction","text":"Similar to Token Classification, but removes the <bos>
token and the <eos>
token if they are defined in the model config.
<bos>
and <eos>
tokens
In tokenizers provided by MultiMolecule, <bos>
token is pointed to <cls>
token, and <sep>
token is pointed to <eos>
token.
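A quick way to see the special tokens that nucleotide-level heads strip is to compare the raw sequence length with the tokenized length; a minimal sketch using RnaTokenizer:
Pythonfrom multimolecule import RnaTokenizer\n\ntokenizer = RnaTokenizer()\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\ninput_ids = tokenizer(sequence)[\"input_ids\"]\n# The two extra ids are the <cls>/<bos> and <eos> special tokens added around the sequence.\nprint(len(sequence), len(input_ids))  # 21 23\n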
multimolecule.AutoModel
s","text":"Pythonfrom transformers import AutoTokenizer\n\nfrom multimolecule import AutoModelForSequencePrediction\n\nmodel = AutoModelForSequencePrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"models/#direct-access","title":"Direct Access","text":"All models can be directly loaded with the from_pretrained
method.
from multimolecule.models import RnaFmForTokenPrediction, RnaTokenizer\n\nmodel = RnaFmForTokenPrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"models/#build-with-transformersautomodels","title":"Build with transformers.AutoModel
s","text":"While we use a different naming convention for model classes, the models are still registered to corresponding transformers.AutoModel
s.
from transformers import AutoModelForSequenceClassification, AutoTokenizer\n\nimport multimolecule # noqa: F401\n\nmodel = AutoModelForSequenceClassification.from_pretrained(\"multimolecule/mrnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/mrnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
import multimolecule before use
Note that you must import multimolecule before building the model using transformers.AutoModel. The registration of models is done in the multimolecule package, and the models are not available in the transformers package.
The following error will be raised if you do not import multimolecule before using transformers.AutoModel:
ValueError: The checkpoint you are trying to load has model type `rnafm` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.\n
"},{"location":"models/#initialize-a-vanilla-model","title":"Initialize a vanilla model","text":"You can also initialize a vanilla model using the model class.
Pythonfrom multimolecule.models import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n\nconfig = RnaFmConfig()\nmodel = RnaFmForTokenPrediction(config)\ntokenizer = RnaTokenizer()\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"models/#available-models","title":"Available Models","text":""},{"location":"models/#deoxyribonucleic-acid-dna","title":"DeoxyriboNucleic Acid (DNA)","text":"Pre-trained model on protein-coding DNA (cDNA) using a masked language modeling (MLM) objective.
"},{"location":"models/calm/#statement","title":"Statement","text":"Codon language embeddings provide strong signals for use in protein engineering is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.
Machine learning has been at the forefront of the movement for free and open access to research.
We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.
The MultiMolecule team is committed to the principles of open access and open science.
We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.
Please consider signing the Statement on Nature Machine Intelligence.
"},{"location":"models/calm/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the Codon language embeddings provide strong signals for use in protein engineering by Carlos Outeiral and Charlotte M. Deane.
The OFFICIAL repository of CaLM is at oxpig/CaLM.
Warning
The MultiMolecule team is unable to confirm that the provided model and checkpoints are producing the same intermediate representations as the original implementation. This is because the proposed method is published in a Closed Access / Author-Fee journal.
The team releasing CaLM did not write this model card, so it has been written by the MultiMolecule team.
"},{"location":"models/calm/#model-details","title":"Model Details","text":"CaLM is a bert-style model pre-trained on a large corpus of protein-coding DNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of DNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/calm/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 12 768 12 3072 85.75 22.36 11.17 1024"},{"location":"models/calm/#links","title":"Links","text":"The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/calm/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/calm\")\n>>> unmasker(\"agc<mask>cattatggcgaaccttggctgctg\")\n\n[{'score': 0.011160684749484062,\n 'token': 100,\n 'token_str': 'UUN',\n 'sequence': 'AGC UUN CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.01067513320595026,\n 'token': 117,\n 'token_str': 'NGC',\n 'sequence': 'AGC NGC CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.010549729689955711,\n 'token': 127,\n 'token_str': 'NNC',\n 'sequence': 'AGC NNC CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.0103579331189394,\n 'token': 51,\n 'token_str': 'CNA',\n 'sequence': 'AGC CNA CAU UAU GGC GAA CCU UGG CUG CUG'},\n {'score': 0.010322545655071735,\n 'token': 77,\n 'token_str': 'GNC',\n 'sequence': 'AGC GNC CAU UAU GGC GAA CCU UGG CUG CUG'}]\n
"},{"location":"models/calm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/calm/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, CaLmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmModel.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/calm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, CaLmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmForSequencePrediction.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/calm/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, CaLmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmForTokenPrediction.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/calm/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, CaLmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/calm\")\nmodel = CaLmForContactPrediction.from_pretrained(\"multimolecule/calm\")\n\ntext = \"GCCAGTCGCTGACAGCCGCGG\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/calm/#training-details","title":"Training Details","text":"CaLM used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 25% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/calm/#training-data","title":"Training Data","text":"The CaLM model was pre-trained coding sequences of all organisms available on the European Nucleotide Archive (ENA). European Nucleotide Archive provides a comprehensive record of the world\u2019s nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.
CaLM collected coding sequences of all organisms from ENA on April 2022, including 114,214,475 sequences. Only high level assembly information (dataclass CON) were used. Sequences matching the following criteria were filtered out:
N
, Y
, R
)ATG
To reduce redundancy, CaLM grouped the entries by organism, and apply CD-HIT (CD-HIT-EST) with a cut-off at 40% sequence identity to the translated protein sequences.
The final dataset contains 9,858,385 cDNA sequences.
Note that the alphabet in the original implementation is RNA instead of DNA, therefore, we use RnaTokenizer
to tokenize the sequences. RnaTokenizer
of multimolecule
will convert \u201cU\u201ds to \u201cT\u201ds for you, you may disable this behaviour by passing replace_T_with_U=False
.
CaLM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
<mask>
.The model was trained on 4 NVIDIA Quadro RTX4000 GPUs with 8GiB memories.
BibTeX:
BibTeX@article {outeiral2022coodn,\n author = {Outeiral, Carlos and Deane, Charlotte M.},\n title = {Codon language embeddings provide strong signals for protein engineering},\n elocation-id = {2022.12.15.519894},\n year = {2022},\n doi = {10.1101/2022.12.15.519894},\n publisher = {Cold Spring Harbor Laboratory},\n abstract = {Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with recent models{\\textquoteright} capacities surpassing the size of the very datasets they were trained on. Here, we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks. In some tasks, like species recognition, prediction of protein and transcript abundance, or melting point estimation, we show that a language model trained on codons outperforms every other published protein language model, including some that contain over 50 times more parameters. These results suggest that, in addition to commonly studied scale and model complexity, the information content of biological data provides an orthogonal direction to improve the power of machine learning in biology.Competing Interest StatementThe authors have declared no competing interest.},\n URL = {https://www.biorxiv.org/content/early/2022/12/19/2022.12.15.519894},\n eprint = {https://www.biorxiv.org/content/early/2022/12/19/2022.12.15.519894.full.pdf},\n journal = {bioRxiv}\n}\n
"},{"location":"models/calm/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the CaLM paper for questions or comments on the paper/model.
"},{"location":"models/calm/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/calm/#multimolecule.models.calm","title":"multimolecule.models.calm","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/calm/#multimolecule.models.calm.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig","title":"CaLmConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a CaLmModel
. It is used to instantiate a CaLM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the CaLM oxpig/CaLM architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint
Vocabulary size of the CaLM model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [CaLmModel
].
131
int
Dimensionality of the encoder layers and the pooler layer.
768
int
Number of hidden layers in the Transformer encoder.
12
int
Number of attention heads for each attention layer in the Transformer encoder.
12
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
3072
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
1026
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
str
Type of position embedding. Choose one of \"absolute\"
, \"relative_key\"
, \"relative_key_query\", \"rotary\"
. For positional embeddings use \"absolute\"
. For more information on \"relative_key\"
, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\"
, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
'rotary'
bool
Whether the model is used as a decoder or not. If False
, the model is used as an encoder.
False
bool
Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True
.
True
bool
Whether to apply layer normalization after embeddings but before the main stem of the network.
False
bool
When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
False
Examples:
Python Console Session>>> from multimolecule import CaLmModel, CaLmConfig\n>>> # Initializing a CaLM multimolecule/calm style configuration\n>>> configuration = CaLmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/calm style configuration\n>>> model = CaLmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/calm/configuration_calm.py
Pythonclass CaLmConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`CaLmModel`][multimolecule.models.CaLmModel]. It\n is used to instantiate a CaLM model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the CaLM\n [oxpig/CaLM](https://github.com/oxpig/CaLM) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the CaLM model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`CaLmModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n position_embedding_type:\n Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n is_decoder:\n Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n use_cache:\n Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n relevant if `config.is_decoder=True`.\n emb_layer_norm_before:\n Whether to apply layer normalization after embeddings but before the main stem of the network.\n token_dropout:\n When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n Examples:\n >>> from multimolecule import CaLmModel, CaLmConfig\n >>> # Initializing a CaLM multimolecule/calm style configuration\n >>> configuration = CaLmConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/calm style configuration\n >>> model = CaLmModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"calm\"\n\n def __init__(\n self,\n vocab_size: int = 131,\n codon: bool = True,\n hidden_size: int = 768,\n num_hidden_layers: int = 12,\n num_attention_heads: int = 12,\n intermediate_size: int = 3072,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1026,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"rotary\",\n is_decoder: bool = False,\n use_cache: bool = True,\n emb_layer_norm_before: bool = False,\n token_dropout: bool = False,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.codon = codon\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.emb_layer_norm_before = emb_layer_norm_before\n self.token_dropout = token_dropout\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(position_embedding_type)","title":"position_embedding_type
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(is_decoder)","title":"is_decoder
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(use_cache)","title":"use_cache
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(emb_layer_norm_before)","title":"emb_layer_norm_before
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmConfig(token_dropout)","title":"token_dropout
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmForContactPrediction","title":"CaLmForContactPrediction","text":" Bases: CaLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import CaLmConfig, CaLmForContactPrediction, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py
Pythonclass CaLmForContactPrediction(CaLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import CaLmConfig, CaLmForContactPrediction, RnaTokenizer\n >>> config = CaLmConfig()\n >>> model = CaLmForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: CaLmConfig):\n super().__init__(config)\n self.calm = CaLmModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.calm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
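Building on the example above, the sketch below turns the contact logits into a symmetric probability map; the sigmoid, symmetrisation and the 0.5 threshold are illustrative post-processing choices, not part of the model itself:
Pythonimport torch\nfrom multimolecule import CaLmConfig, CaLmForContactPrediction, RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nmodel = CaLmForContactPrediction(CaLmConfig())  # random weights, for illustration only\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\nwith torch.no_grad():\n    logits = model(**input)[\"logits\"].squeeze(-1)  # (1, 5, 5), one score per nucleotide pair\nprobs = logits.sigmoid()\ncontacts = (probs + probs.transpose(-1, -2)) / 2  # make the predicted map symmetric\nprint((contacts > 0.5).sum())\n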
"},{"location":"models/calm/#multimolecule.models.calm.CaLmForMaskedLM","title":"CaLmForMaskedLM","text":" Bases: CaLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import CaLmConfig, CaLmForMaskedLM, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 131])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py
Pythonclass CaLmForMaskedLM(CaLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import CaLmConfig, CaLmForMaskedLM, RnaTokenizer\n >>> config = CaLmConfig()\n >>> model = CaLmForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 131])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: CaLmConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `CaLmForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.calm = CaLmModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config, self.calm.embeddings.word_embeddings.weight)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.calm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
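A small follow-up sketch of filling a masked position with this head; with randomly initialised weights the prediction is meaningless, so a trained checkpoint would be loaded in practice:
Pythonimport torch\nfrom multimolecule import CaLmConfig, CaLmForMaskedLM, RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nmodel = CaLmForMaskedLM(CaLmConfig())  # random weights, for illustration only\ninput = tokenizer(\"ACG<mask>AUG\", return_tensors=\"pt\")\nwith torch.no_grad():\n    logits = model(**input)[\"logits\"]\nmask_positions = (input[\"input_ids\"] == tokenizer.mask_token_id).nonzero(as_tuple=True)\npredicted_ids = logits[mask_positions].argmax(-1)\nprint(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))\n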
"},{"location":"models/calm/#multimolecule.models.calm.CaLmForSequencePrediction","title":"CaLmForSequencePrediction","text":" Bases: CaLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import CaLmConfig, CaLmForSequencePrediction, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py
Pythonclass CaLmForSequencePrediction(CaLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import CaLmConfig, CaLmForSequencePrediction, RnaTokenizer\n >>> config = CaLmConfig()\n >>> model = CaLmForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: CaLmConfig):\n super().__init__(config)\n self.calm = CaLmModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.calm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
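As a rough sketch of fine-tuning, a single optimisation step on one binary-labelled sequence might look as follows (the optimiser and learning rate are arbitrary):
Pythonimport torch\nfrom multimolecule import CaLmConfig, CaLmForSequencePrediction, RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nmodel = CaLmForSequencePrediction(CaLmConfig())\noptimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # arbitrary choice\n\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\noutput = model(**input, labels=torch.tensor([[1]]))\noutput[\"loss\"].backward()\noptimizer.step()\noptimizer.zero_grad()\n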
"},{"location":"models/calm/#multimolecule.models.calm.CaLmForTokenPrediction","title":"CaLmForTokenPrediction","text":" Bases: CaLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import CaLmConfig, CaLmForTokenPrediction, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/calm/modeling_calm.py
Pythonclass CaLmForTokenPrediction(CaLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import CaLmConfig, CaLmForTokenPrediction, RnaTokenizer\n >>> config = CaLmConfig()\n >>> model = CaLmForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: CaLmConfig):\n super().__init__(config)\n self.calm = CaLmModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.calm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
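The logits above carry one score per nucleotide, so a minimal sketch of per-token probabilities is simply a sigmoid over the last dimension:
Pythonimport torch\nfrom multimolecule import CaLmConfig, CaLmForTokenPrediction, RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nmodel = CaLmForTokenPrediction(CaLmConfig())  # random weights, for illustration only\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\nwith torch.no_grad():\n    logits = model(**input)[\"logits\"]  # (1, 5, 1): special tokens are already stripped by the head\nprobs = logits.sigmoid().squeeze(-1)\nprint(probs.shape)  # torch.Size([1, 5])\n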
"},{"location":"models/calm/#multimolecule.models.calm.CaLmModel","title":"CaLmModel","text":" Bases: CaLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import CaLmConfig, CaLmModel, RnaTokenizer\n>>> config = CaLmConfig()\n>>> model = CaLmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/calm/modeling_calm.py
Pythonclass CaLmModel(CaLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import CaLmConfig, CaLmModel, RnaTokenizer\n >>> config = CaLmConfig()\n >>> model = CaLmModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 768])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 768])\n \"\"\"\n\n def __init__(self, config: CaLmConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = CaLmEmbeddings(config)\n self.encoder = CaLmEncoder(config)\n self.pooler = CaLmPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
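Beyond the last hidden state shown above, the encoder can also return all intermediate representations; a brief sketch (layer counts and shapes assume the default configuration):
Pythonimport torch\nfrom multimolecule import CaLmConfig, CaLmModel, RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nmodel = CaLmModel(CaLmConfig())\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\nwith torch.no_grad():\n    output = model(**input, output_hidden_states=True, output_attentions=True)\nprint(len(output[\"hidden_states\"]))  # embedding output plus one entry per layer: 13 with 12 hidden layers\nprint(output[\"attentions\"][0].shape)  # torch.Size([1, 12, 7, 7]): (batch, heads, seq, seq)\n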
"},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name Type Description DefaultTensor | None
Shape: (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
None
Tensor | None
Shape: (batch_size, sequence_length)
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
None
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None
Tuple of length config.n_layers
with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1)
instead of all decoder_input_ids
of shape (batch_size, sequence_length)
.
None
bool | None
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see past_key_values
).
None
Source code in multimolecule/models/calm/modeling_calm.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
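As the code above shows, passing return_dict=False makes forward return a plain tuple instead of a model-output object; a short sketch:
Pythonimport torch\nfrom multimolecule import CaLmConfig, CaLmModel, RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nmodel = CaLmModel(CaLmConfig())\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\nwith torch.no_grad():\n    sequence_output, pooled_output = model(**input, return_dict=False)[:2]\nprint(sequence_output.shape)  # torch.Size([1, 7, 768])\nprint(pooled_output.shape)  # torch.Size([1, 768])\n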
"},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/calm/#multimolecule.models.calm.CaLmPreTrainedModel","title":"CaLmPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/calm/modeling_calm.py
Pythonclass CaLmPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = CaLmConfig\n base_model_prefix = \"calm\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"CaLmLayer\", \"CaLmEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
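Because CaLmPreTrainedModel derives from PreTrainedModel, every CaLM class above supports the standard from_pretrained/save_pretrained workflow; a sketch, assuming the multimolecule/calm checkpoint referenced in the configuration docs is reachable (otherwise point to a local directory):
Pythonfrom multimolecule import CaLmForMaskedLM\n\n# assumes the multimolecule/calm weights are available; substitute a local path if not\nmodel = CaLmForMaskedLM.from_pretrained(\"multimolecule/calm\")\nmodel.save_pretrained(\"./calm-checkpoint\")\n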
"},{"location":"models/configuration_utils/","title":"configuration_utils","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils","title":"multimolecule.models.configuration_utils","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig","title":"HeadConfig","text":" Bases: BaseHeadConfig
Configuration class for a prediction head.
Parameters:
Name Type Description DefaultNumber of labels to use in the last layer added to the model, typically for a classification task.
Head should look for Config.num_labels
if is None
.
Problem type for XxxForYyyPrediction
models. Can be one of \"binary\"
, \"regression\"
, \"multiclass\"
or \"multilabel\"
.
Head should look for Config.problem_type
if is None
.
Dimensionality of the encoder layers and the pooler layer.
Head should look for Config.hidden_size
if is None
.
The dropout ratio for the hidden states.
requiredThe transform operation applied to hidden states.
requiredThe activation function of transform applied to hidden states.
requiredWhether to apply bias to the final prediction layer.
requiredThe activation function of the final prediction output.
requiredThe epsilon used by the layer normalization layers.
requiredThe name of the tensor required in model outputs.
If is None
, will use the default output name of the corresponding head.
The type of the head in the model.
This is used by MultiMoleculeModel
to construct heads.
multimolecule/module/heads/config.py
Pythonclass HeadConfig(BaseHeadConfig):\n r\"\"\"\n Configuration class for a prediction head.\n\n Args:\n num_labels:\n Number of labels to use in the last layer added to the model, typically for a classification task.\n\n Head should look for [`Config.num_labels`][multimolecule.PreTrainedConfig] if is `None`.\n problem_type:\n Problem type for `XxxForYyyPrediction` models. Can be one of `\"binary\"`, `\"regression\"`,\n `\"multiclass\"` or `\"multilabel\"`.\n\n Head should look for [`Config.problem_type`][multimolecule.PreTrainedConfig] if is `None`.\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n\n Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n dropout:\n The dropout ratio for the hidden states.\n transform:\n The transform operation applied to hidden states.\n transform_act:\n The activation function of transform applied to hidden states.\n bias:\n Whether to apply bias to the final prediction layer.\n act:\n The activation function of the final prediction output.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n output_name:\n The name of the tensor required in model outputs.\n\n If is `None`, will use the default output name of the corresponding head.\n type:\n The type of the head in the model.\n\n This is used by [`MultiMoleculeModel`][multimolecule.MultiMoleculeModel] to construct heads.\n \"\"\"\n\n num_labels: Optional[int] = None\n problem_type: Optional[str] = None\n hidden_size: Optional[int] = None\n dropout: float = 0.0\n transform: Optional[str] = None\n transform_act: Optional[str] = \"gelu\"\n bias: bool = True\n act: Optional[str] = None\n layer_norm_eps: float = 1e-12\n output_name: Optional[str] = None\n type: Optional[str] = None\n
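A brief sketch of supplying a HeadConfig to one of the model configurations; the label count and problem type are arbitrary, and the options are passed as a plain dict because the model configuration rebuilds it into a HeadConfig internally:
Pythonfrom multimolecule import CaLmConfig, CaLmForSequencePrediction, RnaTokenizer\n\nconfig = CaLmConfig(head=dict(num_labels=3, problem_type=\"multiclass\"))  # arbitrary 3-class task\nmodel = CaLmForSequencePrediction(config)\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\nprint(model(**input)[\"logits\"].shape)  # expected torch.Size([1, 3]) with this head\n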
"},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(num_labels)","title":"num_labels
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(problem_type)","title":"problem_type
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(dropout)","title":"dropout
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(transform)","title":"transform
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(transform_act)","title":"transform_act
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(bias)","title":"bias
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(act)","title":"act
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(output_name)","title":"output_name
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.HeadConfig(type)","title":"type
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig","title":"MaskedLMHeadConfig","text":" Bases: BaseHeadConfig
Configuration class for a Masked Language Modeling head.
Parameters:
Name Type Description DefaultDimensionality of the encoder layers and the pooler layer.
Head should look for Config.hidden_size
if is None
.
The dropout ratio for the hidden states.
requiredThe transform operation applied to hidden states.
requiredThe activation function of transform applied to hidden states.
requiredWhether to apply bias to the final prediction layer.
requiredThe activation function of the final prediction output.
requiredThe epsilon used by the layer normalization layers.
requiredThe name of the tensor required in model outputs.
If is None
, will use the default output name of the corresponding head.
multimolecule/module/heads/config.py
Pythonclass MaskedLMHeadConfig(BaseHeadConfig):\n r\"\"\"\n Configuration class for a Masked Language Modeling head.\n\n Args:\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n\n Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n dropout:\n The dropout ratio for the hidden states.\n transform:\n The transform operation applied to hidden states.\n transform_act:\n The activation function of transform applied to hidden states.\n bias:\n Whether to apply bias to the final prediction layer.\n act:\n The activation function of the final prediction output.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n output_name:\n The name of the tensor required in model outputs.\n\n If is `None`, will use the default output name of the corresponding head.\n \"\"\"\n\n hidden_size: Optional[int] = None\n dropout: float = 0.0\n transform: Optional[str] = \"nonlinear\"\n transform_act: Optional[str] = \"gelu\"\n bias: bool = True\n act: Optional[str] = None\n layer_norm_eps: float = 1e-12\n output_name: Optional[str] = None\n
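Similarly, a MaskedLMHeadConfig can be injected through the lm_head argument of a model configuration; the values below are only meant to show the mechanism:
Pythonfrom multimolecule import CaLmConfig, CaLmForMaskedLM\n\nconfig = CaLmConfig(lm_head=dict(dropout=0.1, transform=\"nonlinear\"))  # illustrative values\nmodel = CaLmForMaskedLM(config)\nprint(config.lm_head)\n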
"},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(dropout)","title":"dropout
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(transform)","title":"transform
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(transform_act)","title":"transform_act
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(bias)","title":"bias
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(act)","title":"act
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.MaskedLMHeadConfig(output_name)","title":"output_name
","text":""},{"location":"models/configuration_utils/#multimolecule.models.configuration_utils.PreTrainedConfig","title":"PreTrainedConfig","text":" Bases: PretrainedConfig
Base class for all model configuration classes.
Source code in multimolecule/models/configuration_utils.py
Pythonclass PreTrainedConfig(PretrainedConfig):\n r\"\"\"\n Base class for all model configuration classes.\n \"\"\"\n\n head: HeadConfig | None\n num_labels: int = 1\n\n hidden_size: int\n\n pad_token_id: int = 0\n bos_token_id: int = 1\n eos_token_id: int = 2\n unk_token_id: int = 3\n mask_token_id: int = 4\n null_token_id: int = 5\n\n def __init__(\n self,\n pad_token_id: int = 0,\n bos_token_id: int = 1,\n eos_token_id: int = 2,\n unk_token_id: int = 3,\n mask_token_id: int = 4,\n null_token_id: int = 5,\n num_labels: int = 1,\n **kwargs,\n ):\n super().__init__(\n pad_token_id=pad_token_id,\n bos_token_id=bos_token_id,\n eos_token_id=eos_token_id,\n unk_token_id=unk_token_id,\n mask_token_id=mask_token_id,\n null_token_id=null_token_id,\n num_labels=num_labels,\n **kwargs,\n )\n\n def to_dict(self):\n output = super().to_dict()\n for k, v in output.items():\n if hasattr(v, \"to_dict\"):\n output[k] = v.to_dict()\n if is_dataclass(v):\n output[k] = asdict(v)\n return output\n
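The to_dict override above also serialises nested head configurations, so the whole configuration stays JSON-friendly; a minimal sketch:
Pythonfrom multimolecule import CaLmConfig\n\nconfig = CaLmConfig(head=dict(num_labels=2))\nd = config.to_dict()\nprint(type(d[\"head\"]))  # plain dict, ready for JSON serialisation\nprint(d[\"pad_token_id\"], d[\"mask_token_id\"])  # 0 4 with the defaults shown above\n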
"},{"location":"models/ernierna/","title":"ERNIE-RNA","text":"Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.
"},{"location":"models/ernierna/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations by Weijie Yin, Zhaoyu Zhang, Liang He, et al.
The OFFICIAL repository of ERNIE-RNA is at Bruce-ywj/ERNIE-RNA.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing ERNIE-RNA did not write a model card for this model, so this model card has been written by the MultiMolecule team.
"},{"location":"models/ernierna/#model-details","title":"Model Details","text":"ERNIE-RNA is a bert-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/ernierna/#variations","title":"Variations","text":"multimolecule/ernierna
: The ERNIE-RNA model pre-trained on non-coding RNA sequences.multimolecule/ernierna-ss
: The ERNIE-RNA model fine-tuned on RNA secondary structure prediction.The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/ernierna/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/ernierna\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.32839149236679077,\n 'token': 6,\n 'token_str': 'A',\n 'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.3044775426387787,\n 'token': 9,\n 'token_str': 'U',\n 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.09914574027061462,\n 'token': 7,\n 'token_str': 'C',\n 'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.09502048045396805,\n 'token': 24,\n 'token_str': '-',\n 'sequence': 'G G U C - C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.06993662565946579,\n 'token': 21,\n 'token_str': '.',\n 'sequence': 'G G U C. C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/ernierna/#downstream-use","title":"Downstream Use","text":""},{"location":"models/ernierna/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, ErnieRnaModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaModel.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/ernierna/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, ErnieRnaForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaForSequencePrediction.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/ernierna/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, ErnieRnaForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaForTokenPrediction.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/ernierna/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, ErnieRnaForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/ernierna\")\nmodel = ErnieRnaForContactPrediction.from_pretrained(\"multimolecule/ernierna\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/ernierna/#training-details","title":"Training Details","text":"ERNIE-RNA used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/ernierna/#training-data","title":"Training Data","text":"The ERNIE-RNA model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.
ERNIE-RNA applied CD-HIT (CD-HIT-EST) with a cut-off at 100% sequence identity to remove redundancy from the RNAcentral, resulting 25 million unique sequences. Sequences longer than 1024 nucleotides were subsequently excluded. The final dataset contains 20.4 million non-redundant RNA sequences. ERNIE-RNA preprocessed all tokens by replacing \u201cT\u201ds with \u201cS\u201ds.
Note that RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you, you may disable this behaviour by passing replace_T_with_U=False
.
ERNIE-RNA used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
<mask>
.The model was trained on 24 NVIDIA V100 GPUs with 32GiB memories.
BibTeX:
BibTeX@article {Yin2024.03.17.585376,\n author = {Yin, Weijie and Zhang, Zhaoyu and He, Liang and Jiang, Rui and Zhang, Shuo and Liu, Gan and Zhang, Xuegong and Qin, Tao and Xie, Zhen},\n title = {ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations},\n elocation-id = {2024.03.17.585376},\n year = {2024},\n doi = {10.1101/2024.03.17.585376},\n publisher = {Cold Spring Harbor Laboratory},\n abstract = {With large amounts of unlabeled RNA sequences data produced by high-throughput sequencing technologies, pre-trained RNA language models have been developed to estimate semantic space of RNA molecules, which facilities the understanding of grammar of RNA language. However, existing RNA language models overlook the impact of structure when modeling the RNA semantic space, resulting in incomplete feature extraction and suboptimal performance across various downstream tasks. In this study, we developed a RNA pre-trained language model named ERNIE-RNA (Enhanced Representations with base-pairing restriction for RNA modeling) based on a modified BERT (Bidirectional Encoder Representations from Transformers) by incorporating base-pairing restriction with no MSA (Multiple Sequence Alignment) information. We found that the attention maps from ERNIE-RNA with no fine-tuning are able to capture RNA structure in the zero-shot experiment more precisely than conventional methods such as fine-tuned RNAfold and RNAstructure, suggesting that the ERNIE-RNA can provide comprehensive RNA structural representations. Furthermore, ERNIE-RNA achieved SOTA (state-of-the-art) performance after fine-tuning for various downstream tasks, including RNA structural and functional predictions. In summary, our ERNIE-RNA model provides general features which can be widely and effectively applied in various subsequent research tasks. Our results indicate that introducing key knowledge-based prior information in the BERT framework may be a useful strategy to enhance the performance of other language models.Competing Interest StatementOne patent based on the study was submitted by Z.X. and W.Y., which is entitled as \"A Pre-training Approach for RNA Sequences and Its Applications\"(application number, no 202410262527.5). The remaining authors declare no competing interests.},\n URL = {https://www.biorxiv.org/content/early/2024/03/17/2024.03.17.585376},\n eprint = {https://www.biorxiv.org/content/early/2024/03/17/2024.03.17.585376.full.pdf},\n journal = {bioRxiv}\n}\n
"},{"location":"models/ernierna/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the ERNIE-RNA paper for questions or comments on the paper/model.
"},{"location":"models/ernierna/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna","title":"multimolecule.models.ernierna","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig","title":"ErnieRnaConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a ErnieRnaModel
. It is used to instantiate a ErnieRna model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ErnieRna Bruce-ywj/ERNIE-RNA architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint
Vocabulary size of the ErnieRna model. Defines the number of different tokens that can be represented by the input_ids passed when calling ErnieRnaModel.
26
int
Dimensionality of the encoder layers and the pooler layer.
768
int
Number of hidden layers in the Transformer encoder.
12
int
Number of attention heads for each attention layer in the Transformer encoder.
12
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
3072
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
1026
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
Examples:
Python Console Session>>> from multimolecule import ErnieRnaModel, ErnieRnaConfig\n>>> # Initializing a ERNIE-RNA multimolecule/ernierna style configuration\n>>> configuration = ErnieRnaConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/ernierna style configuration\n>>> model = ErnieRnaModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/ernierna/configuration_ernierna.py
Pythonclass ErnieRnaConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a\n [`ErnieRnaModel`][multimolecule.models.ErnieRnaModel]. It is used to instantiate a ErnieRna model according to the\n specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a\n similar configuration to that of the ErnieRna [Bruce-ywj/ERNIE-RNA](https://github.com/Bruce-ywj/ERNIE-RNA)\n architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the ErnieRna model. Defines the number of different tokens that can be represented by\n the `inputs_ids` passed when calling [`ErnieRnaModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n\n Examples:\n >>> from multimolecule import ErnieRnaModel, ErnieRnaConfig\n >>> # Initializing a ERNIE-RNA multimolecule/ernierna style configuration\n >>> configuration = ErnieRnaConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/ernierna style configuration\n >>> model = ErnieRnaModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"ernierna\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n hidden_size: int = 768,\n num_hidden_layers: int = 12,\n num_attention_heads: int = 12,\n intermediate_size: int = 3072,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1026,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"sinusoidal\",\n pairwise_alpha: float = 0.8,\n is_decoder: bool = False,\n use_cache: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.type_vocab_size = 2\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.pairwise_alpha = pairwise_alpha\n 
self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
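Beyond the defaults, any of the arguments in the signature above can be overridden to define a smaller or larger architecture. The values below are illustrative rather than a recommended setting:
Pythonfrom multimolecule import ErnieRnaConfig, ErnieRnaModel\n\n# a scaled-down configuration for quick experiments (illustrative values)\nconfig = ErnieRnaConfig(\n    num_hidden_layers=6,\n    hidden_size=384,\n    num_attention_heads=6,\n    intermediate_size=1536,\n)\nmodel = ErnieRnaModel(config)\nprint(config.vocab_size)  # 26, the default RNA vocabulary size\n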
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForContactClassification","title":"ErnieRnaForContactClassification","text":" Bases: ErnieRnaForPreTraining
Examples:
Python Console Session>>> from multimolecule.models import ErnieRnaConfig, ErnieRnaForContactClassification, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForContactClassification(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaForContactClassification(ErnieRnaForPreTraining):\n \"\"\"\n Examples:\n >>> from multimolecule.models import ErnieRnaConfig, ErnieRnaForContactClassification, RnaTokenizer\n >>> config = ErnieRnaConfig()\n >>> model = ErnieRnaForContactClassification(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n \"\"\"\n\n def __init__(self, config: ErnieRnaConfig):\n super().__init__(config)\n self.ss_head = ErnieRnaContactClassificationHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward( # type: ignore[override] # pylint: disable=W0221\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels_lm: Tensor | None = None,\n labels_ss: Tensor | None = None,\n output_attentions: bool | None = None,\n output_attention_biases: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ErnieRnaForContactClassificationOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n outputs = self.ernierna(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_attention_biases=output_attention_biases,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output_lm = self.lm_head(outputs, labels_lm)\n output_ss = self.ss_head(outputs[-1][-1], attention_mask, input_ids, labels_ss)\n logits_lm, loss_lm = output_lm.logits, output_lm.loss\n logits_ss, loss_ss = output_ss.logits, output_ss.loss\n\n loss = None\n if loss_lm is not None and loss_ss is not None:\n loss = loss_lm + loss_ss\n elif loss_lm is not None:\n loss = loss_lm\n elif loss_ss is not None:\n loss = loss_ss\n\n if not return_dict:\n output = outputs[2:]\n output = ((logits_ss, loss_ss) + output) if loss_ss is not None else ((logits_ss,) + output)\n output = ((logits_lm, loss_lm) + output) if loss_lm is not None else ((logits_lm,) + output)\n return ((loss,) + output) if loss is not None else output\n\n return ErnieRnaForContactClassificationOutput(\n loss=loss,\n logits_lm=logits_lm,\n loss_lm=loss_lm,\n logits_ss=logits_ss,\n loss_ss=loss_ss,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n attention_biases=outputs.attention_biases,\n )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForContactPrediction","title":"ErnieRnaForContactPrediction","text":" Bases: ErnieRnaPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import ErnieRnaConfig, ErnieRnaForContactPrediction, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaForContactPrediction(ErnieRnaPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import ErnieRnaConfig, ErnieRnaForContactPrediction, RnaTokenizer\n >>> config = ErnieRnaConfig()\n >>> model = ErnieRnaForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: ErnieRnaConfig):\n super().__init__(config)\n self.ernierna = ErnieRnaModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ErnieRnaContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.ernierna(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ErnieRnaContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForMaskedLM","title":"ErnieRnaForMaskedLM","text":" Bases: ErnieRnaPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import ErnieRnaConfig, ErnieRnaForMaskedLM, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaForMaskedLM(ErnieRnaPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import ErnieRnaConfig, ErnieRnaForMaskedLM, RnaTokenizer\n >>> config = ErnieRnaConfig()\n >>> model = ErnieRnaForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: ErnieRnaConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `BertForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.ernierna = ErnieRnaModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_output_embeddings(self):\n return self.lm_head.decoder\n\n def set_output_embeddings(self, new_embeddings):\n self.lm_head.decoder = new_embeddings\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_attention_biases: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ErnieRnaForMaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.ernierna(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_attention_biases=output_attention_biases,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ErnieRnaForMaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
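To recover a prediction for a masked position, locate the <mask> token in the encoded input and take the argmax over the vocabulary dimension of the logits. A minimal sketch with randomly initialised weights, so the predicted token is not meaningful:
Pythonimport torch\nfrom multimolecule import ErnieRnaConfig, ErnieRnaForMaskedLM, RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nmodel = ErnieRnaForMaskedLM(ErnieRnaConfig())\n\ninput = tokenizer(\"ACG<mask>N\", return_tensors=\"pt\")\nwith torch.no_grad():\n    logits = model(**input).logits\n\n# index of the <mask> token and the id the (untrained) model assigns to it\nmask_pos = (input[\"input_ids\"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]\npredicted_id = int(logits[0, mask_pos].argmax(dim=-1))\nprint(tokenizer.convert_ids_to_tokens(predicted_id))\n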
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForSequencePrediction","title":"ErnieRnaForSequencePrediction","text":" Bases: ErnieRnaPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import ErnieRnaConfig, ErnieRnaForSequencePrediction, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaForSequencePrediction(ErnieRnaPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import ErnieRnaConfig, ErnieRnaForSequencePrediction, RnaTokenizer\n >>> config = ErnieRnaConfig()\n >>> model = ErnieRnaForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n \"\"\"\n\n def __init__(self, config: ErnieRnaConfig):\n super().__init__(config)\n self.ernierna = ErnieRnaModel(config)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_attention_biases: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ErnieRnaSequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.ernierna(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_attention_biases=output_attention_biases,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ErnieRnaSequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
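When a sequence-level label is supplied, the head also returns a loss. The pattern below mirrors the fine-tuning examples used elsewhere in this documentation; the label value is illustrative only:
Pythonimport torch\nfrom multimolecule import ErnieRnaConfig, ErnieRnaForSequencePrediction, RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nmodel = ErnieRnaForSequencePrediction(ErnieRnaConfig())\n\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\nlabel = torch.tensor([1])  # one label per sequence in the batch (illustrative)\n\noutput = model(**input, labels=label)\nprint(output.logits.shape)  # torch.Size([1, 1]) with the default single-label head\n# output.loss is populated because labels were provided\n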
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaForTokenPrediction","title":"ErnieRnaForTokenPrediction","text":" Bases: ErnieRnaPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import ErnieRnaConfig, ErnieRnaForTokenPrediction, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaForTokenPrediction(ErnieRnaPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import ErnieRnaConfig, ErnieRnaForTokenPrediction, RnaTokenizer\n >>> config = ErnieRnaConfig()\n >>> model = ErnieRnaForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: ErnieRnaConfig):\n super().__init__(config)\n self.num_labels = config.num_labels\n self.ernierna = ErnieRnaModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_attention_biases: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ErnieRnaTokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.ernierna(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_attention_biases=output_attention_biases,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ErnieRnaTokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel","title":"ErnieRnaModel","text":" Bases: ErnieRnaPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import ErnieRnaConfig, ErnieRnaModel, RnaTokenizer\n>>> config = ErnieRnaConfig()\n>>> model = ErnieRnaModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaModel(ErnieRnaPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import ErnieRnaConfig, ErnieRnaModel, RnaTokenizer\n >>> config = ErnieRnaConfig()\n >>> model = ErnieRnaModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 768])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 768])\n \"\"\"\n\n pairwise_bias_map: Tensor\n\n def __init__(\n self, config: ErnieRnaConfig, add_pooling_layer: bool = True, tokenizer: PreTrainedTokenizer | None = None\n ):\n super().__init__(config)\n if tokenizer is None:\n tokenizer = AutoTokenizer.from_pretrained(\"multimolecule/rna\")\n self.tokenizer = tokenizer\n self.pad_token_id = tokenizer.pad_token_id\n self.vocab_size = len(self.tokenizer)\n if self.vocab_size != config.vocab_size:\n raise ValueError(\n f\"Vocab size in tokenizer ({self.vocab_size}) does not match the one in config ({config.vocab_size})\"\n )\n token_to_ids = self.tokenizer._token_to_id\n tokens = sorted(token_to_ids, key=token_to_ids.get)\n pairwise_bias_dict = get_pairwise_bias_dict(config.pairwise_alpha)\n self.register_buffer(\n \"pairwise_bias_map\",\n torch.tensor([[pairwise_bias_dict.get(f\"{i}{j}\", 0) for i in tokens] for j in tokens]),\n persistent=False,\n )\n self.pairwise_bias_proj = nn.Sequential(\n nn.Linear(1, config.num_attention_heads // 2),\n nn.GELU(),\n nn.Linear(config.num_attention_heads // 2, config.num_attention_heads),\n )\n self.embeddings = ErnieRnaEmbeddings(config)\n self.encoder = ErnieRnaEncoder(config)\n self.pooler = ErnieRnaPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def get_pairwise_bias(\n self, input_ids: Tensor | NestedTensor, attention_mask: Tensor | NestedTensor | None = None\n ) -> Tensor | NestedTensor:\n batch_size, seq_len = input_ids.shape\n\n # Broadcasting data indices to compute indices\n data_index_x = input_ids.unsqueeze(2).expand(batch_size, seq_len, seq_len)\n data_index_y = input_ids.unsqueeze(1).expand(batch_size, seq_len, seq_len)\n\n # Get bias from pairwise_bias_map\n return self.pairwise_bias_map[data_index_x, data_index_y]\n\n # Zhiyuan: Is it really necessary to mask the bias?\n # The mask position should have been nan, and the implementation is incorrect anyway\n # if attention_mask is not None:\n # attention_mask = attention_mask.unsqueeze(1).expand(batch_size, seq_len, seq_len)\n # bias = bias * attention_mask\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] 
| None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_attention_biases: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ErnieRnaModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n pairwise_bias = self.get_pairwise_bias(input_ids, attention_mask)\n attention_bias = self.pairwise_bias_proj(pairwise_bias.unsqueeze(-1)).transpose(1, 3)\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a 
self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n attention_bias=attention_bias,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_attention_biases=output_attention_biases,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return ErnieRnaModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attention_biases=encoder_outputs.attention_biases,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
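Intermediate activations can be requested at call time. A short sketch, assuming the usual transformers convention that hidden_states contains the embedding output plus one entry per layer:
Pythonfrom multimolecule import ErnieRnaConfig, ErnieRnaModel, RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nmodel = ErnieRnaModel(ErnieRnaConfig())\n\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\noutput = model(**input, output_hidden_states=True, output_attentions=True)\n\nprint(len(output.hidden_states))  # embedding output + one entry per layer (13 with the default 12 layers)\nprint(output.attentions[0].shape)  # (batch_size, num_attention_heads, sequence_length, sequence_length)\n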
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_attention_biases: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | ErnieRnaModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name Type Description DefaultTensor | None
Shape: (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
None
Tensor | None
Shape: (batch_size, sequence_length)
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
None
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None
Tuple of length config.n_layers
with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1)
instead of all decoder_input_ids
of shape (batch_size, sequence_length)
.
None
bool | None
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see past_key_values
).
None
Source code in multimolecule/models/ernierna/modeling_ernierna.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_attention_biases: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | ErnieRnaModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n pairwise_bias = self.get_pairwise_bias(input_ids, attention_mask)\n attention_bias = self.pairwise_bias_proj(pairwise_bias.unsqueeze(-1)).transpose(1, 3)\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else 
inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n attention_bias=attention_bias,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_attention_biases=output_attention_biases,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return ErnieRnaModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attention_biases=encoder_outputs.attention_biases,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/ernierna/#multimolecule.models.ernierna.ErnieRnaPreTrainedModel","title":"ErnieRnaPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code inmultimolecule/models/ernierna/modeling_ernierna.py
Pythonclass ErnieRnaPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = ErnieRnaConfig\n base_model_prefix = \"ernierna\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"ErnieRnaLayer\", \"ErnieRnaEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
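All ErnieRna classes above derive from this base class, so they share the standard save_pretrained/from_pretrained interface. A sketch, assuming the multimolecule/ernierna checkpoint referenced in the configuration docs above is available on the Hugging Face Hub:
Pythonfrom multimolecule import ErnieRnaModel\n\n# load pretrained weights (assumes the `multimolecule/ernierna` checkpoint exists on the Hub)\nmodel = ErnieRnaModel.from_pretrained(\"multimolecule/ernierna\")\nmodel.save_pretrained(\"./ernierna-local\")\n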
"},{"location":"models/modeling_outputs/","title":"modeling_outputs","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs","title":"multimolecule.models.modeling_outputs","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput","title":"SequencePredictorOutput dataclass
","text":" Bases: ModelOutput
Base class for outputs of sequence classification & regression models.
Parameters:
Name Type Description DefaultFloatTensor | None
torch.FloatTensor
of shape (1,)
.
Optional, returned when labels
is provided
None
FloatTensor
torch.FloatTensor
of shape (batch_size, config.num_labels)
Prediction outputs.
None
Tuple[FloatTensor, ...] | None
Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
None
Tuple[FloatTensor, ...] | None
Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Optional, returned when output_attentions=True
is passed or when config.output_attentions=True
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
None
Source code in multimolecule/models/modeling_outputs.py
Python@dataclass\nclass SequencePredictorOutput(ModelOutput):\n \"\"\"\n Base class for outputs of sentence classification & regression models.\n\n Args:\n loss:\n `torch.FloatTensor` of shape `(1,)`.\n\n Optional, returned when `labels` is provided\n logits:\n `torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`\n\n Prediction outputs.\n hidden_states:\n Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +\n one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.\n\n Optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True\n\n Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.\n attentions:\n Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,\n sequence_length)`.\n\n Optional, eturned when `output_attentions=True` is passed or when `config.output_attentions=True`\n\n Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n heads.\n \"\"\"\n\n loss: torch.FloatTensor | None = None\n logits: torch.FloatTensor = None\n hidden_states: Tuple[torch.FloatTensor, ...] | None = None\n attentions: Tuple[torch.FloatTensor, ...] | None = None\n
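Like all ModelOutput subclasses from transformers, these outputs can be accessed by attribute or by key, and fields that were not requested or computed are simply None. A small sketch using the sequence predictor documented above:
Pythonfrom multimolecule import ErnieRnaConfig, ErnieRnaForSequencePrediction, RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nmodel = ErnieRnaForSequencePrediction(ErnieRnaConfig())\n\noutput = model(**tokenizer(\"ACGUN\", return_tensors=\"pt\"))\n\nprint(output.logits is output[\"logits\"])  # True: attribute and key access return the same tensor\nprint(output.loss)  # None, because no `labels` were passed\n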
"},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(loss)","title":"loss
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(logits)","title":"logits
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(hidden_states)","title":"hidden_states
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.SequencePredictorOutput(attentions)","title":"attentions
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput","title":"TokenPredictorOutput dataclass
","text":" Bases: ModelOutput
Base class for outputs of token classification & regression models.
Parameters:
Name Type Description DefaultFloatTensor | None
torch.FloatTensor
of shape (1,)
.
Optional, returned when labels
is provided
None
FloatTensor
torch.FloatTensor
of shape (batch_size, sequence_length, config.num_labels)
Prediction outputs.
None
Tuple[FloatTensor, ...] | None
Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
None
Tuple[FloatTensor, ...] | None
Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Optional, returned when output_attentions=True
is passed or when config.output_attentions=True
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
None
Source code in multimolecule/models/modeling_outputs.py
Python@dataclass\nclass TokenPredictorOutput(ModelOutput):\n \"\"\"\n Base class for outputs of token classification & regression models.\n\n Args:\n loss:\n `torch.FloatTensor` of shape `(1,)`.\n\n Optional, returned when `labels` is provided\n logits:\n `torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`\n\n Prediction outputs.\n hidden_states:\n Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +\n one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.\n\n Optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True\n\n Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.\n attentions:\n Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,\n sequence_length)`.\n\n Optional, eturned when `output_attentions=True` is passed or when `config.output_attentions=True`\n\n Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n heads.\n \"\"\"\n\n loss: torch.FloatTensor | None = None\n logits: torch.FloatTensor = None\n hidden_states: Tuple[torch.FloatTensor, ...] | None = None\n attentions: Tuple[torch.FloatTensor, ...] | None = None\n
"},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(loss)","title":"loss
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(logits)","title":"logits
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(hidden_states)","title":"hidden_states
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.TokenPredictorOutput(attentions)","title":"attentions
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput","title":"ContactPredictorOutput dataclass
","text":" Bases: ModelOutput
Base class for outputs of contact classification & regression models.
Parameters:
Name Type Description DefaultFloatTensor | None
torch.FloatTensor
of shape (1,)
.
Optional, returned when labels
is provided
None
FloatTensor
torch.FloatTensor
of shape (batch_size, sequence_length, sequence_length, config.num_labels)
Prediction outputs.
None
Tuple[FloatTensor, ...] | None
Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
None
Tuple[FloatTensor, ...] | None
Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Optional, returned when output_attentions=True
is passed or when config.output_attentions=True
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
None
Source code in multimolecule/models/modeling_outputs.py
Python@dataclass\nclass ContactPredictorOutput(ModelOutput):\n \"\"\"\n Base class for outputs of contact classification & regression models.\n\n Args:\n loss:\n `torch.FloatTensor` of shape `(1,)`.\n\n Optional, returned when `labels` is provided\n logits:\n `torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`\n\n Prediction outputs.\n hidden_states:\n Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +\n one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.\n\n Optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True\n\n Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.\n attentions:\n Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,\n sequence_length)`.\n\n Optional, eturned when `output_attentions=True` is passed or when `config.output_attentions=True`\n\n Attentions weights after the attention softmax, used to compute the weighted average in the self-attention\n heads.\n \"\"\"\n\n loss: torch.FloatTensor | None = None\n logits: torch.FloatTensor = None\n hidden_states: Tuple[torch.FloatTensor, ...] | None = None\n attentions: Tuple[torch.FloatTensor, ...] | None = None\n
"},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(loss)","title":"loss
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(logits)","title":"logits
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(hidden_states)","title":"hidden_states
","text":""},{"location":"models/modeling_outputs/#multimolecule.models.modeling_outputs.ContactPredictorOutput(attentions)","title":"attentions
","text":""},{"location":"models/rinalmo/","title":"RiNALMo","text":"Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.
"},{"location":"models/rinalmo/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks by Rafael Josip Peni\u0107, et al.
The OFFICIAL repository of RiNALMo is at lbcb-sci/RiNALMo.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing RiNALMo did not write a model card for this model, so this model card has been written by the MultiMolecule team.
"},{"location":"models/rinalmo/#model-details","title":"Model Details","text":"RiNALMo is a bert-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/rinalmo/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 33 1280 20 5120 650.88 168.92 84.43 1022"},{"location":"models/rinalmo/#links","title":"Links","text":"multimolecule/rinalmo
The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/rinalmo/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rinalmo\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.3932918310165405,\n 'token': 6,\n 'token_str': 'A',\n 'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.2897723913192749,\n 'token': 9,\n 'token_str': 'U',\n 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.15423105657100677,\n 'token': 22,\n 'token_str': 'X',\n 'sequence': 'G G U C X C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.12160095572471619,\n 'token': 7,\n 'token_str': 'C',\n 'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.0408296100795269,\n 'token': 8,\n 'token_str': 'G',\n 'sequence': 'G G U C G C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rinalmo/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rinalmo/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, RiNALMoModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoModel.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rinalmo/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RiNALMoForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoForSequencePrediction.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rinalmo/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RiNALMoForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoForTokenPrediction.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rinalmo/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RiNALMoForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rinalmo\")\nmodel = RiNALMoForContactPrediction.from_pretrained(\"multimolecule/rinalmo\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rinalmo/#training-details","title":"Training Details","text":"RiNALMo used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/rinalmo/#training-data","title":"Training Data","text":"The RiNALMo model was pre-trained on a cocktail of databases including RNAcentral, Rfam, Ensembl Genome Browser, and Nucleotide. The training data contains 36 million unique ncRNA sequences.
To ensure sequence diversity in each training batch, RiNALMo clustered the sequences with MMSeqs2 into 17 million clusters and then sampled each sequence in the batch from a different cluster.
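The sampling strategy can be sketched as follows, assuming a hypothetical clusters mapping from MMSeqs2 cluster identifiers to member sequences; this is an illustration of the idea, not RiNALMo's actual data pipeline:
Pythonimport random\n\n# hypothetical MMSeqs2 output: cluster identifier -> member sequences\nclusters = {\n    \"cluster_0\": [\"UAGCUUAUCAGACUGAUGUUGA\", \"UAGCUUAUCAGACUGAUGUUGG\"],\n    \"cluster_1\": [\"GGUCUCUCUGGUUAGACCAGAU\", \"GGUCCCUCUGGUUAGACCAGAU\"],\n    \"cluster_2\": [\"ACGUACGUACGUACGUACGU\"],\n}\n\n\ndef sample_batch(clusters, batch_size):\n    \"\"\"Draw every sequence in a batch from a different cluster.\"\"\"\n    names = random.sample(list(clusters), k=min(batch_size, len(clusters)))\n    return [random.choice(clusters[name]) for name in names]\n\n\nprint(sample_batch(clusters, batch_size=3))\n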
RiNALMo preprocessed all tokens by replacing \u201cU\u201ds with \u201cT\u201ds.
Note that during model conversions, \u201cT\u201d is replaced with \u201cU\u201d. RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False
.
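For example, mirroring the RnaTokenizer docstring examples further down this page:
Pythonfrom multimolecule import RnaTokenizer\n\n\ntokenizer = RnaTokenizer()\nprint(tokenizer(\"acgt\")[\"input_ids\"])  # \"t\" is tokenized as \"U\"\n\ntokenizer = RnaTokenizer(replace_T_with_U=False)\nprint(tokenizer(\"acgt\")[\"input_ids\"])  # \"t\" maps to the unknown token instead\n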
RiNALMo used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT: 15% of the tokens are masked and predominantly replaced by the <mask> token before the model is asked to recover them.
The model was trained on 7 NVIDIA A100 GPUs with 80 GiB of memory each.
BibTeX:
BibTeX@article{penic2024rinalmo,\n title={RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks},\n author={Peni\u0107, Rafael Josip and Vla\u0161i\u0107, Tin and Huber, Roland G. and Wan, Yue and \u0160iki\u0107, Mile},\n journal={arXiv preprint arXiv:2403.00043},\n year={2024}\n}\n
"},{"location":"models/rinalmo/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the RiNALMo paper for questions or comments on the paper/model.
"},{"location":"models/rinalmo/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo","title":"multimolecule.models.rinalmo","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name (Type, Default): Description
alphabet (Alphabet | str | List[str] | None, default: None): alphabet to use for tokenization. If None, the standard RNA alphabet will be used. If a string, it should correspond to the name of a predefined alphabet; the options include standard, extended, streamline, and nucleobase. If an alphabet or a list of characters, that specific alphabet will be used.
nmers (int, default: 1): Size of kmer to tokenize.
codon (bool, default: False): Whether to tokenize into codons.
replace_T_with_U (bool, default: True): Whether to replace T with U.
do_upper_case (bool, default: True): Whether to convert input to uppercase.
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig","title":"RiNALMoConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a RiNALMoModel
. It is used to instantiate a RiNALMo model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RiNALMo lbcb-sci/RiNALMo architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name (Type, Default): Description
vocab_size (int, default: 26): Vocabulary size of the RiNALMo model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling RiNALMoModel.
hidden_size (int, default: 1280): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, default: 33): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, default: 20): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, default: 5120): Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_dropout (float, default: 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (float, default: 0.1): The dropout ratio for the attention probabilities.
max_position_embeddings (int, default: 1024): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
initializer_range (float, default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, default: 1e-12): The epsilon used by the layer normalization layers.
position_embedding_type (str, default: 'rotary'): Type of position embedding. Choose one of "absolute", "relative_key", "relative_key_query", "rotary". For positional embeddings use "absolute". For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on "relative_key_query", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
is_decoder (bool, default: False): Whether the model is used as a decoder or not. If False, the model is used as an encoder.
use_cache (bool, default: True): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
emb_layer_norm_before (bool, default: True): Whether to apply layer normalization after embeddings but before the main stem of the network.
token_dropout (bool, default: True): When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
Examples:
Python Console Session>>> from multimolecule import RiNALMoModel, RiNALMoConfig\n>>> # Initializing a RiNALMo multimolecule/rinalmo style configuration\n>>> configuration = RiNALMoConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rinalmo style configuration\n>>> model = RiNALMoModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rinalmo/configuration_rinalmo.py
Pythonclass RiNALMoConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`RiNALMoModel`][multimolecule.models.RiNALMoModel].\n It is used to instantiate a RiNALMo model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the RiNALMo\n [lbcb-sci/RiNALMo](https://github.com/lbcb-sci/RiNALMo) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the RiNALMo model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`RiNALMoModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n position_embedding_type:\n Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n is_decoder:\n Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n use_cache:\n Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n relevant if `config.is_decoder=True`.\n emb_layer_norm_before:\n Whether to apply layer normalization after embeddings but before the main stem of the network.\n token_dropout:\n When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n Examples:\n >>> from multimolecule import RiNALMoModel, RiNALMoConfig\n >>> # Initializing a RiNALMo multimolecule/rinalmo style configuration\n >>> configuration = RiNALMoConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/rinalmo style configuration\n >>> model = RiNALMoModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"rinalmo\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n hidden_size: int = 1280,\n num_hidden_layers: int = 33,\n num_attention_heads: int = 20,\n intermediate_size: int = 5120,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1024,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"rotary\",\n is_decoder: bool = False,\n use_cache: bool = True,\n emb_layer_norm_before: bool = True,\n learnable_beta: bool = True,\n token_dropout: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n self.vocab_size = vocab_size\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.learnable_beta = learnable_beta\n self.token_dropout = token_dropout\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n self.emb_layer_norm_before = emb_layer_norm_before\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(position_embedding_type)","title":"position_embedding_type
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(is_decoder)","title":"is_decoder
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(use_cache)","title":"use_cache
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(emb_layer_norm_before)","title":"emb_layer_norm_before
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoConfig(token_dropout)","title":"token_dropout
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForContactPrediction","title":"RiNALMoForContactPrediction","text":" Bases: RiNALMoPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RiNALMoConfig, RiNALMoForContactPrediction, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythonclass RiNALMoForContactPrediction(RiNALMoPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RiNALMoConfig, RiNALMoForContactPrediction, RnaTokenizer\n >>> config = RiNALMoConfig()\n >>> model = RiNALMoForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RiNALMoConfig):\n super().__init__(config)\n self.rinalmo = RiNALMoModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rinalmo(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForMaskedLM","title":"RiNALMoForMaskedLM","text":" Bases: RiNALMoPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RiNALMoConfig, RiNALMoForMaskedLM, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythonclass RiNALMoForMaskedLM(RiNALMoPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RiNALMoConfig, RiNALMoForMaskedLM, RnaTokenizer\n >>> config = RiNALMoConfig()\n >>> model = RiNALMoForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: RiNALMoConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `RiNALMoForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.rinalmo = RiNALMoModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rinalmo(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForSequencePrediction","title":"RiNALMoForSequencePrediction","text":" Bases: RiNALMoPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RiNALMoConfig, RiNALMoForSequencePrediction, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythonclass RiNALMoForSequencePrediction(RiNALMoPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RiNALMoConfig, RiNALMoForSequencePrediction, RnaTokenizer\n >>> config = RiNALMoConfig()\n >>> model = RiNALMoForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RiNALMoConfig):\n super().__init__(config)\n self.rinalmo = RiNALMoModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rinalmo(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoForTokenPrediction","title":"RiNALMoForTokenPrediction","text":" Bases: RiNALMoPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RiNALMoConfig, RiNALMoForTokenPrediction, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythonclass RiNALMoForTokenPrediction(RiNALMoPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RiNALMoConfig, RiNALMoForTokenPrediction, RnaTokenizer\n >>> config = RiNALMoConfig()\n >>> model = RiNALMoForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RiNALMoConfig):\n super().__init__(config)\n self.rinalmo = RiNALMoModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rinalmo(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel","title":"RiNALMoModel","text":" Bases: RiNALMoPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RiNALMoConfig, RiNALMoModel, RnaTokenizer\n>>> config = RiNALMoConfig()\n>>> model = RiNALMoModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 1280])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 1280])\n
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythonclass RiNALMoModel(RiNALMoPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RiNALMoConfig, RiNALMoModel, RnaTokenizer\n >>> config = RiNALMoConfig()\n >>> model = RiNALMoModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 1280])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 1280])\n \"\"\"\n\n def __init__(self, config: RiNALMoConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = RiNALMoEmbeddings(config)\n self.encoder = RiNALMoEncoder(config)\n self.pooler = RiNALMoPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name (Type, Default): Description
encoder_hidden_states (Tensor | None, default: None): Shape: (batch_size, sequence_length, hidden_size). Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
encoder_attention_mask (Tensor | None, default: None): Shape: (batch_size, sequence_length). Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
past_key_values (Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None, default: None): Tuple of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head). Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
use_cache (bool | None, default: None): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
Source code in multimolecule/models/rinalmo/modeling_rinalmo.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/rinalmo/#multimolecule.models.rinalmo.RiNALMoPreTrainedModel","title":"RiNALMoPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code inmultimolecule/models/rinalmo/modeling_rinalmo.py
Pythonclass RiNALMoPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = RiNALMoConfig\n base_model_prefix = \"rinalmo\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"RiNALMoLayer\", \"RiNALMoEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
"},{"location":"models/rnabert/","title":"RNABERT","text":"Pre-trained model on non-coding RNA (ncRNA) using masked language modeling (MLM) and structural alignment learning (SAL) objectives.
"},{"location":"models/rnabert/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the Informative RNA-base embedding for functional RNA clustering and structural alignment by Manato Akiyama and Yasubumi Sakakibara.
The OFFICIAL repository of RNABERT is at mana438/RNABERT.
Caution
The MultiMolecule team is aware of a potential risk in reproducing the results of RNABERT.
The original implementation of RNABERT does not prepend <cls>
and append <eos>
tokens to the input sequence. This should not affect the performance of the model in most cases, but it can lead to unexpected behavior in some cases.
Please set cls_token=None
and eos_token=None
explicitly in the tokenizer if you want the exact behavior of the original implementation.
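A minimal sketch of this workaround, assuming the multimolecule/rnabert checkpoint:
Pythonfrom multimolecule import RnaTokenizer\n\n\n# disable <cls>/<eos> to match the original RNABERT implementation, as advised above\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\", cls_token=None, eos_token=None)\nprint(tokenizer(\"UAGCUUAUCAGACUGAUGUUGA\")[\"input_ids\"])\n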
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing RNABERT did not write a model card for this model, so this model card has been written by the MultiMolecule team.
"},{"location":"models/rnabert/#model-details","title":"Model Details","text":"RNABERT is a bert-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/rnabert/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 6 120 12 40 0.48 0.15 0.08 440"},{"location":"models/rnabert/#links","title":"Links","text":"The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/rnabert/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnabert\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.03852083534002304,\n 'token': 24,\n 'token_str': '-',\n 'sequence': 'G G U C - C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.03851056098937988,\n 'token': 10,\n 'token_str': 'N',\n 'sequence': 'G G U C N C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.03849703073501587,\n 'token': 25,\n 'token_str': 'I',\n 'sequence': 'G G U C I C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.03848597779870033,\n 'token': 3,\n 'token_str': '<unk>',\n 'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.038484156131744385,\n 'token': 5,\n 'token_str': '<null>',\n 'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnabert/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnabert/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, RnaBertModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertModel.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnabert/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertForSequencePrediction.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnabert/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaBertForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertForTokenPrediction.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnabert/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaBertForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnabert\")\nmodel = RnaBertForContactPrediction.from_pretrained(\"multimolecule/rnabert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnabert/#training-details","title":"Training Details","text":"RNABERT has two pre-training objectives: masked language modeling (MLM) and structural alignment learning (SAL).
The RNABERT model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.
RNABERT used a subset of 76, 237 human ncRNA sequences from RNAcentral for pre-training. RNABERT preprocessed all tokens by replacing \u201cU\u201ds with \u201cT\u201ds.
Note that during model conversions, \u201cT\u201d is replaced with \u201cU\u201d. RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you, you may disable this behaviour by passing replace_T_with_U=False
.
RNABERT preprocess the dataset by applying 10 different mask patterns to the 72, 237 human ncRNA sequences. The final dataset contains 722, 370 sequences. The masking procedure is similar to the one used in BERT:
<mask>
.The model was trained on 1 NVIDIA V100 GPU.
"},{"location":"models/rnabert/#citation","title":"Citation","text":"BibTeX:
BibTeX@article{akiyama2022informative,\n author = {Akiyama, Manato and Sakakibara, Yasubumi},\n title = \"{Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning}\",\n journal = {NAR Genomics and Bioinformatics},\n volume = {4},\n number = {1},\n pages = {lqac012},\n year = {2022},\n month = {02},\n abstract = \"{Effective embedding is actively conducted by applying deep learning to biomolecular information. Obtaining better embeddings enhances the quality of downstream analyses, such as DNA sequence motif detection and protein function prediction. In this study, we adopt a pre-training algorithm for the effective embedding of RNA bases to acquire semantically rich representations and apply this algorithm to two fundamental RNA sequence problems: structural alignment and clustering. By using the pre-training algorithm to embed the four bases of RNA in a position-dependent manner using a large number of RNA sequences from various RNA families, a context-sensitive embedding representation is obtained. As a result, not only base information but also secondary structure and context information of RNA sequences are embedded for each base. We call this \u2018informative base embedding\u2019 and use it to achieve accuracies superior to those of existing state-of-the-art methods on RNA structural alignment and RNA family clustering tasks. Furthermore, upon performing RNA sequence alignment by combining this informative base embedding with a simple Needleman\u2013Wunsch alignment algorithm, we succeed in calculating structural alignments with a time complexity of O(n2) instead of the O(n6) time complexity of the naive implementation of Sankoff-style algorithm for input RNA sequence of length n.}\",\n issn = {2631-9268},\n doi = {10.1093/nargab/lqac012},\n url = {https://doi.org/10.1093/nargab/lqac012},\n eprint = {https://academic.oup.com/nargab/article-pdf/4/1/lqac012/42577168/lqac012.pdf},\n}\n
"},{"location":"models/rnabert/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the RNABERT paper for questions or comments on the paper/model.
"},{"location":"models/rnabert/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert","title":"multimolecule.models.rnabert","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig","title":"RnaBertConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a RnaBertModel
. It is used to instantiate a RnaBert model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaBert mana438/RNABERT architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint
Vocabulary size of the RnaBert model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [RnaBertModel
].
26
int | None
Dimensionality of the encoder layers and the pooler layer.
None
int
Number of hidden layers in the Transformer encoder.
6
int
Number of attention heads for each attention layer in the Transformer encoder.
12
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
40
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.0
float
The dropout ratio for the attention probabilities.
0.0
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
440
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
Examples:
Python Console Session>>> from multimolecule import RnaBertModel, RnaBertConfig\n>>> # Initializing a RNABERT multimolecule/rnabert style configuration\n>>> configuration = RnaBertConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnabert style configuration\n>>> model = RnaBertModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnabert/configuration_rnabert.py
Pythonclass RnaBertConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`RnaBertModel`][multimolecule.models.RnaBertModel].\n It is used to instantiate a RnaBert model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaBert\n [mana438/RNABERT](https://github.com/mana438/RNABERT) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the RnaBert model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`RnaBertModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n\n Examples:\n >>> from multimolecule import RnaBertModel, RnaBertConfig\n >>> # Initializing a RNABERT multimolecule/rnabert style configuration\n >>> configuration = RnaBertConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/rnabert style configuration\n >>> model = RnaBertModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"rnabert\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n ss_vocab_size: int = 8,\n hidden_size: int | None = None,\n multiple: int | None = None,\n num_hidden_layers: int = 6,\n num_attention_heads: int = 12,\n intermediate_size: int = 40,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.0,\n attention_dropout: float = 0.0,\n max_position_embeddings: int = 440,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"absolute\",\n is_decoder: bool = False,\n use_cache: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n if hidden_size is None:\n hidden_size = num_attention_heads * multiple if multiple is not None else 120\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.ss_vocab_size = ss_vocab_size\n self.type_vocab_size = 2\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n 
self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForContactPrediction","title":"RnaBertForContactPrediction","text":" Bases: RnaBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaBertConfig, RnaBertForContactPrediction, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertForContactPrediction(RnaBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaBertConfig, RnaBertForContactPrediction, RnaTokenizer\n >>> config = RnaBertConfig()\n >>> model = RnaBertForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaBertConfig):\n super().__init__(config)\n self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnabert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForMaskedLM","title":"RnaBertForMaskedLM","text":" Bases: RnaBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaBertConfig, RnaBertForMaskedLM, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertForMaskedLM(RnaBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaBertConfig, RnaBertForMaskedLM, RnaTokenizer\n >>> config = RnaBertConfig()\n >>> model = RnaBertForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: RnaBertConfig):\n super().__init__(config)\n self.rnabert = RnaBertModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool = False,\n output_hidden_states: bool = False,\n return_dict: bool = True,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnabert(\n input_ids,\n attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForPreTraining","title":"RnaBertForPreTraining","text":" Bases: RnaBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaBertConfig, RnaBertForPreTraining, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits_mlm\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"logits_ss\"].shape\ntorch.Size([1, 7, 8])\n>>> output[\"logits_sal\"].shape\ntorch.Size([1, 2])\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertForPreTraining(RnaBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaBertConfig, RnaBertForPreTraining, RnaTokenizer\n >>> config = RnaBertConfig()\n >>> model = RnaBertForPreTraining(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<AddBackward0>)\n >>> output[\"logits_mlm\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"logits_ss\"].shape\n torch.Size([1, 7, 8])\n >>> output[\"logits_sal\"].shape\n torch.Size([1, 2])\n \"\"\"\n\n _tied_weights_keys = [\n \"lm_head.decoder.weight\",\n \"lm_head.decoder.bias\",\n \"pretrain.predictions.decoder.weight\",\n \"pretrain.predictions.decoder.bias\",\n \"pretrain.predictions_ss.decoder.weight\",\n \"pretrain.predictions_ss.decoder.bias\",\n ]\n\n def __init__(self, config: RnaBertConfig):\n super().__init__(config)\n self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n self.pretrain = RnaBertPreTrainingHeads(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n labels_mlm: Tensor | None = None,\n labels_ss: Tensor | None = None,\n labels_sal: Tensor | None = None,\n output_attentions: bool = False,\n output_hidden_states: bool = False,\n return_dict: bool = True,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaBertForPreTrainingOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnabert(\n input_ids,\n attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n total_loss, logits_mlm, logits_ss, logits_sal = self.pretrain(\n outputs, labels_mlm=labels_mlm, labels_ss=labels_ss, labels_sal=labels_sal\n )\n\n if not return_dict:\n output = (logits_mlm, logits_ss, logits_sal) + outputs[2:]\n return ((total_loss,) + output) if total_loss is not None else output\n\n return RnaBertForPreTrainingOutput(\n loss=total_loss,\n logits_mlm=logits_mlm,\n logits_ss=logits_ss,\n logits_sal=logits_sal,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForSequencePrediction","title":"RnaBertForSequencePrediction","text":" Bases: RnaBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaBertConfig, RnaBertForSequencePrediction, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertForSequencePrediction(RnaBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaBertConfig, RnaBertForSequencePrediction, RnaTokenizer\n >>> config = RnaBertConfig()\n >>> model = RnaBertForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaBertConfig):\n super().__init__(config)\n self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnabert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertForTokenPrediction","title":"RnaBertForTokenPrediction","text":" Bases: RnaBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaBertConfig, RnaBertForTokenPrediction, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertForTokenPrediction(RnaBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaBertConfig, RnaBertForTokenPrediction, RnaTokenizer\n >>> config = RnaBertConfig()\n >>> model = RnaBertForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaBertConfig):\n super().__init__(config)\n self.rnabert = RnaBertModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnabert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel","title":"RnaBertModel","text":" Bases: RnaBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaBertConfig, RnaBertModel, RnaTokenizer\n>>> config = RnaBertConfig()\n>>> model = RnaBertModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 120])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 120])\n
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertModel(RnaBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaBertConfig, RnaBertModel, RnaTokenizer\n >>> config = RnaBertConfig()\n >>> model = RnaBertModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 120])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 120])\n \"\"\"\n\n def __init__(self, config: RnaBertConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = RnaBertEmbeddings(config)\n self.encoder = RnaBertEncoder(config)\n self.pooler = RnaBertPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name Type Description DefaultTensor | None
Shape: (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
None
Tensor | None
Shape: (batch_size, sequence_length)
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]
:
None
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None
Tuple of length config.n_layers
with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1)
instead of all decoder_input_ids
of shape (batch_size, sequence_length)
.
None
bool | None
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see past_key_values
).
None
Source code in multimolecule/models/rnabert/modeling_rnabert.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/rnabert/#multimolecule.models.rnabert.RnaBertPreTrainedModel","title":"RnaBertPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code inmultimolecule/models/rnabert/modeling_rnabert.py
Pythonclass RnaBertPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = RnaBertConfig\n base_model_prefix = \"rnabert\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"RnaBertLayer\", \"RnaBertEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
"},{"location":"models/rnaernie/","title":"RNAErnie","text":"Pre-trained model on non-coding RNA (ncRNA) using a multi-stage masked language modeling (MLM) objective.
"},{"location":"models/rnaernie/#statement","title":"Statement","text":"Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.
Machine learning has been at the forefront of the movement for free and open access to research.
We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.
The MultiMolecule team is committed to the principles of open access and open science.
We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.
Please consider signing the Statement on Nature Machine Intelligence.
"},{"location":"models/rnaernie/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the RNAErnie: An RNA Language Model with Structure-enhanced Representations by Ning Wang, Jiang Bian, Haoyi Xiong, et al.
The OFFICIAL repository of RNAErnie is at CatIIIIIIII/RNAErnie.
Warning
The MultiMolecule team is unable to confirm that the provided model and checkpoints are producing the same intermediate representations as the original implementation. This is because
The proposed method is published in a Closed Access / Author-Fee journal.
The team releasing RNAErnie did not write this model card for this model so this model card has been written by the MultiMolecule team.
"},{"location":"models/rnaernie/#model-details","title":"Model Details","text":"RNAErnie is a bert-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
Note that during the conversion process, additional tokens such as [IND]
and ncRNA class symbols are removed.
multimolecule/rnaernie
The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/rnaernie/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnaernie\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.09252794831991196,\n 'token': 8,\n 'token_str': 'G',\n 'sequence': 'G G U C G C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.09062391519546509,\n 'token': 11,\n 'token_str': 'R',\n 'sequence': 'G G U C R C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.08875908702611923,\n 'token': 6,\n 'token_str': 'A',\n 'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07809742540121078,\n 'token': 20,\n 'token_str': 'V',\n 'sequence': 'G G U C V C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07325706630945206,\n 'token': 13,\n 'token_str': 'S',\n 'sequence': 'G G U C S C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnaernie/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnaernie/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, RnaErnieModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieModel.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnaernie/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaErnieForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieForSequencePrediction.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnaernie/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaErnieForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieForTokenPrediction.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnaernie/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaErnieForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnaernie\")\nmodel = RnaErnieForContactPrediction.from_pretrained(\"multimolecule/rnaernie\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnaernie/#training-details","title":"Training Details","text":"RNAErnie used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/rnaernie/#training-data","title":"Training Data","text":"The RNAErnie model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.
RNAErnie used a subset of RNAcentral for pre-training. The subset contains 23 million sequences. RNAErnie preprocessed all tokens by replacing \u201cT\u201ds with \u201cS\u201ds.
Note that RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you, you may disable this behaviour by passing replace_T_with_U=False
.
RNAErnie used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
<mask>
.RNAErnie uses a special 3-stage training pipeline to pre-train the model, each with a different masking strategy:
Base-level Masking: The masking applies to each nucleotide in the sequence. Subsequence-level Masking: The masking applies to subsequences of 4-8bp in the sequence. Motif-level Masking: The model is trained on motif datasets.
The model was trained on 4 NVIDIA V100 GPUs with 32GiB memories.
Citation information is not available for papers published in Closed Access / Author-Fee journals.
"},{"location":"models/rnaernie/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the RNAErnie paper for questions or comments on the paper/model.
"},{"location":"models/rnaernie/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie","title":"multimolecule.models.rnaernie","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig","title":"RnaErnieConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a RnaErnieModel
. It is used to instantiate a RnaErnie model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaErnie Bruce-ywj/rnaernie architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint
Vocabulary size of the RnaErnie model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [RnaErnieModel
].
26
int
Dimensionality of the encoder layers and the pooler layer.
768
int
Number of hidden layers in the Transformer encoder.
12
int
Number of attention heads for each attention layer in the Transformer encoder.
12
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
3072
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
513
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
Examples:
Python Console Session>>> from multimolecule import RnaErnieModel, RnaErnieConfig\n>>> # Initializing a rnaernie multimolecule/rnaernie style configuration\n>>> configuration = RnaErnieConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnaernie style configuration\n>>> model = RnaErnieModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
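The defaults listed above can be overridden at construction time; a minimal sketch building a smaller, hypothetical variant for experimentation:
Python Console Session>>> from multimolecule import RnaErnieConfig, RnaErnieModel\n>>> # shrink the encoder; hidden_size must stay divisible by num_attention_heads\n>>> config = RnaErnieConfig(num_hidden_layers=6, hidden_size=384, num_attention_heads=12, intermediate_size=1536)\n>>> model = RnaErnieModel(config)\n>>> model.config.num_hidden_layers\n6\n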
Source code in multimolecule/models/rnaernie/configuration_rnaernie.py
Pythonclass RnaErnieConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a\n [`RnaErnieModel`][multimolecule.models.RnaErnieModel]. It is used to instantiate a RnaErnie model according to the\n specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a\n similar configuration to that of the RnaErnie [Bruce-ywj/rnaernie](https://github.com/Bruce-ywj/rnaernie)\n architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the RnaErnie model. Defines the number of different tokens that can be represented by\n the `inputs_ids` passed when calling [`RnaErnieModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n\n Examples:\n >>> from multimolecule import RnaErnieModel, RnaErnieConfig\n >>> # Initializing a rnaernie multimolecule/rnaernie style configuration\n >>> configuration = RnaErnieConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/rnaernie style configuration\n >>> model = RnaErnieModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"rnaernie\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n hidden_size: int = 768,\n num_hidden_layers: int = 12,\n num_attention_heads: int = 12,\n intermediate_size: int = 3072,\n hidden_act: str = \"relu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 513,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"absolute\",\n is_decoder: bool = False,\n use_cache: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.type_vocab_size = 2\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.head = 
HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForContactPrediction","title":"RnaErnieForContactPrediction","text":" Bases: RnaErniePreTrainedModel
Examples:
Python Console Session>>> import torch\n>>> from multimolecule import RnaErnieConfig, RnaErnieForContactPrediction, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythonclass RnaErnieForContactPrediction(RnaErniePreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaErnieConfig, RnaErnieForContactPrediction, RnaTokenizer\n >>> config = RnaErnieConfig()\n >>> model = RnaErnieForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaErnieConfig):\n super().__init__(config)\n self.rnaernie = RnaErnieModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnaernie(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForMaskedLM","title":"RnaErnieForMaskedLM","text":" Bases: RnaErniePreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaErnieConfig, RnaErnieForMaskedLM, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythonclass RnaErnieForMaskedLM(RnaErniePreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaErnieConfig, RnaErnieForMaskedLM, RnaTokenizer\n >>> config = RnaErnieConfig()\n >>> model = RnaErnieForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.bias\", \"lm_head.decoder.weight\"]\n\n def __init__(self, config: RnaErnieConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `RnaErnieForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.rnaernie = RnaErnieModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_output_embeddings(self):\n return self.lm_head.decoder\n\n def set_output_embeddings(self, new_embeddings):\n self.lm_head.decoder = new_embeddings\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnaernie(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForSequencePrediction","title":"RnaErnieForSequencePrediction","text":" Bases: RnaErniePreTrainedModel
Examples:
Python Console Session>>> import torch\n>>> from multimolecule import RnaErnieConfig, RnaErnieForSequencePrediction, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythonclass RnaErnieForSequencePrediction(RnaErniePreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaErnieConfig, RnaErnieForSequencePrediction, RnaTokenizer\n >>> config = RnaErnieConfig()\n >>> model = RnaErnieForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config):\n super().__init__(config)\n self.rnaernie = RnaErnieModel(config)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnaernie(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieForTokenPrediction","title":"RnaErnieForTokenPrediction","text":" Bases: RnaErniePreTrainedModel
Examples:
Python Console Session>>> import torch\n>>> from multimolecule import RnaErnieConfig, RnaErnieForTokenPrediction, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythonclass RnaErnieForTokenPrediction(RnaErniePreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaErnieConfig, RnaErnieForTokenPrediction, RnaTokenizer\n >>> config = RnaErnieConfig()\n >>> model = RnaErnieForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaErnieConfig):\n super().__init__(config)\n self.rnaernie = RnaErnieModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnaernie(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel","title":"RnaErnieModel","text":" Bases: RnaErniePreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaErnieConfig, RnaErnieModel, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythonclass RnaErnieModel(RnaErniePreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaErnieConfig, RnaErnieModel, RnaTokenizer\n >>> config = RnaErnieConfig()\n >>> model = RnaErnieModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 768])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 768])\n \"\"\"\n\n def __init__(self, config: RnaErnieConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n\n self.embeddings = RnaErnieEmbeddings(config)\n self.encoder = RnaErnieEncoder(config)\n\n self.pooler = RnaErniePooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name Type Description DefaultTensor | None
Shape: (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
None
Tensor | None
Shape: (batch_size, sequence_length)
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]
:
None
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None
Tuple of length config.n_layers
with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1)
instead of all decoder_input_ids
of shape (batch_size, sequence_length)
.
None
bool | None
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see past_key_values
).
None
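To illustrate some of the arguments above, the following sketch (with randomly initialized weights and the default 12-layer configuration) requests hidden states and attention maps in addition to the final hidden states:
Python Console Session>>> from multimolecule import RnaErnieConfig, RnaErnieModel, RnaTokenizer\n>>> config = RnaErnieConfig()\n>>> model = RnaErnieModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, output_hidden_states=True, output_attentions=True)\n>>> len(output[\"hidden_states\"])  # embedding output plus one entry per layer\n13\n>>> output[\"attentions\"][0].shape\ntorch.Size([1, 12, 7, 7])\n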
Source code in multimolecule/models/rnaernie/modeling_rnaernie.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErnieModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/rnaernie/#multimolecule.models.rnaernie.RnaErniePreTrainedModel","title":"RnaErniePreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code inmultimolecule/models/rnaernie/modeling_rnaernie.py
Pythonclass RnaErniePreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = RnaErnieConfig\n base_model_prefix = \"rnaernie\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"RnaErnieLayer\", \"RnaErnieEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n\n def _set_gradient_checkpointing(self, module, value=False):\n if isinstance(module, RnaErnieEncoder):\n module.gradient_checkpointing = value\n
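In practice, models derived from this class are usually loaded with from_pretrained; a minimal sketch, assuming the multimolecule/rnaernie checkpoint referenced in the configuration docs above:
Pythonfrom multimolecule import RnaErnieModel\n\n# downloads the configuration and weights, then initializes the backbone\nmodel = RnaErnieModel.from_pretrained(\"multimolecule/rnaernie\")\n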
"},{"location":"models/rnafm/","title":"RNA-FM","text":"Pre-trained model on non-coding RNA (ncRNA) using a masked language modeling (MLM) objective.
"},{"location":"models/rnafm/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions by Jiayang Chen, Zhihang Hue, Siqi Sun, et al.
The OFFICIAL repository of RNA-FM is at ml4bio/RNA-FM.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing RNA-FM did not write this model card, so it has been written by the MultiMolecule team.
"},{"location":"models/rnafm/#model-details","title":"Model Details","text":"RNA-FM is a bert-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/rnafm/#variations","title":"Variations","text":"multimolecule/rnafm
: The RNA-FM model pre-trained on non-coding RNA sequences.multimolecule/mrnafm
: The RNA-FM model pre-trained on mRNA coding sequences.The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/rnafm/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnafm\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.2752501964569092,\n 'token': 21,\n 'token_str': '.',\n 'sequence': 'G G U C. C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.22108642756938934,\n 'token': 23,\n 'token_str': '*',\n 'sequence': 'G G U C * C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.18201279640197754,\n 'token': 25,\n 'token_str': 'I',\n 'sequence': 'G G U C I C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.10875876247882843,\n 'token': 9,\n 'token_str': 'U',\n 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.08898332715034485,\n 'token': 6,\n 'token_str': 'A',\n 'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnafm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnafm/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, RnaFmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmModel.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnafm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaFmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmForSequencePrediction.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnafm/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaFmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmForTokenPrediction.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnafm/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaFmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\nmodel = RnaFmForContactPrediction.from_pretrained(\"multimolecule/rnafm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnafm/#training-details","title":"Training Details","text":"RNA-FM used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/rnafm/#training-data","title":"Training Data","text":"The RNA-FM model was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types.
RNA-FM applied CD-HIT (CD-HIT-EST) with a cut-off at 100% sequence identity to remove redundancy from the RNAcentral. The final dataset contains 23.7 million non-redundant RNA sequences.
RNA-FM preprocessed all tokens by replacing \u201cU\u201ds with \u201cT\u201ds.
Note that during model conversions, \u201cT\u201d is replaced with \u201cU\u201d. RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you, you may disable this behaviour by passing replace_T_with_U=False
.
RNA-FM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
<mask>
.The model was trained on 8 NVIDIA A100 GPUs with 80GiB memories.
BibTeX:
BibTeX@article{chen2022interpretable,\n title={Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions},\n author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},\n journal={arXiv preprint arXiv:2204.00300},\n year={2022}\n}\n
"},{"location":"models/rnafm/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the RNA-FM paper for questions or comments on the paper/model.
"},{"location":"models/rnafm/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm","title":"multimolecule.models.rnafm","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig","title":"RnaFmConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a RnaFmModel
. It is used to instantiate a RNA-FM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RNA-FM ml4bio/RNA-FM architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint | None
Vocabulary size of the RNA-FM model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [RnaFmModel
]. Defaults to 25 if codon=False
else 131.
None
bool
Whether to use codon tokenization.
False
int
Dimensionality of the encoder layers and the pooler layer.
640
int
Number of hidden layers in the Transformer encoder.
12
int
Number of attention heads for each attention layer in the Transformer encoder.
20
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
5120
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
1026
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
str
Type of position embedding. Choose one of \"absolute\"
, \"relative_key\"
, \"relative_key_query\", \"rotary\"
. For positional embeddings use \"absolute\"
. For more information on \"relative_key\"
, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\"
, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
'absolute'
bool
Whether the model is used as a decoder or not. If False
, the model is used as an encoder.
False
bool
Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True
.
True
bool
Whether to apply layer normalization after embeddings but before the main stem of the network.
True
bool
When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
False
Examples:
Python Console Session>>> from multimolecule import RnaFmModel, RnaFmConfig\n>>> # Initializing a RNA-FM multimolecule/rnafm style configuration\n>>> configuration = RnaFmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnafm style configuration\n>>> model = RnaFmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnafm/configuration_rnafm.py
Pythonclass RnaFmConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`RnaFmModel`][multimolecule.models.RnaFmModel].\n It is used to instantiate a RNA-FM model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the RNA-FM\n [ml4bio/RNA-FM](https://github.com/ml4bio/RNA-FM) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the RNA-FM model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`RnaFmModel`].\n Defaults to 25 if `codon=False` else 131.\n codon:\n Whether to use codon tokenization.\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n position_embedding_type:\n Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n is_decoder:\n Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n use_cache:\n Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n relevant if `config.is_decoder=True`.\n emb_layer_norm_before:\n Whether to apply layer normalization after embeddings but before the main stem of the network.\n token_dropout:\n When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n Examples:\n >>> from multimolecule import RnaFmModel, RnaFmConfig\n >>> # Initializing a RNA-FM multimolecule/rnafm style configuration\n >>> configuration = RnaFmConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/rnafm style configuration\n >>> model = RnaFmModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"rnafm\"\n\n def __init__(\n self,\n vocab_size: int | None = None,\n codon: bool = False,\n hidden_size: int = 640,\n num_hidden_layers: int = 12,\n num_attention_heads: int = 20,\n intermediate_size: int = 5120,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1026,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"absolute\",\n is_decoder: bool = False,\n use_cache: bool = True,\n emb_layer_norm_before: bool = True,\n token_dropout: bool = False,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n if vocab_size is None:\n vocab_size = 131 if codon else 26\n self.vocab_size = vocab_size\n self.codon = codon\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.emb_layer_norm_before = emb_layer_norm_before\n self.token_dropout = token_dropout\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(codon)","title":"codon
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(position_embedding_type)","title":"position_embedding_type
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(is_decoder)","title":"is_decoder
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(use_cache)","title":"use_cache
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(emb_layer_norm_before)","title":"emb_layer_norm_before
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmConfig(token_dropout)","title":"token_dropout
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForContactPrediction","title":"RnaFmForContactPrediction","text":" Bases: RnaFmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaFmConfig, RnaFmForContactPrediction, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmForContactPrediction(RnaFmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaFmConfig, RnaFmForContactPrediction, RnaTokenizer\n >>> config = RnaFmConfig()\n >>> model = RnaFmForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaFmConfig):\n super().__init__(config)\n self.rnafm = RnaFmModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnafm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForMaskedLM","title":"RnaFmForMaskedLM","text":" Bases: RnaFmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaFmConfig, RnaFmForMaskedLM, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmForMaskedLM(RnaFmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaFmConfig, RnaFmForMaskedLM, RnaTokenizer\n >>> config = RnaFmConfig()\n >>> model = RnaFmForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: RnaFmConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `RnaFmForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.rnafm = RnaFmModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnafm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForPreTraining","title":"RnaFmForPreTraining","text":" Bases: RnaFmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaFmConfig, RnaFmForPreTraining, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"contact_map\"].shape\ntorch.Size([1, 5, 5, 1])\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmForPreTraining(RnaFmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaFmConfig, RnaFmForPreTraining, RnaTokenizer\n >>> config = RnaFmConfig()\n >>> model = RnaFmForPreTraining(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<AddBackward0>)\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"contact_map\"].shape\n torch.Size([1, 5, 5, 1])\n \"\"\"\n\n _tied_weights_keys = [\n \"lm_head.decoder.weight\",\n \"lm_head.decoder.bias\",\n \"pretrain.predictions.decoder.weight\",\n \"pretrain.predictions.decoder.bias\",\n \"pretrain.predictions_ss.decoder.weight\",\n \"pretrain.predictions_ss.decoder.bias\",\n ]\n\n def __init__(self, config: RnaFmConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `RnaFmForPreTraining` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.rnafm = RnaFmModel(config, add_pooling_layer=False)\n self.pretrain = RnaFmPreTrainingHeads(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_output_embeddings(self):\n return self.pretrain.predictions.decoder\n\n def set_output_embeddings(self, embeddings):\n self.pretrain.predictions.decoder = embeddings\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels_mlm: Tensor | None = None,\n labels_contact: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaFmForPreTrainingOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnafm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n total_loss, logits, contact_map = self.pretrain(\n outputs, attention_mask, input_ids, labels_mlm=labels_mlm, labels_contact=labels_contact\n )\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((total_loss,) + output) if total_loss is not None else output\n\n return RnaFmForPreTrainingOutput(\n loss=total_loss,\n logits=logits,\n contact_map=contact_map,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForSequencePrediction","title":"RnaFmForSequencePrediction","text":" Bases: RnaFmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaFmConfig, RnaFmForSequencePrediction, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmForSequencePrediction(RnaFmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaFmConfig, RnaFmForSequencePrediction, RnaTokenizer\n >>> config = RnaFmConfig()\n >>> model = RnaFmForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaFmConfig):\n super().__init__(config)\n self.rnafm = RnaFmModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnafm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmForTokenPrediction","title":"RnaFmForTokenPrediction","text":" Bases: RnaFmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmForTokenPrediction(RnaFmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n >>> config = RnaFmConfig()\n >>> model = RnaFmForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaFmConfig):\n super().__init__(config)\n self.rnafm = RnaFmModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnafm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel","title":"RnaFmModel","text":" Bases: RnaFmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaFmConfig, RnaFmModel, RnaTokenizer\n>>> config = RnaFmConfig()\n>>> model = RnaFmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 640])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 640])\n
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmModel(RnaFmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaFmConfig, RnaFmModel, RnaTokenizer\n >>> config = RnaFmConfig()\n >>> model = RnaFmModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 640])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 640])\n \"\"\"\n\n def __init__(self, config: RnaFmConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = RnaFmEmbeddings(config)\n self.encoder = RnaFmEncoder(config)\n self.pooler = RnaFmPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name Type Description DefaultTensor | None
Shape: (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
None
Tensor | None
Shape: (batch_size, sequence_length)
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]
:
None
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None
Tuple of length config.n_layers
with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1)
instead of all decoder_input_ids
of shape (batch_size, sequence_length)
.
None
bool | None
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see past_key_values
).
None
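The cache-related arguments above only take effect when the model is configured as a decoder. A minimal sketch (not taken from the original examples; it assumes the decoder path behaves like the BERT-style models RNA-FM is adapted from):
Pythonfrom multimolecule import RnaFmConfig, RnaFmModel, RnaTokenizer\n\n# use_cache only takes effect when the model is configured as a decoder\nconfig = RnaFmConfig(is_decoder=True)\nmodel = RnaFmModel(config)\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\ninput = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n\noutput = model(**input, use_cache=True)\n# Cached key/value states, one entry per layer; feed them back through\n# past_key_values= on a later call to speed up decoding.\npast_key_values = output[\"past_key_values\"]\n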
Source code in multimolecule/models/rnafm/modeling_rnafm.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/rnafm/#multimolecule.models.rnafm.RnaFmPreTrainedModel","title":"RnaFmPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code inmultimolecule/models/rnafm/modeling_rnafm.py
Pythonclass RnaFmPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = RnaFmConfig\n base_model_prefix = \"rnafm\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"RnaFmLayer\", \"RnaFmEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
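In practice, pretrained weights are loaded through one of the concrete subclasses above, for example (assuming the multimolecule/rnafm checkpoint published on the Hugging Face Hub):
Pythonfrom multimolecule import RnaFmModel\n\n# Download the pre-trained weights and instantiate the encoder\nmodel = RnaFmModel.from_pretrained(\"multimolecule/rnafm\")\n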
"},{"location":"models/rnamsm/","title":"RNA-MSM","text":"Pre-trained model on non-coding RNA (ncRNA) with multi (homologous) sequence alignment using a masked language modeling (MLM) objective.
"},{"location":"models/rnamsm/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the Multiple sequence alignment-based RNA language model and its application to structural inference by Yikun Zhang, Mei Lang, Jiuhong Jiang, Zhiqiang Gao, et al.
The OFFICIAL repository of RNA-MSM is at yikunpku/RNA-MSM.
Caution
The MultiMolecule team is aware of a potential risk in reproducing the results of RNA-MSM.
The original implementation of RNA-MSM used a custom tokenizer that does not append <eos>
token to the end of the input sequence, consistent with MSA Transformer. This should not affect the performance of the model in most cases, but it can lead to unexpected behavior in some cases.
Please set eos_token=None
explicitly in the tokenizer if you want the exact behavior of the original implementation.
See more at issue #10
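For example, a minimal sketch of disabling the <eos> token (assuming the standard from_pretrained keyword override for special tokens):
Pythonfrom multimolecule import RnaTokenizer\n\n# Do not append <eos>, matching the original RNA-MSM tokenizer\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\", eos_token=None)\n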
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
The team releasing RNA-MSM did not write a model card for this model, so this model card has been written by the MultiMolecule team.
"},{"location":"models/rnamsm/#model-details","title":"Model Details","text":"RNA-MSM is a bert-style model pre-trained on a large corpus of non-coding RNA sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/rnamsm/#model-specification","title":"Model Specification","text":"Num Layers Hidden Size Num Heads Intermediate Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens 10 768 12 3072 95.92 21.66 10.57 1024"},{"location":"models/rnamsm/#links","title":"Links","text":"The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/rnamsm/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/rnamsm\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.25111356377601624,\n 'token': 9,\n 'token_str': 'U',\n 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.1200353354215622,\n 'token': 14,\n 'token_str': 'W',\n 'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.10132723301649094,\n 'token': 15,\n 'token_str': 'K',\n 'sequence': 'G G U C K C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.08383019268512726,\n 'token': 18,\n 'token_str': 'D',\n 'sequence': 'G G U C D C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.05737845227122307,\n 'token': 6,\n 'token_str': 'A',\n 'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/rnamsm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/rnamsm/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, RnaMsmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmModel.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/rnamsm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaMsmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmForSequencePrediction.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnamsm/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaMsmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmForTokenPrediction.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnamsm/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, RnaMsmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnamsm\")\nmodel = RnaMsmForContactPrediction.from_pretrained(\"multimolecule/rnamsm\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/rnamsm/#training-details","title":"Training Details","text":"RNA-MSM used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/rnamsm/#training-data","title":"Training Data","text":"The RNA-MSM model was pre-trained on Rfam. The Rfam database is a collection of RNA sequence families of structural RNAs including non-coding RNA genes as well as cis-regulatory elements. RNA-MSM used Rfam 14.7 which contains 4,069 RNA families.
To avoid potential overfitting in structural inference, RNA-MSM excluded families with experimentally determined structures, such as ribosomal RNAs, transfer RNAs, and small nuclear RNAs. The final dataset contains 3,932 RNA families. The median value for the number of MSA sequences for these families by RNAcmap3 is 2,184.
To increase the number of homologous sequences, RNA-MSM used an automatic pipeline, RNAcmap3, for homolog search and sequence alignment. RNAcmap3 is a pipeline that combines the BLAST-N, INFERNAL, Easel, RNAfold and evolutionary coupling tools to generate homologous sequences.
RNA-MSM preprocessed all tokens by replacing \u201cT\u201ds with \u201cU\u201ds and substituting \u201cR\u201d, \u201cY\u201d, \u201cK\u201d, \u201cM\u201d, \u201cS\u201d, \u201cW\u201d, \u201cB\u201d, \u201cD\u201d, \u201cH\u201d, \u201cV\u201d, \u201cN\u201d with \u201cX\u201d.
Note that RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you, you may disable this behaviour by passing replace_T_with_U=False
. RnaTokenizer
does not perform other substitutions.
RNA-MSM used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
<mask>
.The model was trained on 8 NVIDIA V100 GPUs with 32GiB memories.
BibTeX:
BibTeX@article{zhang2023multiple,\n author = {Zhang, Yikun and Lang, Mei and Jiang, Jiuhong and Gao, Zhiqiang and Xu, Fan and Litfin, Thomas and Chen, Ke and Singh, Jaswinder and Huang, Xiansong and Song, Guoli and Tian, Yonghong and Zhan, Jian and Chen, Jie and Zhou, Yaoqi},\n title = \"{Multiple sequence alignment-based RNA language model and its application to structural inference}\",\n journal = {Nucleic Acids Research},\n volume = {52},\n number = {1},\n pages = {e3-e3},\n year = {2023},\n month = {11},\n abstract = \"{Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because\u00a0unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, as it can provide significantly more homologous sequences than manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding with its embedding in base pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.}\",\n issn = {0305-1048},\n doi = {10.1093/nar/gkad1031},\n url = {https://doi.org/10.1093/nar/gkad1031},\n eprint = {https://academic.oup.com/nar/article-pdf/52/1/e3/55443207/gkad1031.pdf},\n}\n
"},{"location":"models/rnamsm/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the RNA-MSM paper for questions or comments on the paper/model.
"},{"location":"models/rnamsm/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm","title":"multimolecule.models.rnamsm","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig","title":"RnaMsmConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a RnaMsmModel
. It is used to instantiate a RnaMsm model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaMsm yikunpku/RNA-MSM architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint
Vocabulary size of the RnaMsm model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [RnaMsmModel
].
26
int
Dimensionality of the encoder layers and the pooler layer.
768
int
Number of hidden layers in the Transformer encoder.
10
int
Number of attention heads for each attention layer in the Transformer encoder.
12
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
3072
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
1024
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
Examples:
Python Console Session>>> from multimolecule import RnaMsmModel, RnaMsmConfig\n>>> # Initializing a RNA-MSM multimolecule/rnamsm style configuration\n>>> configuration = RnaMsmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/rnamsm style configuration\n>>> model = RnaMsmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/rnamsm/configuration_rnamsm.py
Pythonclass RnaMsmConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`RnaMsmModel`][multimolecule.models.RnaMsmModel].\n It is used to instantiate a RnaMsm model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the RnaMsm\n [yikunpku/RNA-MSM](https://github.com/yikunpku/RNA-MSM) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the RnaMsm model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`RnaMsmModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n\n Examples:\n >>> from multimolecule import RnaMsmModel, RnaMsmConfig\n >>> # Initializing a RNA-MSM multimolecule/rnamsm style configuration\n >>> configuration = RnaMsmConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/rnamsm style configuration\n >>> model = RnaMsmModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"rnamsm\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n hidden_size: int = 768,\n num_hidden_layers: int = 10,\n num_attention_heads: int = 12,\n intermediate_size: int = 3072,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1024,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"absolute\",\n is_decoder: bool = False,\n use_cache: bool = True,\n max_tokens_per_msa: int = 2**14,\n layer_type: str = \"standard\",\n attention_type: str = \"standard\",\n embed_positions_msa: bool = True,\n attention_bias: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = 
position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.max_tokens_per_msa = max_tokens_per_msa\n self.layer_type = layer_type\n self.attention_type = attention_type\n self.embed_positions_msa = embed_positions_msa\n self.attention_bias = attention_bias\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForContactPrediction","title":"RnaMsmForContactPrediction","text":" Bases: RnaMsmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaMsmConfig, RnaMsmForContactPrediction, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmForContactPrediction(RnaMsmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaMsmConfig, RnaMsmForContactPrediction, RnaTokenizer\n >>> config = RnaMsmConfig()\n >>> model = RnaMsmForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaMsmConfig):\n super().__init__(config)\n self.rnamsm = RnaMsmModel(config, add_pooling_layer=True)\n head_config = HeadConfig(output_name=\"row_attentions\")\n self.contact_head = ContactPredictionHead(config, head_config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaMsmContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnamsm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return RnaMsmContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n col_attentions=outputs.col_attentions,\n row_attentions=outputs.row_attentions,\n )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForMaskedLM","title":"RnaMsmForMaskedLM","text":" Bases: RnaMsmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaMsmConfig, RnaMsmForMaskedLM, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmForMaskedLM(RnaMsmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaMsmConfig, RnaMsmForMaskedLM, RnaTokenizer\n >>> config = RnaMsmConfig()\n >>> model = RnaMsmForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: RnaMsmConfig):\n super().__init__(config)\n self.rnamsm = RnaMsmModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config, weight=self.rnamsm.embeddings.word_embeddings.weight)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaMsmForMaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnamsm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return RnaMsmForMaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n col_attentions=outputs.col_attentions,\n row_attentions=outputs.row_attentions,\n )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForPreTraining","title":"RnaMsmForPreTraining","text":" Bases: RnaMsmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaMsmConfig, RnaMsmForPreTraining, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"contact_map\"].shape\ntorch.Size([1, 5, 5, 1])\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmForPreTraining(RnaMsmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaMsmConfig, RnaMsmForPreTraining, RnaTokenizer\n >>> config = RnaMsmConfig()\n >>> model = RnaMsmForPreTraining(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<AddBackward0>)\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"contact_map\"].shape\n torch.Size([1, 5, 5, 1])\n \"\"\"\n\n _tied_weights_keys = [\n \"lm_head.decoder.weight\",\n \"lm_head.decoder.bias\",\n \"pretrain.predictions.decoder.weight\",\n \"pretrain.predictions.decoder.bias\",\n \"pretrain.predictions_ss.decoder.weight\",\n \"pretrain.predictions_ss.decoder.bias\",\n ]\n\n def __init__(self, config: RnaMsmConfig):\n super().__init__(config)\n self.rnamsm = RnaMsmModel(config, add_pooling_layer=False)\n self.pretrain = RnaMsmPreTrainingHeads(config, weight=self.rnamsm.embeddings.word_embeddings.weight)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels_mlm: Tensor | None = None,\n labels_contact: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaMsmForPreTrainingOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnamsm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n total_loss, logits, contact_map = self.pretrain(\n outputs, attention_mask, input_ids, labels_mlm=labels_mlm, labels_contact=labels_contact\n )\n\n if not return_dict:\n output = (logits, contact_map) + outputs[2:]\n return ((total_loss,) + output) if total_loss is not None else output\n\n return RnaMsmForPreTrainingOutput(\n loss=total_loss,\n logits=logits,\n contact_map=contact_map,\n hidden_states=outputs.hidden_states,\n col_attentions=outputs.col_attentions,\n row_attentions=outputs.row_attentions,\n )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForSequencePrediction","title":"RnaMsmForSequencePrediction","text":" Bases: RnaMsmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaMsmConfig, RnaMsmForSequencePrediction, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmForSequencePrediction(RnaMsmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaMsmConfig, RnaMsmForSequencePrediction, RnaTokenizer\n >>> config = RnaMsmConfig()\n >>> model = RnaMsmForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaMsmConfig):\n super().__init__(config)\n self.rnamsm = RnaMsmModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool = False,\n output_hidden_states: bool = False,\n return_dict: bool = True,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaMsmSequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnamsm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return RnaMsmSequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n col_attentions=outputs.col_attentions,\n row_attentions=outputs.row_attentions,\n )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmForTokenPrediction","title":"RnaMsmForTokenPrediction","text":" Bases: RnaMsmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaMsmConfig, RnaMsmForTokenPrediction, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmForTokenPrediction(RnaMsmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaMsmConfig, RnaMsmForTokenPrediction, RnaTokenizer\n >>> config = RnaMsmConfig()\n >>> model = RnaMsmForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: RnaMsmConfig):\n super().__init__(config)\n self.rnamsm = RnaMsmModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool = False,\n output_hidden_states: bool = False,\n return_dict: bool = True,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaMsmTokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.rnamsm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return RnaMsmTokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n col_attentions=outputs.col_attentions,\n row_attentions=outputs.row_attentions,\n )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmModel","title":"RnaMsmModel","text":" Bases: RnaMsmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import RnaMsmConfig, RnaMsmModel, RnaTokenizer\n>>> config = RnaMsmConfig()\n>>> model = RnaMsmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmModel(RnaMsmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import RnaMsmConfig, RnaMsmModel, RnaTokenizer\n >>> config = RnaMsmConfig()\n >>> model = RnaMsmModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 768])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 768])\n \"\"\"\n\n def __init__(self, config: RnaMsmConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = RnaMsmEmbeddings(config)\n self.encoder = RnaMsmEncoder(config)\n self.pooler = RnaMsmPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | RnaMsmModelOutputWithPooling:\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n elif inputs_embeds is None:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id) if self.pad_token_id is not None else torch.ones_like(input_ids)\n )\n\n unsqueeze_input = input_ids.ndim == 2\n if unsqueeze_input:\n input_ids = input_ids.unsqueeze(1)\n if attention_mask.ndim == 2:\n attention_mask = attention_mask.unsqueeze(1)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n if unsqueeze_input:\n sequence_output = sequence_output.squeeze(1)\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return RnaMsmModelOutputWithPooling(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n hidden_states=encoder_outputs.hidden_states,\n col_attentions=encoder_outputs.col_attentions,\n row_attentions=encoder_outputs.row_attentions,\n )\n
"},{"location":"models/rnamsm/#multimolecule.models.rnamsm.RnaMsmPreTrainedModel","title":"RnaMsmPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/rnamsm/modeling_rnamsm.py
Pythonclass RnaMsmPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = RnaMsmConfig\n base_model_prefix = \"rnamsm\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"RnaMsmLayer\", \"RnaMsmAxialLayer\", \"RnaMsmPkmLayer\", \"RnaMsmEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm) and module.elementwise_affine:\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
"},{"location":"models/splicebert/","title":"SpliceBERT","text":"Pre-trained model on messenger RNA precursor (pre-mRNA) using a masked language modeling (MLM) objective.
"},{"location":"models/splicebert/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction by Ken Chen, et al.
The OFFICIAL repository of SpliceBERT is at chenkenbio/SpliceBERT.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing SpliceBERT did not write this model card, so it has been written by the MultiMolecule team.
"},{"location":"models/splicebert/#model-details","title":"Model Details","text":"SpliceBERT is a bert-style model pre-trained on a large corpus of messenger RNA precursor sequences in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/splicebert/#variations","title":"Variations","text":"multimolecule/splicebert
: The SpliceBERT model.multimolecule/splicebert.510
: The intermediate SpliceBERT model.multimolecule/splicebert-human.510
: The intermediate SpliceBERT model pre-trained on human data only.The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/splicebert/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/splicebert\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.340412974357605,\n 'token': 9,\n 'token_str': 'U',\n 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.13882005214691162,\n 'token': 12,\n 'token_str': 'Y',\n 'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.056610625237226486,\n 'token': 7,\n 'token_str': 'C',\n 'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.05455885827541351,\n 'token': 19,\n 'token_str': 'H',\n 'sequence': 'G G U C H C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.05356108024716377,\n 'token': 14,\n 'token_str': 'W',\n 'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/splicebert/#downstream-use","title":"Downstream Use","text":""},{"location":"models/splicebert/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, SpliceBertModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertModel.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/splicebert/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, SpliceBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForSequencePrediction.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
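The snippet above only runs a single forward pass with a label and returns a loss; to actually fine-tune, that loss is back-propagated and an optimizer updates the weights. Below is a minimal sketch of one optimisation step. The optimizer choice, learning rate, and toy label are illustrative assumptions rather than settings recommended by the SpliceBERT or MultiMolecule authors; a real run would iterate over a DataLoader of labelled sequences for several epochs.
Python
import torch
from multimolecule import RnaTokenizer, SpliceBertForSequencePrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
model = SpliceBertForSequencePrediction.from_pretrained("multimolecule/splicebert")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed hyperparameters

# One optimisation step on a single toy example.
model.train()
input = tokenizer("UAGCUUAUCAGACUGAUGUUGA", return_tensors="pt")
label = torch.tensor([1])

output = model(**input, labels=label)
output["loss"].backward()   # back-propagate the sequence-level loss
optimizer.step()
optimizer.zero_grad()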
"},{"location":"models/splicebert/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, SpliceBertForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForTokenPrediction.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/splicebert/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, SpliceBertForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/splicebert\")\nmodel = SpliceBertForContactPrediction.from_pretrained(\"multimolecule/splicebert\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/splicebert/#training-details","title":"Training Details","text":"SpliceBERT used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/splicebert/#training-data","title":"Training Data","text":"The SpliceBERT model was pre-trained on messenger RNA precursor sequences from UCSC Genome Browser. UCSC Genome Browser provides visualization, analysis, and download of comprehensive vertebrate genome data with aligned annotation tracks (known genes, predicted genes, ESTs, mRNAs, CpG islands, etc.).
SpliceBERT collected reference genomes and gene annotations from the UCSC Genome Browser for 72 vertebrate species. It applied bedtools getfasta to extract pre-mRNA sequences from the reference genomes based on the gene annotations. The pre-mRNA sequences are then used to pre-train SpliceBERT. The pre-training data contains 2 million pre-mRNA sequences with a total length of 65 billion nucleotides.
Note
RnaTokenizer will convert “T”s to “U”s for you; you may disable this behaviour by passing replace_T_with_U=False.
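A minimal sketch of what this note means in practice, using the plain RnaTokenizer constructor documented later on this page (the token ids involved are those shown in the RnaTokenizer examples below):
Python
from multimolecule import RnaTokenizer

# Default: "T" is converted to "U", so DNA- and RNA-style inputs tokenize identically.
tokenizer = RnaTokenizer()
assert tokenizer("acgt")["input_ids"] == tokenizer("acgu")["input_ids"]

# With replace_T_with_U=False, "T" is kept as-is and maps to a different id.
tokenizer = RnaTokenizer(replace_T_with_U=False)
assert tokenizer("acgt")["input_ids"] != tokenizer("acgu")["input_ids"]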
SpliceBERT used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT: 15% of the tokens are masked; in 80% of cases the masked token is replaced by <mask>, in 10% of cases it is replaced by a random token, and in the remaining 10% it is left unchanged.
The model was trained on 8 NVIDIA V100 GPUs.
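This is not the authors' original training code; as a hedged sketch, masked inputs and labels of this kind can be produced with the Hugging Face DataCollatorForLanguageModeling, assuming RnaTokenizer exposes the standard <pad> and <mask> special tokens (as its vocabulary above suggests):
Python
from multimolecule import RnaTokenizer
from transformers import DataCollatorForLanguageModeling

tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")

# Dynamically mask 15% of the tokens; labels are set to -100 everywhere else,
# so the loss is only computed on the masked positions.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encodings = [tokenizer("UAGCUUAUCAGACUGAUGUUGA"), tokenizer("ACGUNACGU")]
batch = collator(encodings)
print(batch["input_ids"])  # some tokens replaced by <mask> or random tokens
print(batch["labels"])     # original ids at masked positions, -100 elsewhere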
SpliceBERT trained the model in a two-stage training process. The intermediate model after the first stage is available as multimolecule/splicebert.510.
SpliceBERT also pre-trained a model on human data only to validate the contribution of multi-species pre-training. The intermediate model after the first stage is available as multimolecule/splicebert-human.510.
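The intermediate checkpoints can be loaded exactly like the main model. A minimal sketch, assuming the checkpoints are published under the names listed above and ship with a tokenizer configuration:
Python
from multimolecule import RnaTokenizer, SpliceBertModel

# Load the first-stage checkpoint instead of the final model;
# "multimolecule/splicebert-human.510" works the same way.
tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert.510")
model = SpliceBertModel.from_pretrained("multimolecule/splicebert.510")

output = model(**tokenizer("UAGCUUAUCAGACUGAUGUUGA", return_tensors="pt"))
print(output["last_hidden_state"].shape)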
BibTeX:
BibTeX@article {chen2023self,\n author = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},\n title = {Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction},\n elocation-id = {2023.01.31.526427},\n year = {2023},\n doi = {10.1101/2023.01.31.526427},\n publisher = {Cold Spring Harbor Laboratory},\n abstract = {RNA splicing is an important post-transcriptional process of gene expression in eukaryotic cells. Predicting RNA splicing from primary sequences can facilitate the interpretation of genomic variants. In this study, we developed a novel self-supervised pre-trained language model, SpliceBERT, to improve sequence-based RNA splicing prediction. Pre-training on pre-mRNA sequences from vertebrates enables SpliceBERT to capture evolutionary conservation information and characterize the unique property of splice sites. SpliceBERT also improves zero-shot prediction of variant effects on splicing by considering sequence context information, and achieves superior performance for predicting branchpoint in the human genome and splice sites across species. Our study highlighted the importance of pre-training genomic language models on a diverse range of species and suggested that pre-trained language models were promising for deciphering the sequence logic of RNA splicing.Competing Interest StatementThe authors have declared no competing interest.},\n URL = {https://www.biorxiv.org/content/early/2023/05/09/2023.01.31.526427},\n eprint = {https://www.biorxiv.org/content/early/2023/05/09/2023.01.31.526427.full.pdf},\n journal = {bioRxiv}\n}\n
"},{"location":"models/splicebert/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the SpliceBERT paper for questions or comments on the paper/model.
"},{"location":"models/splicebert/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert","title":"multimolecule.models.splicebert","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description Default
alphabet (Alphabet | str | List[str] | None): Alphabet to use for tokenization. If None, the standard RNA alphabet will be used. If a string, it should correspond to the name of a predefined alphabet; the options include standard, extended, streamline, and nucleobase. If an alphabet or a list of characters, that specific alphabet will be used. Default: None
nmers (int): Size of kmer to tokenize. Default: 1
codon (bool): Whether to tokenize into codons. Default: False
replace_T_with_U (bool): Whether to replace T with U. Default: True
do_upper_case (bool): Whether to convert input to uppercase. Default: True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig","title":"SpliceBertConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a SpliceBertModel. It is used to instantiate a SpliceBert model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SpliceBert biomed-AI/SpliceBERT architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Parameters:
Name Type Description Defaultint
Vocabulary size of the SpliceBert model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [SpliceBertModel
].
26
int
Dimensionality of the encoder layers and the pooler layer.
512
int
Number of hidden layers in the Transformer encoder.
6
int
Number of attention heads for each attention layer in the Transformer encoder.
16
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
2048
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
1026
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
Examples:
Python Console Session>>> from multimolecule import SpliceBertModel, SpliceBertConfig\n>>> # Initializing a SpliceBERT multimolecule/splicebert style configuration\n>>> configuration = SpliceBertConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/splicebert style configuration\n>>> model = SpliceBertModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/splicebert/configuration_splicebert.py
Pythonclass SpliceBertConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a\n [`SpliceBertModel`][multimolecule.models.SpliceBertModel]. It is used to instantiate a SpliceBert model according\n to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will\n yield a similar configuration to that of the SpliceBert\n [biomed-AI/SpliceBERT](https://github.com/biomed-AI/SpliceBERT) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the SpliceBert model. Defines the number of different tokens that can be represented by\n the `inputs_ids` passed when calling [`SpliceBertModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n\n Examples:\n >>> from multimolecule import SpliceBertModel, SpliceBertConfig\n >>> # Initializing a SpliceBERT multimolecule/splicebert style configuration\n >>> configuration = SpliceBertConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/splicebert style configuration\n >>> model = SpliceBertModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"splicebert\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n hidden_size: int = 512,\n num_hidden_layers: int = 6,\n num_attention_heads: int = 16,\n intermediate_size: int = 2048,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1026,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"absolute\",\n is_decoder: bool = False,\n use_cache: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.type_vocab_size = 2\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = 
use_cache\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForContactPrediction","title":"SpliceBertForContactPrediction","text":" Bases: SpliceBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import SpliceBertConfig, SpliceBertForContactPrediction, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythonclass SpliceBertForContactPrediction(SpliceBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import SpliceBertConfig, SpliceBertForContactPrediction, RnaTokenizer\n >>> config = SpliceBertConfig()\n >>> model = SpliceBertForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: SpliceBertConfig):\n super().__init__(config)\n self.splicebert = SpliceBertModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.splicebert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForMaskedLM","title":"SpliceBertForMaskedLM","text":" Bases: SpliceBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import SpliceBertConfig, SpliceBertForMaskedLM, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythonclass SpliceBertForMaskedLM(SpliceBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import SpliceBertConfig, SpliceBertForMaskedLM, RnaTokenizer\n >>> config = SpliceBertConfig()\n >>> model = SpliceBertForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.bias\", \"lm_head.decoder.weight\"]\n\n def __init__(self, config: SpliceBertConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `SpliceBertForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.splicebert = SpliceBertModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_output_embeddings(self):\n return self.lm_head.decoder\n\n def set_output_embeddings(self, new_embeddings):\n self.lm_head.decoder = new_embeddings\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.splicebert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForSequencePrediction","title":"SpliceBertForSequencePrediction","text":" Bases: SpliceBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import SpliceBertConfig, SpliceBertForSequencePrediction, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythonclass SpliceBertForSequencePrediction(SpliceBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import SpliceBertConfig, SpliceBertForSequencePrediction, RnaTokenizer\n >>> config = SpliceBertConfig()\n >>> model = SpliceBertForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: SpliceBertConfig):\n super().__init__(config)\n self.splicebert = SpliceBertModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.splicebert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertForTokenPrediction","title":"SpliceBertForTokenPrediction","text":" Bases: SpliceBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import SpliceBertConfig, SpliceBertForTokenPrediction, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythonclass SpliceBertForTokenPrediction(SpliceBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import SpliceBertConfig, SpliceBertForTokenPrediction, RnaTokenizer\n >>> config = SpliceBertConfig()\n >>> model = SpliceBertForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: SpliceBertConfig):\n super().__init__(config)\n self.splicebert = SpliceBertModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.splicebert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel","title":"SpliceBertModel","text":" Bases: SpliceBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import SpliceBertConfig, SpliceBertModel, RnaTokenizer\n>>> config = SpliceBertConfig()\n>>> model = SpliceBertModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 512])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 512])\n
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythonclass SpliceBertModel(SpliceBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import SpliceBertConfig, SpliceBertModel, RnaTokenizer\n >>> config = SpliceBertConfig()\n >>> model = SpliceBertModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 512])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 512])\n \"\"\"\n\n def __init__(self, config: SpliceBertConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = SpliceBertEmbeddings(config)\n self.encoder = SpliceBertEncoder(config)\n self.pooler = SpliceBertPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name Type Description Default
Tensor | None
Shape: (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
None
Tensor | None
Shape: (batch_size, sequence_length)
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
None
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None
Tuple of length config.n_layers
with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1)
instead of all decoder_input_ids
of shape (batch_size, sequence_length)
.
None
bool | None
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see past_key_values
).
None
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/splicebert/#multimolecule.models.splicebert.SpliceBertPreTrainedModel","title":"SpliceBertPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code in multimolecule/models/splicebert/modeling_splicebert.py
Pythonclass SpliceBertPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = SpliceBertConfig\n base_model_prefix = \"splicebert\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"SpliceBertLayer\", \"SpliceBertEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n\n def _set_gradient_checkpointing(self, module, value=False):\n if isinstance(module, SpliceBertEncoder):\n module.gradient_checkpointing = value\n
"},{"location":"models/utrbert/","title":"3UTRBERT","text":"Pre-trained model on 3\u2019 untranslated region (3\u2019UTR) using a masked language modeling (MLM) objective.
"},{"location":"models/utrbert/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the Deciphering 3\u2019 UTR mediated gene regulation using interpretable deep representation learning by Yuning Yang, Gen Li, et al.
The OFFICIAL repository of 3UTRBERT is at yangyn533/3UTRBERT.
Tip
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing 3UTRBERT did not write a model card for this model, so this model card has been written by the MultiMolecule team.
"},{"location":"models/utrbert/#model-details","title":"Model Details","text":"3UTRBERT is a bert-style model pre-trained on a large corpus of 3\u2019 untranslated regions (3\u2019UTRs) in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/utrbert/#variations","title":"Variations","text":"multimolecule/utrbert-3mer
: The 3UTRBERT model pre-trained on 3-mer data.multimolecule/utrbert-4mer
: The 3UTRBERT model pre-trained on 4-mer data.multimolecule/utrbert-5mer
: The 3UTRBERT model pre-trained on 5-mer data.multimolecule/utrbert-6mer
: The 3UTRBERT model pre-trained on 6-mer data.The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/utrbert/#direct-use","title":"Direct Use","text":"Note: Default transformers pipeline does not support K-mer tokenization.
You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/utrbert-3mer\")\n>>> unmasker(\"gguc<mask><mask><mask>cugguuagaccagaucugagccu\")[1]\n\n[{'score': 0.40745577216148376,\n 'token': 47,\n 'token_str': 'CUC',\n 'sequence': '<cls> GGU GUC <mask> CUC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.40001827478408813,\n 'token': 32,\n 'token_str': 'CAC',\n 'sequence': '<cls> GGU GUC <mask> CAC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.14566268026828766,\n 'token': 37,\n 'token_str': 'CCC',\n 'sequence': '<cls> GGU GUC <mask> CCC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.04422207176685333,\n 'token': 42,\n 'token_str': 'CGC',\n 'sequence': '<cls> GGU GUC <mask> CGC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},\n {'score': 0.0008025980787351727,\n 'token': 34,\n 'token_str': 'CAU',\n 'sequence': '<cls> GGU GUC <mask> CAU <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'}]\n
"},{"location":"models/utrbert/#downstream-use","title":"Downstream Use","text":""},{"location":"models/utrbert/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, UtrBertModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertModel.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/utrbert/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrBertForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertForSequencePrediction.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrbert/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrBertForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertForTokenPrediction.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrbert/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrBertForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrbert-3mer\")\nmodel = UtrBertForContactPrediction.from_pretrained(\"multimolecule/utrbert-3mer\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrbert/#training-details","title":"Training Details","text":"3UTRBERT used Masked Language Modeling (MLM) as the pre-training objective: taking a sequence, the model randomly masks 15% of the tokens in the input then runs the entire masked sentence through the model and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
"},{"location":"models/utrbert/#training-data","title":"Training Data","text":"The 3UTRBERT model was pre-trained on human mRNA transcript sequences from GENCODE. GENCODE aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. The GENCODE release 40 used by this work contains 61,544 genes, and 246,624 transcripts.
3UTRBERT collected the human mRNA transcript sequences from GENCODE, including 108,573 unique mRNA transcripts. Only the longest transcript of each gene was used in the pre-training process. 3UTRBERT used only the 3\u2019 untranslated regions (3\u2019UTRs) of the mRNA transcripts for pre-training, to avoid codon constraints in the CDS region and to reduce the complexity of modeling entire mRNA transcripts. The average length of the 3\u2019UTRs was 1,227 nucleotides, while the median length was 631 nucleotides. Each 3\u2019UTR sequence was cut into non-overlapping patches of 510 nucleotides. The remaining sequences were padded to the same length.
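As an illustration of this preprocessing step, the helper below (hypothetical, not part of the 3UTRBERT code base) splits a 3\u2019UTR into non-overlapping 510-nucleotide patches; the shorter final patch is then brought to a common length by padding at tokenization time:
Pythondef chunk_utr(sequence: str, patch_size: int = 510) -> list:\n    # Split a 3'UTR into non-overlapping patches of at most `patch_size` nucleotides.\n    return [sequence[i : i + patch_size] for i in range(0, len(sequence), patch_size)]\n\n\npatches = chunk_utr(\"ACGU\" * 300)  # 1200 nt -> patches of 510, 510 and 180 nt\n# The final 180 nt patch is padded at tokenization time, e.g.\n# tokenizer(patches, padding=\"max_length\", max_length=512, return_tensors=\"pt\").\n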
Note RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you; you may disable this behaviour by passing replace_T_with_U=False
.
3UTRBERT used masked language modeling (MLM) as the pre-training objective. The masking procedure is similar to the one used in BERT:
<mask>
. Since 3UTRBERT uses a k-mer tokenizer, it masks the entire k-mer instead of individual nucleotides to avoid information leakage.
For example, if the k-mer is 3, the sequence \"UAGCGUAU\"
will be tokenized as [\"UAG\", \"AGC\", \"GCG\", \"CGU\", \"GUA\", \"UAU\"]
. If the nucleotide \"C\"
is masked, the adjacent tokens will also be masked, resulting in [\"UAG\", \"<mask>\", \"<mask>\", \"<mask>\", \"GUA\", \"UAU\"]
.
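The snippet below re-implements this behaviour as a toy example (it is not the data collator used by 3UTRBERT): it builds overlapping 3-mers and replaces every k-mer that covers the masked nucleotide with <mask>:
Pythondef kmerize(seq: str, k: int = 3) -> list:\n    # Overlapping k-mers: one token per starting position.\n    return [seq[i : i + k] for i in range(len(seq) - k + 1)]\n\n\ndef mask_nucleotide(tokens: list, pos: int, k: int = 3) -> list:\n    # k-mer i covers nucleotides [i, i + k); mask every k-mer that covers `pos`.\n    return [\"<mask>\" if i <= pos < i + k else token for i, token in enumerate(tokens)]\n\n\ntokens = kmerize(\"UAGCGUAU\")  # ['UAG', 'AGC', 'GCG', 'CGU', 'GUA', 'UAU']\nmasked = mask_nucleotide(tokens, pos=3)  # the 'C' at position 3\n# ['UAG', '<mask>', '<mask>', '<mask>', 'GUA', 'UAU']\n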
The model was trained on 4 NVIDIA Quadro RTX 6000 GPUs with 24 GiB of memory each.
BibTeX:
BibTeX@article {yang2023deciphering,\n author = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Li, Xiangtao and Zhang, Zhaolei},\n title = {Deciphering 3{\\textquoteright} UTR mediated gene regulation using interpretable deep representation learning},\n elocation-id = {2023.09.08.556883},\n year = {2023},\n doi = {10.1101/2023.09.08.556883},\n publisher = {Cold Spring Harbor Laboratory},\n abstract = {The 3{\\textquoteright}untranslated regions (3{\\textquoteright}UTRs) of messenger RNAs contain many important cis-regulatory elements that are under functional and evolutionary constraints. We hypothesize that these constraints are similar to grammars and syntaxes in human languages and can be modeled by advanced natural language models such as Transformers, which has been very effective in modeling protein sequence and structures. Here we describe 3UTRBERT, which implements an attention-based language model, i.e., Bidirectional Encoder Representations from Transformers (BERT). 3UTRBERT was pre-trained on aggregated 3{\\textquoteright}UTR sequences of human mRNAs in a task-agnostic manner; the pre-trained model was then fine-tuned for specific downstream tasks such as predicting RBP binding sites, m6A RNA modification sites, and predicting RNA sub-cellular localizations. Benchmark results showed that 3UTRBERT generally outperformed other contemporary methods in each of these tasks. We also showed that the self-attention mechanism within 3UTRBERT allows direct visualization of the semantic relationship between sequence elements.Competing Interest StatementThe authors have declared no competing interest.},\n URL = {https://www.biorxiv.org/content/early/2023/09/12/2023.09.08.556883},\n eprint = {https://www.biorxiv.org/content/early/2023/09/12/2023.09.08.556883.full.pdf},\n journal = {bioRxiv}\n}\n
"},{"location":"models/utrbert/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the 3UTRBERT paper for questions or comments on the paper/model.
"},{"location":"models/utrbert/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert","title":"multimolecule.models.utrbert","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig","title":"UtrBertConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a UtrBertModel
. It is used to instantiate a 3UTRBERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the 3UTRBERT yangyn533/3UTRBERT architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint | None
Vocabulary size of the UTRBERT model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [BertModel
].
None
int | None
kmer size of the UTRBERT model. Defines the vocabulary size of the model.
None
int
Dimensionality of the encoder layers and the pooler layer.
768
int
Number of hidden layers in the Transformer encoder.
12
int
Number of attention heads for each attention layer in the Transformer encoder.
12
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
3072
str
The non-linear activation function (function or string) in the encoder and pooler. If string, \"gelu\"
, \"relu\"
, \"silu\"
and \"gelu_new\"
are supported.
'gelu'
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
512
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
str
Type of position embedding. Choose one of \"absolute\"
, \"relative_key\"
, \"relative_key_query\"
. For positional embeddings use \"absolute\"
. For more information on \"relative_key\"
, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\"
, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
'absolute'
bool
Whether the model is used as a decoder or not. If False
, the model is used as an encoder.
False
bool
Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True
.
True
Examples:
Python Console Session>>> from multimolecule import UtrBertConfig, UtrBertModel\n>>> # Initializing a UtrBERT multimolecule/utrbert style configuration\n>>> configuration = UtrBertConfig(vocab_size=26, nmers=1)\n>>> # Initializing a model (with random weights) from the multimolecule/utrbert style configuration\n>>> model = UtrBertModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/utrbert/configuration_utrbert.py
Pythonclass UtrBertConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`UtrBertModel`][multimolecule.models.UtrBertModel].\n It is used to instantiate a 3UTRBERT model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the 3UTRBERT\n [yangyn533/3UTRBERT](https://github.com/yangyn533/3UTRBERT) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the UTRBERT model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`BertModel`].\n nmers:\n kmer size of the UTRBERT model. Defines the vocabulary size of the model.\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_act:\n The non-linear activation function (function or string) in the encoder and pooler. If string, `\"gelu\"`,\n `\"relu\"`, `\"silu\"` and `\"gelu_new\"` are supported.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n position_embedding_type:\n Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\"`. For\n positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n is_decoder:\n Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n use_cache:\n Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n relevant if `config.is_decoder=True`.\n\n Examples:\n >>> from multimolecule import UtrBertConfig, UtrBertModel\n >>> # Initializing a UtrBERT multimolecule/utrbert style configuration\n >>> configuration = UtrBertConfig(vocab_size=26, nmers=1)\n >>> # Initializing a model (with random weights) from the multimolecule/utrbert style configuration\n >>> model = UtrBertModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"utrbert\"\n\n def __init__(\n self,\n vocab_size: int | None = None,\n nmers: int | None = None,\n hidden_size: int = 768,\n num_hidden_layers: int = 12,\n num_attention_heads: int = 12,\n intermediate_size: int = 3072,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 512,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"absolute\",\n is_decoder: bool = False,\n use_cache: bool = True,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.type_vocab_size = 2\n self.nmers = nmers\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.hidden_act = hidden_act\n self.intermediate_size = intermediate_size\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(nmers)","title":"nmers
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(hidden_act)","title":"hidden_act
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(position_embedding_type)","title":"position_embedding_type
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(is_decoder)","title":"is_decoder
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertConfig(use_cache)","title":"use_cache
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForContactPrediction","title":"UtrBertForContactPrediction","text":" Bases: UtrBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrBertConfig, UtrBertForContactPrediction, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=1)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertForContactPrediction(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythonclass UtrBertForContactPrediction(UtrBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrBertConfig, UtrBertForContactPrediction, RnaTokenizer\n >>> tokenizer = RnaTokenizer(nmers=1)\n >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n >>> model = UtrBertForContactPrediction(config)\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: UtrBertConfig):\n super().__init__(config)\n self.utrbert = UtrBertModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrbert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForMaskedLM","title":"UtrBertForMaskedLM","text":" Bases: UtrBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrBertConfig, UtrBertForMaskedLM, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=2)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertForMaskedLM(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 6, 31])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythonclass UtrBertForMaskedLM(UtrBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrBertConfig, UtrBertForMaskedLM, RnaTokenizer\n >>> tokenizer = RnaTokenizer(nmers=2)\n >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n >>> model = UtrBertForMaskedLM(config)\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 6, 31])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: UtrBertConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `BertForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.utrbert = UtrBertModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_output_embeddings(self):\n return self.lm_head.decoder\n\n def set_output_embeddings(self, new_embeddings):\n self.lm_head.decoder = new_embeddings\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrbert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForSequencePrediction","title":"UtrBertForSequencePrediction","text":" Bases: UtrBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrBertConfig, UtrBertForSequencePrediction, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=4)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertForSequencePrediction(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythonclass UtrBertForSequencePrediction(UtrBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrBertConfig, UtrBertForSequencePrediction, RnaTokenizer\n >>> tokenizer = RnaTokenizer(nmers=4)\n >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n >>> model = UtrBertForSequencePrediction(config)\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: UtrBertConfig):\n super().__init__(config)\n self.utrbert = UtrBertModel(config)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrbert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertForTokenPrediction","title":"UtrBertForTokenPrediction","text":" Bases: UtrBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrBertConfig, UtrBertForTokenPrediction, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=2)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size, nmers=2)\n>>> model = UtrBertForTokenPrediction(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythonclass UtrBertForTokenPrediction(UtrBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrBertConfig, UtrBertForTokenPrediction, RnaTokenizer\n >>> tokenizer = RnaTokenizer(nmers=2)\n >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size, nmers=2)\n >>> model = UtrBertForTokenPrediction(config)\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: UtrBertConfig):\n super().__init__(config)\n self.num_labels = config.num_labels\n self.utrbert = UtrBertModel(config, add_pooling_layer=False)\n self.token_head = TokenKMerHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrbert(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel","title":"UtrBertModel","text":" Bases: UtrBertPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrBertConfig, UtrBertModel, RnaTokenizer\n>>> tokenizer = RnaTokenizer(nmers=1)\n>>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n>>> model = UtrBertModel(config)\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 768])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 768])\n
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythonclass UtrBertModel(UtrBertPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrBertConfig, UtrBertModel, RnaTokenizer\n >>> tokenizer = RnaTokenizer(nmers=1)\n >>> config = UtrBertConfig(vocab_size=tokenizer.vocab_size)\n >>> model = UtrBertModel(config)\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 768])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 768])\n \"\"\"\n\n def __init__(self, config: UtrBertConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = UtrBertEmbeddings(config)\n self.encoder = UtrBertEncoder(config)\n self.pooler = UtrBertPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name Type Description DefaultTensor | None
Shape: (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
None
Tensor | None
Shape: (batch_size, sequence_length)
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]
:
None
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None
Tuple of length config.n_layers
with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1)
instead of all decoder_input_ids
of shape (batch_size, sequence_length)
.
None
bool | None
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see past_key_values
).
None
Source code in multimolecule/models/utrbert/modeling_utrbert.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/utrbert/#multimolecule.models.utrbert.UtrBertPreTrainedModel","title":"UtrBertPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code inmultimolecule/models/utrbert/modeling_utrbert.py
Pythonclass UtrBertPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = UtrBertConfig\n base_model_prefix = \"utrbert\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"UtrBertLayer\", \"UtrBertEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
"},{"location":"models/utrlm/","title":"UTR-LM","text":"Pre-trained model on 5\u2019 untranslated region (5\u2019UTR) using masked language modeling (MLM), Secondary Structure (SS), and Minimum Free Energy (MFE) objectives.
"},{"location":"models/utrlm/#statement","title":"Statement","text":"A 5\u2019 UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions is published in Nature Machine Intelligence, which is a Closed Access / Author-Fee journal.
Machine learning has been at the forefront of the movement for free and open access to research.
We see no role for closed access or author-fee publication in the future of machine learning research and believe the adoption of these journals as an outlet of record for the machine learning community would be a retrograde step.
The MultiMolecule team is committed to the principles of open access and open science.
We do NOT endorse the publication of manuscripts in Closed Access / Author-Fee journals and encourage the community to support Open Access journals and conferences.
Please consider signing the Statement on Nature Machine Intelligence.
"},{"location":"models/utrlm/#disclaimer","title":"Disclaimer","text":"This is an UNOFFICIAL implementation of the A 5\u2019 UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions by Yanyi Chu, Dan Yu, et al.
The OFFICIAL repository of UTR-LM is at a96123155/UTR-LM.
Warning
The MultiMolecule team is unable to confirm that the provided model and checkpoints produce the same intermediate representations as the original implementation, because the proposed method is published in a Closed Access / Author-Fee journal.
The team releasing UTR-LM did not write a model card for this model, so this model card has been written by the MultiMolecule team.
"},{"location":"models/utrlm/#model-details","title":"Model Details","text":"UTR-LM is a bert-style model pre-trained on a large corpus of 5\u2019 untranslated regions (5\u2019UTRs) in a self-supervised fashion. This means that the model was trained on the raw nucleotides of RNA sequences only, with an automatic process to generate inputs and labels from those texts. Please refer to the Training Details section for more information on the training process.
"},{"location":"models/utrlm/#variations","title":"Variations","text":"multimolecule/utrlm-te_el
: The UTR-LM model for Translation Efficiency of transcripts and mRNA Expression Level.multimolecule/utrlm-mrl
: The UTR-LM model for Mean Ribosome Loading.The model file depends on the multimolecule
library. You can install it using pip:
pip install multimolecule\n
"},{"location":"models/utrlm/#direct-use","title":"Direct Use","text":"You can use this model directly with a pipeline for masked language modeling:
Python>>> import multimolecule # you must import multimolecule to register models\n>>> from transformers import pipeline\n>>> unmasker = pipeline(\"fill-mask\", model=\"multimolecule/utrlm-te_el\")\n>>> unmasker(\"gguc<mask>cucugguuagaccagaucugagccu\")\n\n[{'score': 0.07707168161869049,\n 'token': 23,\n 'token_str': '*',\n 'sequence': 'G G U C * C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07588472962379456,\n 'token': 5,\n 'token_str': '<null>',\n 'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.07178673148155212,\n 'token': 9,\n 'token_str': 'U',\n 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.06414645165205002,\n 'token': 10,\n 'token_str': 'N',\n 'sequence': 'G G U C N C U C U G G U U A G A C C A G A U C U G A G C C U'},\n {'score': 0.06385370343923569,\n 'token': 12,\n 'token_str': 'Y',\n 'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'}]\n
"},{"location":"models/utrlm/#downstream-use","title":"Downstream Use","text":""},{"location":"models/utrlm/#extract-features","title":"Extract Features","text":"Here is how to use this model to get the features of a given sequence in PyTorch:
Pythonfrom multimolecule import RnaTokenizer, UtrLmModel\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmModel.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\n\noutput = model(**input)\n
"},{"location":"models/utrlm/#sequence-classification-regression","title":"Sequence Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as backbone to fine-tune for a sequence-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrLmForSequencePrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForSequencePrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.tensor([1])\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrlm/#token-classification-regression","title":"Token Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrLmForTokenPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForTokenPrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), ))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrlm/#contact-classification-regression","title":"Contact Classification / Regression","text":"Note: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as backbone to fine-tune for a contact-level task in PyTorch:
Pythonimport torch\nfrom multimolecule import RnaTokenizer, UtrLmForContactPrediction\n\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/utrlm-te_el\")\nmodel = UtrLmForContactPrediction.from_pretrained(\"multimolecule/utrlm-te_el\")\n\ntext = \"UAGCUUAUCAGACUGAUGUUGA\"\ninput = tokenizer(text, return_tensors=\"pt\")\nlabel = torch.randint(2, (len(text), len(text)))\n\noutput = model(**input, labels=label)\n
"},{"location":"models/utrlm/#training-details","title":"Training Details","text":"UTR-LM used a mixed training strategy with one self-supervised task and two supervised tasks, where the labels of both supervised tasks are calculated using ViennaRNA.
<mask>
token in the MLM task.The UTR-LM model was pre-trained on 5\u2019 UTR sequences from three sources:
UTR-LM preprocessed the 5\u2019 UTR sequences in a 4-step pipeline:
Note RnaTokenizer
will convert \u201cT\u201ds to \u201cU\u201ds for you, you may disable this behaviour by passing replace_T_with_U=False
.
UTR-LM used masked language modeling (MLM) as one of the pre-training objectives. The masking procedure is similar to the one used in BERT:
<mask>
.The model was trained on two clusters:
BibTeX:
BibTeX@article {chu2023a,\n author = {Chu, Yanyi and Yu, Dan and Li, Yupeng and Huang, Kaixuan and Shen, Yue and Cong, Le and Zhang, Jason and Wang, Mengdi},\n title = {A 5{\\textquoteright} UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions},\n elocation-id = {2023.10.11.561938},\n year = {2023},\n doi = {10.1101/2023.10.11.561938},\n publisher = {Cold Spring Harbor Laboratory},\n abstract = {The 5{\\textquoteright} UTR, a regulatory region at the beginning of an mRNA molecule, plays a crucial role in regulating the translation process and impacts the protein expression level. Language models have showcased their effectiveness in decoding the functions of protein and genome sequences. Here, we introduced a language model for 5{\\textquoteright} UTR, which we refer to as the UTR-LM. The UTR-LM is pre-trained on endogenous 5{\\textquoteright} UTRs from multiple species and is further augmented with supervised information including secondary structure and minimum free energy. We fine-tuned the UTR-LM in a variety of downstream tasks. The model outperformed the best-known benchmark by up to 42\\% for predicting the Mean Ribosome Loading, and by up to 60\\% for predicting the Translation Efficiency and the mRNA Expression Level. The model also applies to identifying unannotated Internal Ribosome Entry Sites within the untranslated region and improves the AUPR from 0.37 to 0.52 compared to the best baseline. Further, we designed a library of 211 novel 5{\\textquoteright} UTRs with high predicted values of translation efficiency and evaluated them via a wet-lab assay. Experiment results confirmed that our top designs achieved a 32.5\\% increase in protein production level relative to well-established 5{\\textquoteright} UTR optimized for therapeutics.Competing Interest StatementThe authors have declared no competing interest.},\n URL = {https://www.biorxiv.org/content/early/2023/10/14/2023.10.11.561938},\n eprint = {https://www.biorxiv.org/content/early/2023/10/14/2023.10.11.561938.full.pdf},\n journal = {bioRxiv}\n}\n
"},{"location":"models/utrlm/#contact","title":"Contact","text":"Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the UTR-LM paper for questions or comments on the paper/model.
"},{"location":"models/utrlm/#license","title":"License","text":"This model is licensed under the AGPL-3.0 License.
Text OnlySPDX-License-Identifier: AGPL-3.0-or-later\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm","title":"multimolecule.models.utrlm","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer","title":"RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig","title":"UtrLmConfig","text":" Bases: PreTrainedConfig
This is the configuration class to store the configuration of a UtrLmModel
. It is used to instantiate a UTR-LM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the UTR-LM a96123155/UTR-LM architecture.
Configuration objects inherit from PreTrainedConfig
and can be used to control the model outputs. Read the documentation from PreTrainedConfig
for more information.
Parameters:
Name Type Description Defaultint
Vocabulary size of the UTR-LM model. Defines the number of different tokens that can be represented by the inputs_ids
passed when calling [UtrLmModel
].
26
int
Dimensionality of the encoder layers and the pooler layer.
128
int
Number of hidden layers in the Transformer encoder.
6
int
Number of attention heads for each attention layer in the Transformer encoder.
16
int
Dimensionality of the \u201cintermediate\u201d (often named feed-forward) layer in the Transformer encoder.
512
float
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
0.1
float
The dropout ratio for the attention probabilities.
0.1
int
The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
1026
float
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
0.02
float
The epsilon used by the layer normalization layers.
1e-12
str
Type of position embedding. Choose one of \"absolute\"
, \"relative_key\"
, \"relative_key_query\", \"rotary\"
. For positional embeddings use \"absolute\"
. For more information on \"relative_key\"
, please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on \"relative_key_query\"
, please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
'rotary'
bool
Whether the model is used as a decoder or not. If False
, the model is used as an encoder.
False
bool
Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True
.
True
bool
Whether to apply layer normalization after embeddings but before the main stem of the network.
False
bool
When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.
False
Examples:
Python Console Session>>> from multimolecule import UtrLmModel, UtrLmConfig\n>>> # Initializing a UTR-LM multimolecule/utrlm style configuration\n>>> configuration = UtrLmConfig()\n>>> # Initializing a model (with random weights) from the multimolecule/utrlm style configuration\n>>> model = UtrLmModel(configuration)\n>>> # Accessing the model configuration\n>>> configuration = model.config\n
Source code in multimolecule/models/utrlm/configuration_utrlm.py
Pythonclass UtrLmConfig(PreTrainedConfig):\n r\"\"\"\n This is the configuration class to store the configuration of a [`UtrLmModel`][multimolecule.models.UtrLmModel].\n It is used to instantiate a UTR-LM model according to the specified arguments, defining the model architecture.\n Instantiating a configuration with the defaults will yield a similar configuration to that of the UTR-LM\n [a96123155/UTR-LM](https://github.com/a96123155/UTR-LM) architecture.\n\n Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to\n control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]\n for more information.\n\n Args:\n vocab_size:\n Vocabulary size of the UTR-LM model. Defines the number of different tokens that can be represented by the\n `inputs_ids` passed when calling [`UtrLmModel`].\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n num_hidden_layers:\n Number of hidden layers in the Transformer encoder.\n num_attention_heads:\n Number of attention heads for each attention layer in the Transformer encoder.\n intermediate_size:\n Dimensionality of the \"intermediate\" (often named feed-forward) layer in the Transformer encoder.\n hidden_dropout:\n The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.\n attention_dropout:\n The dropout ratio for the attention probabilities.\n max_position_embeddings:\n The maximum sequence length that this model might ever be used with. Typically set this to something large\n just in case (e.g., 512 or 1024 or 2048).\n initializer_range:\n The standard deviation of the truncated_normal_initializer for initializing all weight matrices.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n position_embedding_type:\n Type of position embedding. Choose one of `\"absolute\"`, `\"relative_key\"`, `\"relative_key_query\", \"rotary\"`.\n For positional embeddings use `\"absolute\"`. For more information on `\"relative_key\"`, please refer to\n [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).\n For more information on `\"relative_key_query\"`, please refer to *Method 4* in [Improve Transformer Models\n with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).\n is_decoder:\n Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.\n use_cache:\n Whether or not the model should return the last key/values attentions (not used by all models). 
Only\n relevant if `config.is_decoder=True`.\n emb_layer_norm_before:\n Whether to apply layer normalization after embeddings but before the main stem of the network.\n token_dropout:\n When this is enabled, masked tokens are treated as if they had been dropped out by input dropout.\n\n Examples:\n >>> from multimolecule import UtrLmModel, UtrLmConfig\n >>> # Initializing a UTR-LM multimolecule/utrlm style configuration\n >>> configuration = UtrLmConfig()\n >>> # Initializing a model (with random weights) from the multimolecule/utrlm style configuration\n >>> model = UtrLmModel(configuration)\n >>> # Accessing the model configuration\n >>> configuration = model.config\n \"\"\"\n\n model_type = \"utrlm\"\n\n def __init__(\n self,\n vocab_size: int = 26,\n hidden_size: int = 128,\n num_hidden_layers: int = 6,\n num_attention_heads: int = 16,\n intermediate_size: int = 512,\n hidden_act: str = \"gelu\",\n hidden_dropout: float = 0.1,\n attention_dropout: float = 0.1,\n max_position_embeddings: int = 1026,\n initializer_range: float = 0.02,\n layer_norm_eps: float = 1e-12,\n position_embedding_type: str = \"rotary\",\n is_decoder: bool = False,\n use_cache: bool = True,\n emb_layer_norm_before: bool = False,\n token_dropout: bool = False,\n head: HeadConfig | None = None,\n lm_head: MaskedLMHeadConfig | None = None,\n ss_head: HeadConfig | None = None,\n mfe_head: HeadConfig | None = None,\n **kwargs,\n ):\n super().__init__(**kwargs)\n\n self.vocab_size = vocab_size\n self.hidden_size = hidden_size\n self.num_hidden_layers = num_hidden_layers\n self.num_attention_heads = num_attention_heads\n self.intermediate_size = intermediate_size\n self.hidden_act = hidden_act\n self.hidden_dropout = hidden_dropout\n self.attention_dropout = attention_dropout\n self.max_position_embeddings = max_position_embeddings\n self.initializer_range = initializer_range\n self.layer_norm_eps = layer_norm_eps\n self.position_embedding_type = position_embedding_type\n self.is_decoder = is_decoder\n self.use_cache = use_cache\n self.emb_layer_norm_before = emb_layer_norm_before\n self.token_dropout = token_dropout\n self.head = HeadConfig(**head) if head is not None else None\n self.lm_head = MaskedLMHeadConfig(**lm_head) if lm_head is not None else None\n self.ss_head = HeadConfig(**ss_head) if ss_head is not None else None\n self.mfe_head = HeadConfig(**mfe_head) if mfe_head is not None else None\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(vocab_size)","title":"vocab_size
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(num_hidden_layers)","title":"num_hidden_layers
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(num_attention_heads)","title":"num_attention_heads
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(intermediate_size)","title":"intermediate_size
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(hidden_dropout)","title":"hidden_dropout
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(attention_dropout)","title":"attention_dropout
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(max_position_embeddings)","title":"max_position_embeddings
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(initializer_range)","title":"initializer_range
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(position_embedding_type)","title":"position_embedding_type
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(is_decoder)","title":"is_decoder
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(use_cache)","title":"use_cache
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(emb_layer_norm_before)","title":"emb_layer_norm_before
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmConfig(token_dropout)","title":"token_dropout
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForContactPrediction","title":"UtrLmForContactPrediction","text":" Bases: UtrLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrLmConfig, UtrLmForContactPrediction, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForContactPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmForContactPrediction(UtrLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrLmConfig, UtrLmForContactPrediction, RnaTokenizer\n >>> config = UtrLmConfig()\n >>> model = UtrLmForContactPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: UtrLmConfig):\n super().__init__(config)\n self.utrlm = UtrLmModel(config, add_pooling_layer=True)\n self.contact_head = ContactPredictionHead(config)\n self.head_config = self.contact_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | ContactPredictorOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrlm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.contact_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return ContactPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForMaskedLM","title":"UtrLmForMaskedLM","text":" Bases: UtrLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForMaskedLM(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=input[\"input_ids\"])\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<NllLossBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmForMaskedLM(UtrLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n >>> config = UtrLmConfig()\n >>> model = UtrLmForMaskedLM(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=input[\"input_ids\"])\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<NllLossBackward0>)\n \"\"\"\n\n _tied_weights_keys = [\"lm_head.decoder.weight\", \"lm_head.decoder.bias\"]\n\n def __init__(self, config: UtrLmConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `UtrLmForMaskedLM` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.utrlm = UtrLmModel(config, add_pooling_layer=False)\n self.lm_head = MaskedLMHead(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | MaskedLMOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrlm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.lm_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return MaskedLMOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForPreTraining","title":"UtrLmForPreTraining","text":" Bases: UtrLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForPreTraining(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels_mlm=input[\"input_ids\"])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<AddBackward0>)\n>>> output[\"logits\"].shape\ntorch.Size([1, 7, 26])\n>>> output[\"contact_map\"].shape\ntorch.Size([1, 5, 5, 1])\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmForPreTraining(UtrLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n >>> config = UtrLmConfig()\n >>> model = UtrLmForPreTraining(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels_mlm=input[\"input_ids\"])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<AddBackward0>)\n >>> output[\"logits\"].shape\n torch.Size([1, 7, 26])\n >>> output[\"contact_map\"].shape\n torch.Size([1, 5, 5, 1])\n \"\"\"\n\n _tied_weights_keys = [\n \"lm_head.decoder.weight\",\n \"lm_head.decoder.bias\",\n \"pretrain.predictions.decoder.weight\",\n \"pretrain.predictions.decoder.bias\",\n \"pretrain.predictions_ss.decoder.weight\",\n \"pretrain.predictions_ss.decoder.bias\",\n ]\n\n def __init__(self, config: UtrLmConfig):\n super().__init__(config)\n if config.is_decoder:\n logger.warning(\n \"If you want to use `UtrLmForPreTraining` make sure `config.is_decoder=False` for \"\n \"bi-directional self-attention.\"\n )\n self.utrlm = UtrLmModel(config, add_pooling_layer=False)\n self.pretrain = UtrLmPreTrainingHeads(config)\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_output_embeddings(self):\n return self.pretrain.predictions.decoder\n\n def set_output_embeddings(self, embeddings):\n self.pretrain.predictions.decoder = embeddings\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n labels_mlm: Tensor | None = None,\n labels_contact: Tensor | None = None,\n labels_ss: Tensor | None = None,\n labels_mfe: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | UtrLmForPreTrainingOutput:\n if output_attentions is False:\n warn(\"output_attentions must be True for contact classification and will be ignored.\")\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrlm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_attention_mask,\n output_attentions=True,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n total_loss, logits, contact_map, secondary_structure, minimum_free_energy = self.pretrain(\n outputs,\n attention_mask,\n input_ids,\n labels_mlm=labels_mlm,\n labels_contact=labels_contact,\n labels_ss=labels_ss,\n labels_mfe=labels_mfe,\n )\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((total_loss,) + output) if total_loss is not None else output\n\n return UtrLmForPreTrainingOutput(\n loss=total_loss,\n logits=logits,\n contact_map=contact_map,\n secondary_structure=secondary_structure,\n minimum_free_energy=minimum_free_energy,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForSequencePrediction","title":"UtrLmForSequencePrediction","text":" Bases: UtrLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForSequencePrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.tensor([[1]]))\n>>> output[\"logits\"].shape\ntorch.Size([1, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmForSequencePrediction(UtrLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n >>> config = UtrLmConfig()\n >>> model = UtrLmForSequencePrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.tensor([[1]]))\n >>> output[\"logits\"].shape\n torch.Size([1, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: UtrLmConfig):\n super().__init__(config)\n self.utrlm = UtrLmModel(config, add_pooling_layer=True)\n self.sequence_head = SequencePredictionHead(config)\n self.head_config = self.sequence_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | SequencePredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrlm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.sequence_head(outputs, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return SequencePredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmForTokenPrediction","title":"UtrLmForTokenPrediction","text":" Bases: UtrLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmForTokenPrediction(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input, labels=torch.randint(2, (1, 5)))\n>>> output[\"logits\"].shape\ntorch.Size([1, 5, 1])\n>>> output[\"loss\"]\ntensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmForTokenPrediction(UtrLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n >>> config = UtrLmConfig()\n >>> model = UtrLmForTokenPrediction(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input, labels=torch.randint(2, (1, 5)))\n >>> output[\"logits\"].shape\n torch.Size([1, 5, 1])\n >>> output[\"loss\"] # doctest:+ELLIPSIS\n tensor(..., grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)\n \"\"\"\n\n def __init__(self, config: UtrLmConfig):\n super().__init__(config)\n self.utrlm = UtrLmModel(config, add_pooling_layer=True)\n self.token_head = TokenPredictionHead(config)\n self.head_config = self.token_head.config\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n labels: Tensor | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | TokenPredictorOutput:\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n outputs = self.utrlm(\n input_ids,\n attention_mask=attention_mask,\n position_ids=position_ids,\n head_mask=head_mask,\n inputs_embeds=inputs_embeds,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n **kwargs,\n )\n output = self.token_head(outputs, attention_mask, input_ids, labels)\n logits, loss = output.logits, output.loss\n\n if not return_dict:\n output = (logits,) + outputs[2:]\n return ((loss,) + output) if loss is not None else output\n\n return TokenPredictorOutput(\n loss=loss,\n logits=logits,\n hidden_states=outputs.hidden_states,\n attentions=outputs.attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel","title":"UtrLmModel","text":" Bases: UtrLmPreTrainedModel
Examples:
Python Console Session>>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n>>> config = UtrLmConfig()\n>>> model = UtrLmModel(config)\n>>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n>>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n>>> output = model(**input)\n>>> output[\"last_hidden_state\"].shape\ntorch.Size([1, 7, 128])\n>>> output[\"pooler_output\"].shape\ntorch.Size([1, 128])\n
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmModel(UtrLmPreTrainedModel):\n \"\"\"\n Examples:\n >>> from multimolecule import UtrLmConfig, UtrLmModel, RnaTokenizer\n >>> config = UtrLmConfig()\n >>> model = UtrLmModel(config)\n >>> tokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\n >>> input = tokenizer(\"ACGUN\", return_tensors=\"pt\")\n >>> output = model(**input)\n >>> output[\"last_hidden_state\"].shape\n torch.Size([1, 7, 128])\n >>> output[\"pooler_output\"].shape\n torch.Size([1, 128])\n \"\"\"\n\n def __init__(self, config: UtrLmConfig, add_pooling_layer: bool = True):\n super().__init__(config)\n self.pad_token_id = config.pad_token_id\n self.embeddings = UtrLmEmbeddings(config)\n self.encoder = UtrLmEncoder(config)\n self.pooler = UtrLmPooler(config) if add_pooling_layer else None\n\n # Initialize weights and apply final processing\n self.post_init()\n\n def get_input_embeddings(self):\n return self.embeddings.word_embeddings\n\n def set_input_embeddings(self, value):\n self.embeddings.word_embeddings = value\n\n def _prune_heads(self, heads_to_prune):\n \"\"\"\n Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base\n class PreTrainedModel\n \"\"\"\n for layer, heads in heads_to_prune.items():\n self.encoder.layer[layer].attention.prune_heads(heads)\n\n def forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n ) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. 
Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = (\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n 
input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward","title":"forward","text":"Pythonforward(input_ids: Tensor | NestedTensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, head_mask: Tensor | None = None, inputs_embeds: Tensor | NestedTensor | None = None, encoder_hidden_states: Tensor | None = None, encoder_attention_mask: Tensor | None = None, past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, **kwargs) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions\n
Parameters:
Name Type Description DefaultTensor | None
Shape: (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
None
Tensor | None
Shape: (batch_size, sequence_length)
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
None
Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None
Tuple of length config.n_layers
 with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that don\u2019t have their past key value states given to this model) of shape (batch_size, 1)
instead of all decoder_input_ids
of shape (batch_size, sequence_length)
.
None
bool | None
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see past_key_values
).
None
Source code in multimolecule/models/utrlm/modeling_utrlm.py
Pythondef forward(\n self,\n input_ids: Tensor | NestedTensor,\n attention_mask: Tensor | None = None,\n position_ids: Tensor | None = None,\n head_mask: Tensor | None = None,\n inputs_embeds: Tensor | NestedTensor | None = None,\n encoder_hidden_states: Tensor | None = None,\n encoder_attention_mask: Tensor | None = None,\n past_key_values: Tuple[Tuple[Tensor, Tensor, Tensor, Tensor], ...] | None = None,\n use_cache: bool | None = None,\n output_attentions: bool | None = None,\n output_hidden_states: bool | None = None,\n return_dict: bool | None = None,\n **kwargs,\n) -> Tuple[Tensor, ...] | BaseModelOutputWithPoolingAndCrossAttentions:\n r\"\"\"\n Args:\n encoder_hidden_states:\n Shape: `(batch_size, sequence_length, hidden_size)`\n\n Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if\n the model is configured as a decoder.\n encoder_attention_mask:\n Shape: `(batch_size, sequence_length)`\n\n Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used\n in the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:\n\n - 1 for tokens that are **not masked**,\n - 0 for tokens that are **masked**.\n past_key_values:\n Tuple of length `config.n_layers` with each tuple having 4 tensors of shape\n `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)\n\n Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up\n decoding.\n\n If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those\n that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of\n all `decoder_input_ids` of shape `(batch_size, sequence_length)`.\n use_cache:\n If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding\n (see `past_key_values`).\n \"\"\"\n if kwargs:\n warn(\n f\"Additional keyword arguments `{', '.join(kwargs)}` are detected in \"\n f\"`{self.__class__.__name__}.forward`, they will be ignored.\\n\"\n \"This is provided for backward compatibility and may lead to unexpected behavior.\"\n )\n output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions\n output_hidden_states = (\n output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states\n )\n return_dict = return_dict if return_dict is not None else self.config.use_return_dict\n\n if self.config.is_decoder:\n use_cache = use_cache if use_cache is not None else self.config.use_cache\n else:\n use_cache = False\n\n if isinstance(input_ids, NestedTensor):\n input_ids, attention_mask = input_ids.tensor, input_ids.mask\n if input_ids is not None and inputs_embeds is not None:\n raise ValueError(\"You cannot specify both input_ids and inputs_embeds at the same time\")\n if input_ids is not None:\n self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)\n input_shape = input_ids.size()\n elif inputs_embeds is not None:\n input_shape = inputs_embeds.size()[:-1]\n else:\n raise ValueError(\"You have to specify either input_ids or inputs_embeds\")\n\n batch_size, seq_length = input_shape\n device = input_ids.device if input_ids is not None else inputs_embeds.device # type: ignore[union-attr]\n\n # past_key_values_length\n past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0\n\n if attention_mask is None:\n attention_mask = 
(\n input_ids.ne(self.pad_token_id)\n if self.pad_token_id is not None\n else torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)\n )\n\n # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n # ourselves in which case we just need to make it broadcastable to all heads.\n extended_attention_mask: Tensor = self.get_extended_attention_mask(attention_mask, input_shape)\n\n # If a 2D or 3D attention mask is provided for the cross-attention\n # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]\n if self.config.is_decoder and encoder_hidden_states is not None:\n encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()\n encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n if encoder_attention_mask is None:\n encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)\n encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)\n else:\n encoder_extended_attention_mask = None\n\n # Prepare head mask if needed\n # 1.0 in head_mask indicate we keep the head\n # attention_probs has shape bsz x n_heads x N x N\n # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]\n # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]\n head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)\n\n embedding_output = self.embeddings(\n input_ids=input_ids,\n position_ids=position_ids,\n attention_mask=attention_mask,\n inputs_embeds=inputs_embeds,\n past_key_values_length=past_key_values_length,\n )\n encoder_outputs = self.encoder(\n embedding_output,\n attention_mask=extended_attention_mask,\n head_mask=head_mask,\n encoder_hidden_states=encoder_hidden_states,\n encoder_attention_mask=encoder_extended_attention_mask,\n past_key_values=past_key_values,\n use_cache=use_cache,\n output_attentions=output_attentions,\n output_hidden_states=output_hidden_states,\n return_dict=return_dict,\n )\n sequence_output = encoder_outputs[0]\n pooled_output = self.pooler(sequence_output) if self.pooler is not None else None\n\n if not return_dict:\n return (sequence_output, pooled_output) + encoder_outputs[1:]\n\n return BaseModelOutputWithPoolingAndCrossAttentions(\n last_hidden_state=sequence_output,\n pooler_output=pooled_output,\n past_key_values=encoder_outputs.past_key_values,\n hidden_states=encoder_outputs.hidden_states,\n attentions=encoder_outputs.attentions,\n cross_attentions=encoder_outputs.cross_attentions,\n )\n
"},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(encoder_hidden_states)","title":"encoder_hidden_states
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(encoder_attention_mask)","title":"encoder_attention_mask
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(past_key_values)","title":"past_key_values
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmModel.forward(use_cache)","title":"use_cache
","text":""},{"location":"models/utrlm/#multimolecule.models.utrlm.UtrLmPreTrainedModel","title":"UtrLmPreTrainedModel","text":" Bases: PreTrainedModel
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
Source code inmultimolecule/models/utrlm/modeling_utrlm.py
Pythonclass UtrLmPreTrainedModel(PreTrainedModel):\n \"\"\"\n An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained\n models.\n \"\"\"\n\n config_class = UtrLmConfig\n base_model_prefix = \"utrlm\"\n supports_gradient_checkpointing = True\n _no_split_modules = [\"UtrLmLayer\", \"UtrLmEmbeddings\"]\n\n # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights\n def _init_weights(self, module: nn.Module):\n \"\"\"Initialize the weights\"\"\"\n if isinstance(module, nn.Linear):\n # Slightly different from the TF version which uses truncated_normal for initialization\n # cf https://github.com/pytorch/pytorch/pull/5617\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.bias is not None:\n module.bias.data.zero_()\n elif isinstance(module, nn.Embedding):\n module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)\n if module.padding_idx is not None:\n module.weight.data[module.padding_idx].zero_()\n elif isinstance(module, nn.LayerNorm):\n module.bias.data.zero_()\n module.weight.data.fill_(1.0)\n
"},{"location":"module/","title":"module","text":"module
provides a collection of pre-defined modules for users to implement their own architectures.
MultiMolecule is built upon the Transformers ecosystem, embracing a similar design philosophy: Don\u2019t Repeat Yourself. We follow the single model file policy
where each model under the models
package contains one and only one modeling.py
file that describes the network design.
The module
package is intended for simple, reusable modules that are consistent across multiple models. This approach minimizes code duplication and promotes clean, maintainable code.
The module
 package includes components that are commonly used across different models, such as the SequencePredictionHead
. This reduces redundancy and simplifies the development process. More specialized, model-specific components belong in each model\u2019s own modeling.py.
 Pre-defined heads: SequencePredictionHead, TokenPredictionHead, and ContactPredictionHead. Pre-defined embeddings: SinusoidalEmbedding and RotaryEmbedding.
embeddings provide a collection of pre-defined positional embeddings.
Bases: Module
Rotary position embeddings based on those in RoFormer.
Queries and keys are transformed by rotation matrices which depend on their relative positions.
Cache: The inverse frequency buffer is cached and updated only when the sequence length changes or the device changes.
Sequence Length: Rotary Embedding is independent of the sequence length and can be used with sequences of any length.
Source code inmultimolecule/module/embeddings/rotary.py
Python@PositionEmbeddingRegistry.register(\"rotary\")\n@PositionEmbeddingRegistryHF.register(\"rotary\")\nclass RotaryEmbedding(nn.Module):\n \"\"\"\n Rotary position embeddings based on those in\n [RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer).\n\n Query and keys are transformed by rotation\n matrices which depend on their relative positions.\n\n Tip: **Cache**\n The inverse frequency buffer is cached and updated only when the sequence length changes or the device changes.\n\n Success: **Sequence Length**\n Rotary Embedding is irrespective of the sequence length and can be used for any sequence length.\n \"\"\"\n\n def __init__(self, embedding_dim: int):\n super().__init__()\n # Generate and save the inverse frequency buffer (non trainable)\n inv_freq = 1.0 / (10000 ** (torch.arange(0, embedding_dim, 2, dtype=torch.int64).float() / embedding_dim))\n self.register_buffer(\"inv_freq\", inv_freq)\n\n self._seq_len_cached = None\n self._cos_cached = None\n self._sin_cached = None\n\n def forward(self, q: Tensor, k: Tensor) -> Tuple[Tensor, Tensor]:\n self._update_cos_sin_tables(k, seq_dimension=-2)\n\n return (self.apply_rotary_pos_emb(q), self.apply_rotary_pos_emb(k))\n\n def _update_cos_sin_tables(self, x, seq_dimension=2):\n seq_len = x.shape[seq_dimension]\n\n # Reset the tables if the sequence length has changed,\n # or if we're on a new device (possibly due to tracing for instance)\n if seq_len != self._seq_len_cached or self._cos_cached.device != x.device:\n self._seq_len_cached = seq_len\n t = torch.arange(x.shape[seq_dimension], device=x.device).type_as(self.inv_freq)\n freqs = torch.outer(t, self.inv_freq)\n emb = torch.cat((freqs, freqs), dim=-1).to(x.device)\n\n self._cos_cached = emb.cos()[None, None, :, :]\n self._sin_cached = emb.sin()[None, None, :, :]\n\n return self._cos_cached, self._sin_cached\n\n def apply_rotary_pos_emb(self, x):\n cos = self._cos_cached[:, :, : x.shape[-2], :]\n sin = self._sin_cached[:, :, : x.shape[-2], :]\n\n return (x * cos) + (self.rotate_half(x) * sin)\n\n @staticmethod\n def rotate_half(x):\n x1, x2 = x.chunk(2, dim=-1)\n return torch.cat((-x2, x1), dim=-1)\n
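For illustration, a minimal usage sketch (the import path follows the anchor above and is an assumption; any even embedding dimension works):
Python Console Session>>> import torch\n>>> from multimolecule.module.embeddings import RotaryEmbedding  # assumed import path\n>>> rotary = RotaryEmbedding(embedding_dim=64)\n>>> q = torch.randn(1, 8, 10, 64)  # (batch, heads, seq_len, head_dim)\n>>> k = torch.randn(1, 8, 10, 64)\n>>> q_rot, k_rot = rotary(q, k)  # rotates queries and keys according to their positions\n>>> q_rot.shape\ntorch.Size([1, 8, 10, 64])\n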
"},{"location":"module/embeddings/#multimolecule.module.embeddings.SinusoidalEmbedding","title":"SinusoidalEmbedding","text":" Bases: Embedding
Sinusoidal positional embeddings for inputs with any length.
Freezing: The embeddings are frozen and cannot be trained. They will not be saved in the model\u2019s state_dict.
Padding Idx: Padding symbols are ignored if the padding_idx is specified.
Sequence Length: These embeddings are automatically extended in forward if more positions are needed.
Source code inmultimolecule/module/embeddings/sinusoidal.py
Python@PositionEmbeddingRegistry.register(\"sinusoidal\")\n@PositionEmbeddingRegistryHF.register(\"sinusoidal\")\nclass SinusoidalEmbedding(nn.Embedding):\n r\"\"\"\n Sinusoidal positional embeddings for inputs with any length.\n\n Note: **Freezing**\n The embeddings are frozen and cannot be trained.\n They will not be saved in the model's state_dict.\n\n Tip: **Padding Idx**\n Padding symbols are ignored if the padding_idx is specified.\n\n Success: **Sequence Length**\n These embeddings get automatically extended in forward if more positions is needed.\n \"\"\"\n\n _is_hf_initialized = True\n\n def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: int | None = None, bias: int = 0):\n weight = self.get_embedding(num_embeddings, embedding_dim, padding_idx)\n super().__init__(num_embeddings, embedding_dim, padding_idx, _weight=weight.detach(), _freeze=True)\n self.bias = bias\n\n def update_weight(self, num_embeddings: int, embedding_dim: int, padding_idx: int | None = None):\n weight = self.get_embedding(num_embeddings, embedding_dim, padding_idx).to(\n dtype=self.weight.dtype, device=self.weight.device # type: ignore[has-type]\n )\n self.weight = nn.Parameter(weight.detach(), requires_grad=False)\n\n @staticmethod\n def get_embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None) -> Tensor:\n \"\"\"\n Build sinusoidal embeddings.\n\n This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of\n \"Attention Is All You Need\".\n \"\"\"\n half_dim = embedding_dim // 2\n emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -(math.log(10000) / (half_dim - 1)))\n emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)\n emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)\n if embedding_dim % 2 == 1:\n # zero pad\n emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)\n if padding_idx is not None:\n emb[padding_idx, :] = 0\n return emb\n\n @staticmethod\n def get_position_ids(tensor, padding_idx: int | None = None):\n \"\"\"\n Replace non-padding symbols with their position numbers.\n\n Position numbers begin at padding_idx+1. Padding symbols are ignored.\n \"\"\"\n # The series of casts and type-conversions here are carefully\n # balanced to both work with ONNX export and XLA. In particular XLA\n # prefers ints, cumsum defaults to output longs, and ONNX doesn't know\n # how to handle the dtype kwarg in cumsum.\n if padding_idx is None:\n return torch.cumsum(tensor.new_ones(tensor.size(1)).long(), dim=0) - 1\n mask = tensor.ne(padding_idx).int()\n return (torch.cumsum(mask, dim=1).type_as(mask) * mask).long() + padding_idx\n\n def forward(self, input_ids: Tensor) -> Tensor:\n _, seq_len = input_ids.shape[:2]\n # expand embeddings if needed\n max_pos = seq_len + self.bias + 1\n if self.padding_idx is not None:\n max_pos += self.padding_idx\n if max_pos > self.weight.size(0):\n self.update_weight(max_pos, self.embedding_dim, self.padding_idx)\n # Need to shift the position ids by the padding index\n position_ids = self.get_position_ids(input_ids, self.padding_idx) + self.bias\n return super().forward(position_ids)\n\n def state_dict(self, destination=None, prefix=\"\", keep_vars=False):\n return {}\n\n def load_state_dict(self, *args, state_dict, strict=True):\n return\n\n def _load_from_state_dict(\n self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n ):\n return\n
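For illustration, a minimal usage sketch (the import path is an assumption; the input ids are arbitrary):
Python Console Session>>> import torch\n>>> from multimolecule.module.embeddings import SinusoidalEmbedding  # assumed import path\n>>> embedding = SinusoidalEmbedding(num_embeddings=10, embedding_dim=16, padding_idx=0)\n>>> input_ids = torch.tensor([[5, 7, 9, 0, 0]])  # 0 marks padding positions\n>>> embedding(input_ids).shape  # one positional embedding per token\ntorch.Size([1, 5, 16])\n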
"},{"location":"module/embeddings/#multimolecule.module.embeddings.SinusoidalEmbedding.get_embedding","title":"get_embedding staticmethod
","text":"Pythonget_embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None) -> Tensor\n
Build sinusoidal embeddings.
This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of \u201cAttention Is All You Need\u201d.
Source code inmultimolecule/module/embeddings/sinusoidal.py
Python@staticmethod\ndef get_embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None) -> Tensor:\n \"\"\"\n Build sinusoidal embeddings.\n\n This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of\n \"Attention Is All You Need\".\n \"\"\"\n half_dim = embedding_dim // 2\n emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -(math.log(10000) / (half_dim - 1)))\n emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)\n emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)\n if embedding_dim % 2 == 1:\n # zero pad\n emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)\n if padding_idx is not None:\n emb[padding_idx, :] = 0\n return emb\n
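A quick sketch of calling the static builder directly (import path assumed; shapes follow from the arguments):
Python Console Session>>> from multimolecule.module.embeddings import SinusoidalEmbedding  # assumed import path\n>>> weight = SinusoidalEmbedding.get_embedding(num_embeddings=6, embedding_dim=8, padding_idx=0)\n>>> weight.shape\ntorch.Size([6, 8])\n>>> weight[0].abs().sum()  # the padding row is zeroed out\ntensor(0.)\n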
"},{"location":"module/embeddings/#multimolecule.module.embeddings.SinusoidalEmbedding.get_position_ids","title":"get_position_ids staticmethod
","text":"Pythonget_position_ids(tensor, padding_idx: int | None = None)\n
Replace non-padding symbols with their position numbers.
Position numbers begin at padding_idx+1. Padding symbols are ignored.
Source code inmultimolecule/module/embeddings/sinusoidal.py
Python@staticmethod\ndef get_position_ids(tensor, padding_idx: int | None = None):\n \"\"\"\n Replace non-padding symbols with their position numbers.\n\n Position numbers begin at padding_idx+1. Padding symbols are ignored.\n \"\"\"\n # The series of casts and type-conversions here are carefully\n # balanced to both work with ONNX export and XLA. In particular XLA\n # prefers ints, cumsum defaults to output longs, and ONNX doesn't know\n # how to handle the dtype kwarg in cumsum.\n if padding_idx is None:\n return torch.cumsum(tensor.new_ones(tensor.size(1)).long(), dim=0) - 1\n mask = tensor.ne(padding_idx).int()\n return (torch.cumsum(mask, dim=1).type_as(mask) * mask).long() + padding_idx\n
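A quick sketch of the position numbering (import path assumed; padding positions map back to padding_idx):
Python Console Session>>> import torch\n>>> from multimolecule.module.embeddings import SinusoidalEmbedding  # assumed import path\n>>> input_ids = torch.tensor([[2, 5, 7, 0, 0]])  # 0 marks padding positions\n>>> SinusoidalEmbedding.get_position_ids(input_ids, padding_idx=0)\ntensor([[1, 2, 3, 0, 0]])\n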
"},{"location":"module/heads/","title":"heads","text":"heads
provide a collection of pre-defined prediction heads.
heads
 take in either a ModelOutput
, a dict
, or a tuple
 as input. They automatically look for the model output required for prediction and process it accordingly.
Some prediction heads may require additional information, such as the attention_mask
or the input_ids
, like ContactPredictionHead
. These additional arguments can be passed as positional or keyword arguments.
Note that heads
 use the same ModelOutput
 conventions as the Transformers library. If the model output is a tuple
, we consider the first element as the last_hidden_state
, the second element as the pooler_output
, and the last element as the attention_map
. It is the user\u2019s responsibility to ensure that the model output is correctly formatted.
If the model output is a ModelOutput
or a dict
, the heads
will look for the HeadConfig.output_name
from the model output. You can specify the output_name
in the HeadConfig
to ensure that the heads
can correctly locate the required tensor.
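As a minimal sketch of this lookup (the head and config import paths are assumptions; the tensors are hypothetical model outputs):
Python# A minimal sketch, assuming HeadConfig and SequencePredictionHead are exposed under multimolecule.module.heads.\nimport torch\nfrom multimolecule import UtrLmConfig\nfrom multimolecule.module.heads import HeadConfig, SequencePredictionHead\n\nconfig = UtrLmConfig()\nhead_config = HeadConfig(output_name=\"pooler_output\")  # tensor name the head should read\nhead = SequencePredictionHead(config, head_config)\n\n# A dict works like a ModelOutput: the head looks up outputs[\"pooler_output\"].\noutputs = {\n    \"last_hidden_state\": torch.randn(1, 7, config.hidden_size),\n    \"pooler_output\": torch.randn(1, config.hidden_size),\n}\nprediction = head(outputs)  # returns a HeadOutput with logits (and a loss when labels are given)\n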
Bases: BaseHeadConfig
Configuration class for a prediction head.
Parameters:
Name Type Description DefaultNumber of labels to use in the last layer added to the model, typically for a classification task.
Head should look for Config.num_labels
if is None
.
Problem type for XxxForYyyPrediction
models. Can be one of \"binary\"
, \"regression\"
, \"multiclass\"
or \"multilabel\"
.
Head should look for Config.problem_type
if is None
.
Dimensionality of the encoder layers and the pooler layer.
Head should look for Config.hidden_size
if is None
.
The dropout ratio for the hidden states.
requiredThe transform operation applied to hidden states.
requiredThe activation function of transform applied to hidden states.
requiredWhether to apply bias to the final prediction layer.
requiredThe activation function of the final prediction output.
requiredThe epsilon used by the layer normalization layers.
requiredThe name of the tensor required in model outputs.
If is None
, will use the default output name of the corresponding head.
The type of the head in the model.
This is used by MultiMoleculeModel
to construct heads.
multimolecule/module/heads/config.py
Pythonclass HeadConfig(BaseHeadConfig):\n r\"\"\"\n Configuration class for a prediction head.\n\n Args:\n num_labels:\n Number of labels to use in the last layer added to the model, typically for a classification task.\n\n Head should look for [`Config.num_labels`][multimolecule.PreTrainedConfig] if is `None`.\n problem_type:\n Problem type for `XxxForYyyPrediction` models. Can be one of `\"binary\"`, `\"regression\"`,\n `\"multiclass\"` or `\"multilabel\"`.\n\n Head should look for [`Config.problem_type`][multimolecule.PreTrainedConfig] if is `None`.\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n\n Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n dropout:\n The dropout ratio for the hidden states.\n transform:\n The transform operation applied to hidden states.\n transform_act:\n The activation function of transform applied to hidden states.\n bias:\n Whether to apply bias to the final prediction layer.\n act:\n The activation function of the final prediction output.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n output_name:\n The name of the tensor required in model outputs.\n\n If is `None`, will use the default output name of the corresponding head.\n type:\n The type of the head in the model.\n\n This is used by [`MultiMoleculeModel`][multimolecule.MultiMoleculeModel] to construct heads.\n \"\"\"\n\n num_labels: Optional[int] = None\n problem_type: Optional[str] = None\n hidden_size: Optional[int] = None\n dropout: float = 0.0\n transform: Optional[str] = None\n transform_act: Optional[str] = \"gelu\"\n bias: bool = True\n act: Optional[str] = None\n layer_norm_eps: float = 1e-12\n output_name: Optional[str] = None\n type: Optional[str] = None\n
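For illustration, a minimal construction sketch (the import path is an assumption; unset fields fall back to the model configuration as described above):
Pythonfrom multimolecule.module.heads import HeadConfig  # assumed import path\n\nhead_config = HeadConfig(\n    num_labels=2,                 # leave as None to fall back to Config.num_labels\n    problem_type=\"multiclass\",    # leave as None to fall back to Config.problem_type\n    dropout=0.1,\n    output_name=\"pooler_output\",  # tensor name the head reads from the model outputs\n)\n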
"},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(num_labels)","title":"num_labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(problem_type)","title":"problem_type
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(dropout)","title":"dropout
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(transform)","title":"transform
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(transform_act)","title":"transform_act
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(bias)","title":"bias
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(act)","title":"act
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.HeadConfig(type)","title":"type
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig","title":"MaskedLMHeadConfig","text":" Bases: BaseHeadConfig
Configuration class for a Masked Language Modeling head.
Parameters:
Name Type Description DefaultDimensionality of the encoder layers and the pooler layer.
Head should look for Config.hidden_size
if is None
.
The dropout ratio for the hidden states.
requiredThe transform operation applied to hidden states.
requiredThe activation function of transform applied to hidden states.
requiredWhether to apply bias to the final prediction layer.
requiredThe activation function of the final prediction output.
requiredThe epsilon used by the layer normalization layers.
requiredThe name of the tensor required in model outputs.
If is None
, will use the default output name of the corresponding head.
multimolecule/module/heads/config.py
Pythonclass MaskedLMHeadConfig(BaseHeadConfig):\n r\"\"\"\n Configuration class for a Masked Language Modeling head.\n\n Args:\n hidden_size:\n Dimensionality of the encoder layers and the pooler layer.\n\n Head should look for [`Config.hidden_size`][multimolecule.PreTrainedConfig] if is `None`.\n dropout:\n The dropout ratio for the hidden states.\n transform:\n The transform operation applied to hidden states.\n transform_act:\n The activation function of transform applied to hidden states.\n bias:\n Whether to apply bias to the final prediction layer.\n act:\n The activation function of the final prediction output.\n layer_norm_eps:\n The epsilon used by the layer normalization layers.\n output_name:\n The name of the tensor required in model outputs.\n\n If is `None`, will use the default output name of the corresponding head.\n \"\"\"\n\n hidden_size: Optional[int] = None\n dropout: float = 0.0\n transform: Optional[str] = \"nonlinear\"\n transform_act: Optional[str] = \"gelu\"\n bias: bool = True\n act: Optional[str] = None\n layer_norm_eps: float = 1e-12\n output_name: Optional[str] = None\n
"},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(hidden_size)","title":"hidden_size
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(dropout)","title":"dropout
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(transform)","title":"transform
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(transform_act)","title":"transform_act
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(bias)","title":"bias
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(act)","title":"act
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(layer_norm_eps)","title":"layer_norm_eps
","text":""},{"location":"module/heads/#multimolecule.module.heads.config.MaskedLMHeadConfig(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence","title":"multimolecule.module.heads.sequence","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead","title":"SequencePredictionHead","text":" Bases: PredictionHead
Head for sequence-level tasks.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/sequence.py
Python@HeadRegistry.register(\"sequence\")\nclass SequencePredictionHead(PredictionHead):\n r\"\"\"\n Head for tasks in sequence-level.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n output_name: str = \"pooler_output\"\n r\"\"\"The default output to use for the head.\"\"\"\n\n def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n super().__init__(config, head_config)\n if head_config is not None and head_config.output_name is not None:\n self.output_name = head_config.output_name\n\n def forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Tuple[Tensor, ...],\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n ) -> HeadOutput:\n r\"\"\"\n Forward pass of the SequencePredictionHead.\n\n Args:\n outputs: The outputs of the model.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[1]\n return super().forward(output, labels, **kwargs)\n
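As a usage illustration, a sketch pairing this head with the UtrLmModel documented above (the head import path is an assumption):
Pythonfrom multimolecule import RnaTokenizer, UtrLmConfig, UtrLmModel\nfrom multimolecule.module.heads import SequencePredictionHead  # assumed import path\n\nconfig = UtrLmConfig()\nmodel = UtrLmModel(config)             # exposes pooler_output\nhead = SequencePredictionHead(config)  # reads pooler_output by default\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nbatch = tokenizer(\"ACGUN\", return_tensors=\"pt\")\nprediction = head(model(**batch))      # HeadOutput with logits; pass labels=... to also get a loss\n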
"},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.output_name","title":"output_name class-attribute
instance-attribute
","text":"Pythonoutput_name: str = 'pooler_output'\n
The default output to use for the head.
"},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward","title":"forward","text":"Pythonforward(outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n
Forward pass of the SequencePredictionHead.
Parameters:
Name Type Description DefaultModelOutput | Tuple[Tensor, ...]
The outputs of the model.
requiredTensor | None
The labels for the head.
None
str | None
The name of the output to use. Defaults to self.output_name
.
None
Source code in multimolecule/module/heads/sequence.py
Pythondef forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Tuple[Tensor, ...],\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n) -> HeadOutput:\n r\"\"\"\n Forward pass of the SequencePredictionHead.\n\n Args:\n outputs: The outputs of the model.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[1]\n return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward(outputs)","title":"outputs
","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.sequence.SequencePredictionHead.forward(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.token","title":"multimolecule.module.heads.token","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead","title":"TokenPredictionHead","text":" Bases: PredictionHead
Head for token-level tasks.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/token.py
Python@HeadRegistry.token.register(\"single\", default=True)\n@TokenHeadRegistryHF.register(\"single\", default=True)\nclass TokenPredictionHead(PredictionHead):\n r\"\"\"\n Head for tasks in token-level.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n output_name: str = \"last_hidden_state\"\n r\"\"\"The default output to use for the head.\"\"\"\n\n def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n super().__init__(config, head_config)\n if head_config is not None and head_config.output_name is not None:\n self.output_name = head_config.output_name\n\n def forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n ) -> HeadOutput:\n r\"\"\"\n Forward pass of the TokenPredictionHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n output = output * attention_mask.unsqueeze(-1)\n output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n return super().forward(output, labels, **kwargs)\n
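As a usage illustration, a sketch of per-token prediction with the UtrLmModel documented above (the head import path is an assumption):
Pythonfrom multimolecule import RnaTokenizer, UtrLmConfig, UtrLmModel\nfrom multimolecule.module.heads import TokenPredictionHead  # assumed import path\n\nconfig = UtrLmConfig()\nmodel = UtrLmModel(config)\nhead = TokenPredictionHead(config)  # reads last_hidden_state by default\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nbatch = tokenizer(\"ACGUN\", return_tensors=\"pt\")\noutputs = model(**batch)\n# attention_mask and input_ids let the head mask padding and strip special tokens\nprediction = head(outputs, attention_mask=batch[\"attention_mask\"], input_ids=batch[\"input_ids\"])\n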
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.output_name","title":"output_name class-attribute
instance-attribute
","text":"Pythonoutput_name: str = 'last_hidden_state'\n
The default output to use for the head.
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward","title":"forward","text":"Pythonforward(outputs: ModelOutput | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n
Forward pass of the TokenPredictionHead.
Parameters:
Name Type Description DefaultModelOutput | Tuple[Tensor, ...]
The outputs of the model.
requiredTensor | None
The attention mask for the inputs.
None
NestedTensor | Tensor | None
The input ids for the inputs.
None
Tensor | None
The labels for the head.
None
str | None
The name of the output to use. Defaults to self.output_name
.
None
Source code in multimolecule/module/heads/token.py
Pythondef forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n) -> HeadOutput:\n r\"\"\"\n Forward pass of the TokenPredictionHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n output = output * attention_mask.unsqueeze(-1)\n output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(outputs)","title":"outputs
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(attention_mask)","title":"attention_mask
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(input_ids)","title":"input_ids
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenPredictionHead.forward(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead","title":"TokenKMerHead","text":" Bases: PredictionHead
Head for token-level tasks with k-mer inputs.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/token.py
Python@HeadRegistry.register(\"token.kmer\")\n@TokenHeadRegistryHF.register(\"kmer\")\nclass TokenKMerHead(PredictionHead):\n r\"\"\"\n Head for tasks in token-level with kmer inputs.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n output_name: str = \"last_hidden_state\"\n r\"\"\"The default output to use for the head.\"\"\"\n\n def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n super().__init__(config, head_config)\n self.nmers = config.nmers\n if head_config is not None and head_config.output_name is not None:\n self.output_name = head_config.output_name\n # Do not pass bos_token_id and eos_token_id to unfold_kmer_embeddings\n # As they will be removed in preprocess\n self.unfold_kmer_embeddings = partial(unfold_kmer_embeddings, nmers=self.nmers)\n\n def forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n ) -> HeadOutput:\n r\"\"\"\n Forward pass of the TokenKMerHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n output = output * attention_mask.unsqueeze(-1)\n output, attention_mask, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n output = self.unfold_kmer_embeddings(output, attention_mask)\n return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.output_name","title":"output_name class-attribute
instance-attribute
","text":"Pythonoutput_name: str = 'last_hidden_state'\n
The default output to use for the head.
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward","title":"forward","text":"Pythonforward(outputs: ModelOutput | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n
Forward pass of the TokenKMerHead.
Parameters:
Name Type Description DefaultModelOutput | Tuple[Tensor, ...]
The outputs of the model.
requiredTensor | None
The attention mask for the inputs.
None
NestedTensor | Tensor | None
The input ids for the inputs.
None
Tensor | None
The labels for the head.
None
str | None
The name of the output to use. Defaults to self.output_name
.
None
Source code in multimolecule/module/heads/token.py
Pythondef forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n) -> HeadOutput:\n r\"\"\"\n Forward pass of the TokenKMerHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n output = output * attention_mask.unsqueeze(-1)\n output, attention_mask, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n output = self.unfold_kmer_embeddings(output, attention_mask)\n return super().forward(output, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(outputs)","title":"outputs
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(attention_mask)","title":"attention_mask
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(input_ids)","title":"input_ids
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.token.TokenKMerHead.forward(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact","title":"multimolecule.module.heads.contact","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead","title":"ContactPredictionHead","text":" Bases: PredictionHead
Head for contact-level tasks.
Performs symmetrization and average product correction.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/contact.py
Python@HeadRegistry.contact.register(\"attention\")\nclass ContactPredictionHead(PredictionHead):\n r\"\"\"\n Head for tasks in contact-level.\n\n Performs symmetrization, and average product correct.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n output_name: str = \"attentions\"\n r\"\"\"The default output to use for the head.\"\"\"\n\n requires_attention: bool = True\n\n def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n super().__init__(config, head_config)\n self.config.hidden_size = config.num_hidden_layers * config.num_attention_heads\n num_layers = self.config.get(\"num_layers\", 16)\n num_channels = self.config.get(\"num_channels\", self.config.hidden_size // 10) # type: ignore[operator]\n block = self.config.get(\"block\", \"auto\")\n self.decoder = ResNet(\n num_layers=num_layers,\n hidden_size=self.config.hidden_size, # type: ignore[arg-type]\n block=block,\n num_channels=num_channels,\n num_labels=self.num_labels,\n )\n if head_config is not None and head_config.output_name is not None:\n self.output_name = head_config.output_name\n\n def forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n ) -> HeadOutput:\n r\"\"\"\n Forward pass of the ContactPredictionHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[-1]\n attentions = torch.stack(output, 1)\n\n # In the original model, attentions for padding tokens are completely zeroed out.\n # This makes no difference most of the time because the other tokens won't attend to them,\n # but it does for the contact prediction task, which takes attentions as input,\n # so we have to mimic that here.\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n attention_mask = attention_mask.unsqueeze(1) * attention_mask.unsqueeze(2)\n attentions = attentions * attention_mask[:, None, None, :, :]\n\n # remove cls token attentions\n if self.bos_token_id is not None:\n attentions = attentions[..., 1:, 1:]\n attention_mask = attention_mask[..., 1:]\n if input_ids is not None:\n input_ids = input_ids[..., 1:]\n # remove eos token attentions\n if self.eos_token_id is not None:\n if input_ids is not None:\n eos_mask = input_ids.ne(self.eos_token_id).to(attentions)\n else:\n last_valid_indices = attention_mask.sum(dim=-1)\n seq_length = attention_mask.size(-1)\n eos_mask = torch.arange(seq_length, device=attentions.device).unsqueeze(0) == last_valid_indices\n eos_mask = eos_mask.unsqueeze(1) * eos_mask.unsqueeze(2)\n attentions = attentions * eos_mask[:, None, None, :, :]\n attentions = attentions[..., :-1, :-1]\n\n # features: batch x channels x input_ids x input_ids (symmetric)\n batch_size, layers, heads, seqlen, _ = attentions.size()\n attentions = attentions.view(batch_size, layers * heads, seqlen, seqlen)\n attentions = 
attentions.to(self.decoder.proj.weight.device)\n attentions = average_product_correct(symmetrize(attentions))\n attentions = attentions.permute(0, 2, 3, 1).squeeze(3)\n\n return super().forward(attentions, labels, **kwargs)\n
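Because this head consumes attention maps, the backbone must be called with output_attentions=True; a minimal sketch (the head import path is an assumption):
Pythonfrom multimolecule import RnaTokenizer, UtrLmConfig, UtrLmModel\nfrom multimolecule.module.heads import ContactPredictionHead  # assumed import path\n\nconfig = UtrLmConfig()\nmodel = UtrLmModel(config)\nhead = ContactPredictionHead(config)  # reads the stacked attentions by default\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rna\")\nbatch = tokenizer(\"ACGUN\", return_tensors=\"pt\")\noutputs = model(**batch, output_attentions=True)  # attention maps are required by this head\nprediction = head(outputs, attention_mask=batch[\"attention_mask\"], input_ids=batch[\"input_ids\"])\n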
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.output_name","title":"output_name class-attribute
instance-attribute
","text":"Pythonoutput_name: str = 'attentions'\n
The default output to use for the head.
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward","title":"forward","text":"Pythonforward(outputs: ModelOutput | Mapping | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n
Forward pass of the ContactPredictionHead.
Parameters:
Name Type Description DefaultModelOutput | Mapping | Tuple[Tensor, ...]
The outputs of the model.
requiredTensor | None
The attention mask for the inputs.
None
NestedTensor | Tensor | None
The input ids for the inputs.
None
Tensor | None
The labels for the head.
None
str | None
The name of the output to use. Defaults to self.output_name
.
None
Source code in multimolecule/module/heads/contact.py
Pythondef forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n) -> HeadOutput:\n r\"\"\"\n Forward pass of the ContactPredictionHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[-1]\n attentions = torch.stack(output, 1)\n\n # In the original model, attentions for padding tokens are completely zeroed out.\n # This makes no difference most of the time because the other tokens won't attend to them,\n # but it does for the contact prediction task, which takes attentions as input,\n # so we have to mimic that here.\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n attention_mask = attention_mask.unsqueeze(1) * attention_mask.unsqueeze(2)\n attentions = attentions * attention_mask[:, None, None, :, :]\n\n # remove cls token attentions\n if self.bos_token_id is not None:\n attentions = attentions[..., 1:, 1:]\n attention_mask = attention_mask[..., 1:]\n if input_ids is not None:\n input_ids = input_ids[..., 1:]\n # remove eos token attentions\n if self.eos_token_id is not None:\n if input_ids is not None:\n eos_mask = input_ids.ne(self.eos_token_id).to(attentions)\n else:\n last_valid_indices = attention_mask.sum(dim=-1)\n seq_length = attention_mask.size(-1)\n eos_mask = torch.arange(seq_length, device=attentions.device).unsqueeze(0) == last_valid_indices\n eos_mask = eos_mask.unsqueeze(1) * eos_mask.unsqueeze(2)\n attentions = attentions * eos_mask[:, None, None, :, :]\n attentions = attentions[..., :-1, :-1]\n\n # features: batch x channels x input_ids x input_ids (symmetric)\n batch_size, layers, heads, seqlen, _ = attentions.size()\n attentions = attentions.view(batch_size, layers * heads, seqlen, seqlen)\n attentions = attentions.to(self.decoder.proj.weight.device)\n attentions = average_product_correct(symmetrize(attentions))\n attentions = attentions.permute(0, 2, 3, 1).squeeze(3)\n\n return super().forward(attentions, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(outputs)","title":"outputs
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(attention_mask)","title":"attention_mask
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(input_ids)","title":"input_ids
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactPredictionHead.forward(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead","title":"ContactLogitsHead","text":" Bases: PredictionHead
Head for contact-level tasks.
Performs symmetrization and average product correction.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/contact.py
Python@HeadRegistry.contact.register(\"logits\")\nclass ContactLogitsHead(PredictionHead):\n r\"\"\"\n Head for tasks in contact-level.\n\n Performs symmetrization, and average product correct.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n output_name: str = \"last_hidden_state\"\n r\"\"\"The default output to use for the head.\"\"\"\n\n requires_attention: bool = False\n\n def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n super().__init__(config, head_config)\n num_layers = self.config.get(\"num_layers\", 16)\n num_channels = self.config.get(\"num_channels\", self.config.hidden_size // 10) # type: ignore[operator]\n block = self.config.get(\"block\", \"auto\")\n self.decoder = ResNet(\n num_layers=num_layers,\n hidden_size=self.config.hidden_size, # type: ignore[arg-type]\n block=block,\n num_channels=num_channels,\n num_labels=self.num_labels,\n )\n if head_config is not None and head_config.output_name is not None:\n self.output_name = head_config.output_name\n\n def forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n ) -> HeadOutput:\n r\"\"\"\n Forward pass of the ContactPredictionHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n output = output * attention_mask.unsqueeze(-1)\n output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n # make symmetric contact map\n contact_map = output.unsqueeze(1) * output.unsqueeze(2)\n\n return super().forward(contact_map, labels, **kwargs)\n
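In contrast to the attention-based head above, this head derives pairwise features directly from the hidden states via an outer product. A minimal standalone PyTorch sketch of that step, with hypothetical sizes and without masking or special-token removal:
Python
import torch

batch, seqlen, hidden = 2, 32, 64
hidden_states = torch.rand(batch, seqlen, hidden)  # e.g. last_hidden_state

# pairwise map: position (i, j) holds the elementwise product of embeddings i and j
contact_map = hidden_states.unsqueeze(1) * hidden_states.unsqueeze(2)
print(contact_map.shape)  # torch.Size([2, 32, 32, 64])

# the map is symmetric by construction
assert torch.allclose(contact_map, contact_map.transpose(1, 2))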
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.output_name","title":"output_name class-attribute
instance-attribute
","text":"Pythonoutput_name: str = 'last_hidden_state'\n
The default output to use for the head.
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward","title":"forward","text":"Pythonforward(outputs: ModelOutput | Mapping | Tuple[Tensor, ...], attention_mask: Tensor | None = None, input_ids: NestedTensor | Tensor | None = None, labels: Tensor | None = None, output_name: str | None = None, **kwargs) -> HeadOutput\n
Forward pass of the ContactLogitsHead.
Parameters:
Name Type Description DefaultModelOutput | Mapping | Tuple[Tensor, ...]
The outputs of the model.
requiredTensor | None
The attention mask for the inputs.
None
NestedTensor | Tensor | None
The input ids for the inputs.
None
Tensor | None
The labels for the head.
None
str | None
The name of the output to use. Defaults to self.output_name
.
None
Source code in multimolecule/module/heads/contact.py
Pythondef forward( # type: ignore[override] # pylint: disable=arguments-renamed\n self,\n outputs: ModelOutput | Mapping | Tuple[Tensor, ...],\n attention_mask: Tensor | None = None,\n input_ids: NestedTensor | Tensor | None = None,\n labels: Tensor | None = None,\n output_name: str | None = None,\n **kwargs,\n) -> HeadOutput:\n r\"\"\"\n Forward pass of the ContactPredictionHead.\n\n Args:\n outputs: The outputs of the model.\n attention_mask: The attention mask for the inputs.\n input_ids: The input ids for the inputs.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n\n if attention_mask is None:\n attention_mask = self._get_attention_mask(input_ids)\n output = output * attention_mask.unsqueeze(-1)\n output, _, _ = self._remove_special_tokens(output, attention_mask, input_ids)\n\n # make symmetric contact map\n contact_map = output.unsqueeze(1) * output.unsqueeze(2)\n\n return super().forward(contact_map, labels, **kwargs)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(outputs)","title":"outputs
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(attention_mask)","title":"attention_mask
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(input_ids)","title":"input_ids
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.ContactLogitsHead.forward(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.contact.symmetrize","title":"symmetrize","text":"Pythonsymmetrize(x)\n
Make layer symmetric in final two dimensions, used for contact prediction.
Source code in multimolecule/module/heads/contact.py
Pythondef symmetrize(x):\n \"Make layer symmetric in final two dimensions, used for contact prediction.\"\n return x + x.transpose(-1, -2)\n
"},{"location":"module/heads/#multimolecule.module.heads.contact.average_product_correct","title":"average_product_correct","text":"Pythonaverage_product_correct(x)\n
Perform average product correction, used for contact prediction.
Source code in multimolecule/module/heads/contact.py
Pythondef average_product_correct(x):\n \"Perform average product correct, used for contact prediction.\"\n a1 = x.sum(-1, keepdims=True)\n a2 = x.sum(-2, keepdims=True)\n a12 = x.sum((-1, -2), keepdims=True)\n\n avg = a1 * a2\n avg.div_(a12) # in-place to reduce memory\n normalized = x - avg\n return normalized\n
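A quick sanity check of the two helpers; the import path below is assumed from the source location shown above:
Python
import torch

from multimolecule.module.heads.contact import average_product_correct, symmetrize

x = torch.rand(1, 4, 4)
s = symmetrize(x)
# symmetric in the final two dimensions
assert torch.allclose(s, s.transpose(-1, -2))

c = average_product_correct(s)
print(c.shape)  # torch.Size([1, 4, 4])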
"},{"location":"module/heads/#multimolecule.module.heads.pretrain","title":"multimolecule.module.heads.pretrain","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead","title":"MaskedLMHead","text":" Bases: Module
Head for masked language modeling.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredMaskedLMHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/pretrain.py
Python@HeadRegistry.register(\"masked_lm\")\nclass MaskedLMHead(nn.Module):\n r\"\"\"\n Head for masked language modeling.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n output_name: str = \"last_hidden_state\"\n r\"\"\"The default output to use for the head.\"\"\"\n\n def __init__(\n self, config: PreTrainedConfig, weight: Tensor | None = None, head_config: MaskedLMHeadConfig | None = None\n ):\n super().__init__()\n if head_config is None:\n head_config = (config.lm_head if hasattr(config, \"lm_head\") else config.head) or MaskedLMHeadConfig()\n self.config: MaskedLMHeadConfig = head_config\n if self.config.hidden_size is None:\n self.config.hidden_size = config.hidden_size\n self.num_labels = config.vocab_size\n self.dropout = nn.Dropout(self.config.dropout)\n self.transform = HeadTransformRegistryHF.build(self.config)\n self.decoder = nn.Linear(self.config.hidden_size, self.num_labels, bias=False)\n if weight is not None:\n self.decoder.weight = weight\n if self.config.bias:\n self.bias = nn.Parameter(torch.zeros(self.num_labels))\n self.decoder.bias = self.bias\n self.activation = ACT2FN[self.config.act] if self.config.act is not None else None\n if head_config is not None and head_config.output_name is not None:\n self.output_name = head_config.output_name\n\n def forward(\n self, outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None\n ) -> HeadOutput:\n r\"\"\"\n Forward pass of the MaskedLMHead.\n\n Args:\n outputs: The outputs of the model.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n output = self.dropout(output)\n output = self.transform(output)\n output = self.decoder(output)\n if self.activation is not None:\n output = self.activation(output)\n if labels is not None:\n if isinstance(labels, NestedTensor):\n if isinstance(output, Tensor):\n output = labels.nested_like(output, strict=False)\n return HeadOutput(output, F.cross_entropy(output.concat, labels.concat))\n return HeadOutput(output, F.cross_entropy(output.view(-1, self.num_labels), labels.view(-1)))\n return HeadOutput(output)\n
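The loss computed above is a standard token-level cross-entropy over the vocabulary. A standalone PyTorch sketch of that computation with hypothetical sizes; -100 marks positions that F.cross_entropy ignores by default:
Python
import torch
import torch.nn.functional as F

batch, seqlen, vocab = 2, 16, 26
logits = torch.randn(batch, seqlen, vocab)          # decoder output
labels = torch.randint(0, vocab, (batch, seqlen))   # target token ids
labels[:, :4] = -100                                # e.g. positions that were not masked

loss = F.cross_entropy(logits.view(-1, vocab), labels.view(-1))
print(loss.item())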
"},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.output_name","title":"output_name class-attribute
instance-attribute
","text":"Pythonoutput_name: str = 'last_hidden_state'\n
The default output to use for the head.
"},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward","title":"forward","text":"Pythonforward(outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None) -> HeadOutput\n
Forward pass of the MaskedLMHead.
Parameters:
Name Type Description DefaultModelOutput | Tuple[Tensor, ...]
The outputs of the model.
requiredTensor | None
The labels for the head.
None
str | None
The name of the output to use. Defaults to self.output_name
.
None
Source code in multimolecule/module/heads/pretrain.py
Pythondef forward(\n self, outputs: ModelOutput | Tuple[Tensor, ...], labels: Tensor | None = None, output_name: str | None = None\n) -> HeadOutput:\n r\"\"\"\n Forward pass of the MaskedLMHead.\n\n Args:\n outputs: The outputs of the model.\n labels: The labels for the head.\n output_name: The name of the output to use.\n Defaults to `self.output_name`.\n \"\"\"\n if isinstance(outputs, (Mapping, ModelOutput)):\n output = outputs[output_name or self.output_name]\n elif isinstance(outputs, tuple):\n output = outputs[0]\n else:\n raise ValueError(f\"Unsupported type for outputs: {type(outputs)}\")\n output = self.dropout(output)\n output = self.transform(output)\n output = self.decoder(output)\n if self.activation is not None:\n output = self.activation(output)\n if labels is not None:\n if isinstance(labels, NestedTensor):\n if isinstance(output, Tensor):\n output = labels.nested_like(output, strict=False)\n return HeadOutput(output, F.cross_entropy(output.concat, labels.concat))\n return HeadOutput(output, F.cross_entropy(output.view(-1, self.num_labels), labels.view(-1)))\n return HeadOutput(output)\n
"},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward(outputs)","title":"outputs
","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.pretrain.MaskedLMHead.forward(output_name)","title":"output_name
","text":""},{"location":"module/heads/#multimolecule.module.heads.generic","title":"multimolecule.module.heads.generic","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead","title":"PredictionHead","text":" Bases: Module
Head for tasks at all levels.
Parameters:
Name Type Description DefaultPreTrainedConfig
The configuration object for the model.
requiredHeadConfig | None
The configuration object for the head. If None, will use configuration from the config
.
None
Source code in multimolecule/module/heads/generic.py
Pythonclass PredictionHead(nn.Module):\n r\"\"\"\n Head for all-level of tasks.\n\n Args:\n config: The configuration object for the model.\n head_config: The configuration object for the head.\n If None, will use configuration from the `config`.\n \"\"\"\n\n num_labels: int\n requires_attention: bool = False\n\n def __init__(self, config: PreTrainedConfig, head_config: HeadConfig | None = None):\n super().__init__()\n if head_config is None:\n head_config = config.head or HeadConfig(num_labels=config.num_labels)\n elif head_config.num_labels is None:\n head_config.num_labels = config.num_labels\n self.config = head_config\n if self.config.hidden_size is None:\n self.config.hidden_size = config.hidden_size\n if self.config.problem_type is None:\n self.config.problem_type = config.problem_type\n self.bos_token_id = config.bos_token_id\n self.eos_token_id = config.eos_token_id\n self.pad_token_id = config.pad_token_id\n self.num_labels = self.config.num_labels # type: ignore[assignment]\n self.dropout = nn.Dropout(self.config.dropout)\n self.transform = HeadTransformRegistryHF.build(self.config)\n self.decoder = nn.Linear(self.config.hidden_size, self.num_labels, bias=self.config.bias)\n self.activation = ACT2FN[self.config.act] if self.config.act is not None else None\n self.criterion = CriterionRegistry.build(self.config)\n\n def forward(self, embeddings: Tensor, labels: Tensor | None, **kwargs) -> HeadOutput:\n r\"\"\"\n Forward pass of the PredictionHead.\n\n Args:\n embeddings: The embeddings to be passed through the head.\n labels: The labels for the head.\n \"\"\"\n if kwargs:\n warn(\n f\"The following arguments are not applicable to {self.__class__.__name__}\"\n f\"and will be ignored: {kwargs.keys()}\"\n )\n output = self.dropout(embeddings)\n output = self.transform(output)\n output = self.decoder(output)\n if self.activation is not None:\n output = self.activation(output)\n if labels is not None:\n if isinstance(labels, NestedTensor):\n if isinstance(output, Tensor):\n output = labels.nested_like(output, strict=False)\n return HeadOutput(output, self.criterion(output.concat, labels.concat))\n return HeadOutput(output, self.criterion(output, labels))\n return HeadOutput(output)\n\n def _get_attention_mask(self, input_ids: NestedTensor | Tensor) -> Tensor:\n if isinstance(input_ids, NestedTensor):\n return input_ids.mask\n if input_ids is None:\n raise ValueError(\n f\"Either attention_mask or input_ids must be provided for {self.__class__.__name__} to work.\"\n )\n if self.pad_token_id is None:\n raise ValueError(\n f\"pad_token_id must be provided when attention_mask is not passed to {self.__class__.__name__}.\"\n )\n return input_ids.ne(self.pad_token_id)\n\n def _remove_special_tokens(\n self, output: Tensor, attention_mask: Tensor, input_ids: Tensor | None\n ) -> Tuple[Tensor, Tensor, Tensor]:\n # remove cls token embeddings\n if self.bos_token_id is not None:\n output = output[..., 1:, :]\n attention_mask = attention_mask[..., 1:]\n if input_ids is not None:\n input_ids = input_ids[..., 1:]\n # remove eos token embeddings\n if self.eos_token_id is not None:\n if input_ids is not None:\n eos_mask = input_ids.ne(self.eos_token_id).to(output)\n input_ids = input_ids[..., :-1]\n else:\n last_valid_indices = attention_mask.sum(dim=-1)\n seq_length = attention_mask.size(-1)\n eos_mask = torch.arange(seq_length, device=output.device) == last_valid_indices.unsqueeze(1)\n output = output * eos_mask[:, :, None]\n output = output[..., :-1, :]\n attention_mask = attention_mask[..., 1:]\n 
return output, attention_mask, input_ids\n
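Conceptually, the head is a small pipeline: dropout, an optional transform, a linear decoder, an optional activation, and a criterion. A minimal stand-in for that pipeline in plain PyTorch, not the library's own classes, with hypothetical sizes:
Python
import torch
from torch import nn

hidden_size, num_labels = 64, 2
dropout = nn.Dropout(0.1)
decoder = nn.Linear(hidden_size, num_labels)
criterion = nn.CrossEntropyLoss()

embeddings = torch.randn(8, hidden_size)            # e.g. sequence-level embeddings
labels = torch.randint(0, num_labels, (8,))

logits = decoder(dropout(embeddings))
loss = criterion(logits, labels)
print(logits.shape, loss.item())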
"},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead(config)","title":"config
","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead(head_config)","title":"head_config
","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead.forward","title":"forward","text":"Pythonforward(embeddings: Tensor, labels: Tensor | None, **kwargs) -> HeadOutput\n
Forward pass of the PredictionHead.
Parameters:
Name Type Description DefaultTensor
The embeddings to be passed through the head.
requiredTensor | None
The labels for the head.
required Source code in multimolecule/module/heads/generic.py
Pythondef forward(self, embeddings: Tensor, labels: Tensor | None, **kwargs) -> HeadOutput:\n r\"\"\"\n Forward pass of the PredictionHead.\n\n Args:\n embeddings: The embeddings to be passed through the head.\n labels: The labels for the head.\n \"\"\"\n if kwargs:\n warn(\n f\"The following arguments are not applicable to {self.__class__.__name__}\"\n f\"and will be ignored: {kwargs.keys()}\"\n )\n output = self.dropout(embeddings)\n output = self.transform(output)\n output = self.decoder(output)\n if self.activation is not None:\n output = self.activation(output)\n if labels is not None:\n if isinstance(labels, NestedTensor):\n if isinstance(output, Tensor):\n output = labels.nested_like(output, strict=False)\n return HeadOutput(output, self.criterion(output.concat, labels.concat))\n return HeadOutput(output, self.criterion(output, labels))\n return HeadOutput(output)\n
"},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead.forward(embeddings)","title":"embeddings
","text":""},{"location":"module/heads/#multimolecule.module.heads.generic.PredictionHead.forward(labels)","title":"labels
","text":""},{"location":"module/heads/#multimolecule.module.heads.output","title":"multimolecule.module.heads.output","text":""},{"location":"module/heads/#multimolecule.module.heads.output.HeadOutput","title":"HeadOutput dataclass
","text":" Bases: ModelOutput
Output of a prediction head.
Parameters:
Name Type Description DefaultFloatTensor
The prediction logits from the head.
requiredFloatTensor | None
The loss from the head. Defaults to None.
None
Source code in multimolecule/module/heads/output.py
Python@dataclass\nclass HeadOutput(ModelOutput):\n r\"\"\"\n Output of a prediction head.\n\n Args:\n logits: The prediction logits from the head.\n loss: The loss from the head.\n Defaults to None.\n \"\"\"\n\n logits: FloatTensor\n loss: FloatTensor | None = None\n
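HeadOutput behaves like any other ModelOutput: fields can be read by attribute or by key. A small sketch, assuming the import path shown above:
Python
import torch

from multimolecule.module.heads.output import HeadOutput

output = HeadOutput(logits=torch.randn(2, 5))
print(output.logits.shape)     # attribute access
print(output["logits"].shape)  # mapping-style access
print(output.loss)             # None when no labels were provided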
"},{"location":"module/heads/#multimolecule.module.heads.output.HeadOutput(logits)","title":"logits
","text":""},{"location":"module/heads/#multimolecule.module.heads.output.HeadOutput(loss)","title":"loss
","text":""},{"location":"tokenisers/","title":"tokenisers","text":"tokenisers
provides a collection of pre-defined tokenizers.
A tokenizer is a class that converts a sequence of nucleotides or amino acids into a sequence of indices. It is used to pre-process the input sequence before feeding it into a model.
Please refer to Tokenizer for more details.
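For example, converting a short RNA sequence into indices (taken from the RnaTokenizer examples below):
Python Console Session
>>> from multimolecule import RnaTokenizer
>>> tokenizer = RnaTokenizer()
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]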
"},{"location":"tokenisers/#available-tokenizers","title":"Available Tokenizers","text":"DnaTokenizer is smart, it tokenizes raw DNA nucleotides into tokens, no matter if the input is in uppercase or lowercase, uses T (Thymine) or U (Uracil), and with or without special tokens. It also supports tokenization into nmers and codons, so you don\u2019t have to write complex code to preprocess your data.
By default, DnaTokenizer
uses the standard alphabet. If nmers
is greater than 1
, or codon
is set to True
, it will instead use the streamline alphabet.
MultiMolecule provides a set of predefined alphabets for tokenization.
"},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer","title":"multimolecule.tokenisers.DnaTokenizer","text":" Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard DNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
iupac
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace U with T.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import DnaTokenizer\n>>> tokenizer = DnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHV.X*-')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = DnaTokenizer(replace_U_with_T=False)\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = DnaTokenizer(nmers=3)\n>>> tokenizer('tataaagta')[\"input_ids\"]\n[1, 84, 21, 81, 6, 8, 19, 71, 2]\n>>> tokenizer = DnaTokenizer(codon=True)\n>>> tokenizer('tataaagta')[\"input_ids\"]\n[1, 84, 6, 71, 2]\n>>> tokenizer('tataaagtaa')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
Pythonclass DnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for DNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `iupac`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_U_with_T: Whether to replace U with T.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import DnaTokenizer\n >>> tokenizer = DnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHV.X*-')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = DnaTokenizer(replace_U_with_T=False)\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = DnaTokenizer(nmers=3)\n >>> tokenizer('tataaagta')[\"input_ids\"]\n [1, 84, 21, 81, 6, 8, 19, 71, 2]\n >>> tokenizer = DnaTokenizer(codon=True)\n >>> tokenizer('tataaagta')[\"input_ids\"]\n [1, 84, 6, 71, 2]\n >>> tokenizer('tataaagtaa')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_U_with_T: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_U_with_T=replace_U_with_T,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_U_with_T = replace_U_with_T\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_U_with_T:\n text = text.replace(\"U\", \"T\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(codon)","title":"codon
","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(replace_U_with_T)","title":"replace_U_with_T
","text":""},{"location":"tokenisers/dna/#multimolecule.tokenisers.DnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"tokenisers/dna/#standard-alphabet","title":"Standard Alphabet","text":"The standard alphabet is an extended version of the IUPAC alphabet. This extension includes two additional symbols to the IUPAC alphabet, X
and *
.
X
: Any base; is slightly different from N
which represents Unknown base. In automatic word embedding conversion, the X
will be initialized as the mean of A
, C
, G
, and T
, while N
will not be further processed.*
: is not used in MultiMolecule and is reserved for future use.gap
Note that we use .
to represent a gap in the sequence.
While -
exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.
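The sketch below shows what that embedding conversion amounts to. The token ids follow the DnaTokenizer example above (A=6, C=7, G=8, T=9, X=22); the embedding matrix and its size are hypothetical:
Python
import torch

embeddings = torch.randn(25, 16)  # hypothetical vocab_size x hidden_size
a, c, g, t, x = 6, 7, 8, 9, 22    # ids from the DnaTokenizer example above

# X starts as the mean of the four canonical bases; N is left untouched
embeddings[x] = embeddings[[a, c, g, t]].mean(dim=0)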
IUPAC nucleotide code is a standard nucleotide code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent DNA sequences.
It consists of 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap in addition to the streamline alphabet.
Code Represents A Adenine C Cytosine G Guanine T Thymine R A or G Y C or T S C or G W A or T K G or T M A or C B C, G, or T D A, G, or T H A, C, or T V A, C, or G N A, C, G, or T . GapNote that we use .
to represent a gap in the sequence.
The streamline alphabet includes one additional symbol to the nucleobase alphabet, N
to represent unknown nucleobase.
The nucleobase alphabet is a minimal version of the DNA alphabet that includes only the four canonical nucleotides A
, C
, G
, and T
.
DotBracketTokenizer provides a simple way to tokenize secondary structure in dot-bracket notation. It also supports tokenization into nmers and codons, so you don\u2019t have to write complex code to preprocess your data.
By default, DotBracketTokenizer
uses the standard alphabet. If nmers
is greater than 1
, or codon
is set to True
, it will instead use the streamline alphabet.
MultiMolecule provides a set of predefined alphabets for tokenization.
"},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer","title":"multimolecule.tokenisers.DotBracketTokenizer","text":" Bases: Tokenizer
Tokenizer for Secondary Structure sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard Secondary Structure alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
iupac
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
Examples:
Python Console Session>>> from multimolecule import DotBracketTokenizer\n>>> tokenizer = DotBracketTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]\n>>> tokenizer('(.)')[\"input_ids\"]\n[1, 7, 6, 8, 2]\n>>> tokenizer('+(.)')[\"input_ids\"]\n[1, 9, 7, 6, 8, 2]\n>>> tokenizer = DotBracketTokenizer(nmers=3)\n>>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n[1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]\n>>> tokenizer = DotBracketTokenizer(codon=True)\n>>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n[1, 27, 29, 6, 6, 6, 16, 48, 2]\n>>> tokenizer('(((((+...........)))))')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22\n
Source code in multimolecule/tokenisers/dot_bracket/tokenization_db.py
Pythonclass DotBracketTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for Secondary Structure sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard Secondary Structure alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `iupac`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n\n Examples:\n >>> from multimolecule import DotBracketTokenizer\n >>> tokenizer = DotBracketTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]\n >>> tokenizer('(.)')[\"input_ids\"]\n [1, 7, 6, 8, 2]\n >>> tokenizer('+(.)')[\"input_ids\"]\n [1, 9, 7, 6, 8, 2]\n >>> tokenizer = DotBracketTokenizer(nmers=3)\n >>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n [1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]\n >>> tokenizer = DotBracketTokenizer(codon=True)\n >>> tokenizer('(((((+..........)))))')[\"input_ids\"]\n [1, 27, 29, 6, 6, 6, 16, 48, 2]\n >>> tokenizer('(((((+...........)))))')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer(nmers)","title":"nmers
","text":""},{"location":"tokenisers/dot_bracket/#multimolecule.tokenisers.DotBracketTokenizer(codon)","title":"codon
","text":""},{"location":"tokenisers/dot_bracket/#standard-alphabet","title":"Standard Alphabet","text":"The standard alphabet is an extended version of the Extended Dot-Bracket Notation. This extension includes most symbols from the WUSS notation for better compatibility with existing tools.
Code Represents . unpaired ( internal helices of all terminal stems ) internal helices of all terminal stems + nick between strand , unpaired in multibranch loops [ internal helices that includes at least one annotated () stem ] internal helices that includes at least one annotated () stem { all internal helices of deeper multifurcations } all internal helices of deeper multifurcations | mostly paired < simple terminal stems > simple terminal stems - bulges and interior loops _ unpaired : single stranded in the exterior loop ~ local structural alignment left regions of target and query unaligned $ Not Used @ Not Used ^ Not Used % Not Used * Not Used"},{"location":"tokenisers/dot_bracket/#extended-alphabet","title":"Extended Alphabet","text":"Extended Dot-Bracket Notation is a more generalized version of the original Dot-Bracket notation may use additional pairs of brackets for annotating pseudo-knots, since different pairs of brackets are not required to be nested.
Code Represents . unpaired ( internal helices of all terminal stems ) internal helices of all terminal stems + nick between strand , unpaired in multibranch loops [ internal helices that includes at least one annotated () stem ] internal helices that includes at least one annotated () stem { all internal helices of deeper multifurcations } all internal helices of deeper multifurcations | mostly paired < simple terminal stems > simple terminal stemsNote that we use .
to represent a gap in the sequence.
The streamline alphabet includes one additional symbol to the nucleobase alphabet, N
to represent unknown nucleobase.
ProteinTokenizer is smart, it tokenizes raw amino acids into tokens, no matter if the input is in uppercase or lowercase, and with or without special tokens.
By default, ProteinTokenizer
uses the standard alphabet.
MultiMolecule provides a set of predefined alphabets for tokenization.
"},{"location":"tokenisers/protein/#multimolecule.tokenisers.ProteinTokenizer","title":"multimolecule.tokenisers.ProteinTokenizer","text":" Bases: Tokenizer
Tokenizer for Protein sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
iupac
streamline
None
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import ProteinTokenizer\n>>> tokenizer = ProteinTokenizer()\n>>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')[\"input_ids\"]\n[1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]\n>>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]\n>>> tokenizer('manlgcwmlv')[\"input_ids\"]\n[1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]\n
Source code in multimolecule/tokenisers/protein/tokenization_protein.py
Pythonclass ProteinTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for Protein sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `iupac`\n + `streamline`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import ProteinTokenizer\n >>> tokenizer = ProteinTokenizer()\n >>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')[\"input_ids\"]\n [1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]\n >>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]\n >>> tokenizer('manlgcwmlv')[\"input_ids\"]\n [1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet)\n super().__init__(\n alphabet=alphabet,\n additional_special_tokens=additional_special_tokens,\n do_upper_case=do_upper_case,\n **kwargs,\n )\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n return list(text)\n
"},{"location":"tokenisers/protein/#multimolecule.tokenisers.ProteinTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"tokenisers/protein/#multimolecule.tokenisers.ProteinTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"tokenisers/protein/#standard-alphabet","title":"Standard Alphabet","text":"The standard alphabet is an extended version of the IUPAC alphabet. This extension includes six additional symbols to the IUPAC alphabet, J
, U
, O
, .
, -
, and *
.
J
: Xle; Leucine (L) or Isoleucine (I)U
: Sec; SelenocysteineO
: Pyl; Pyrrolysine.
: is not used in MultiMolecule and is reserved for future use.-
: is not used in MultiMolecule and is reserved for future use.*
: is not used in MultiMolecule and is reserved for future use.IUPAC amino acid code is a standard amino acid code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent Protein sequences.
The IUPAC amino acid code consists of three additional symbols to Streamline Alphabet, B
, Z
, and X
.
The streamline alphabet is a simplified version of the standard alphabet.
Amino Acid Code Three letter Code Amino Acid A Ala Alanine C Cys Cysteine D Asp Aspartic Acid E Glu Glutamic Acid F Phe Phenylalanine G Gly Glycine H His Histidine I Ile Isoleucine K Lys Lysine L Leu Leucine M Met Methionine N Asn Asparagine P Pro Proline Q Gln Glutamine R Arg Arginine S Ser Serine T Thr Threonine V Val Valine W Trp Tryptophan Y Tyr Tyrosine X Xaa Any amino acid"},{"location":"tokenisers/rna/","title":"RnaTokenizer","text":"RnaTokenizer is smart, it tokenizes raw RNA nucleotides into tokens, no matter if the input is in uppercase or lowercase, uses U (Uracil) or U (Thymine), and with or without special tokens. It also supports tokenization into nmers and codons, so you don\u2019t have to write complex code to preprocess your data.
By default, RnaTokenizer
uses the standard alphabet. If nmers
is greater than 1
, or codon
is set to True
, it will instead use the streamline alphabet.
MultiMolecule provides a set of predefined alphabets for tokenization.
"},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer","title":"multimolecule.tokenisers.RnaTokenizer","text":" Bases: Tokenizer
Tokenizer for RNA sequences.
Parameters:
Name Type Description DefaultAlphabet | str | List[str] | None
alphabet to use for tokenization.
None
, the standard RNA alphabet will be used.string
, it should correspond to the name of a predefined alphabet. The options includestandard
extended
streamline
nucleobase
None
int
Size of kmer to tokenize.
1
bool
Whether to tokenize into codons.
False
bool
Whether to replace T with U.
True
bool
Whether to convert input to uppercase.
True
Examples:
Python Console Session>>> from multimolecule import RnaTokenizer\n>>> tokenizer = RnaTokenizer()\n>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n>>> tokenizer('acgu')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 9, 2]\n>>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n>>> tokenizer('acgt')[\"input_ids\"]\n[1, 6, 7, 8, 3, 2]\n>>> tokenizer = RnaTokenizer(nmers=3)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 17, 64, 49, 96, 84, 22, 2]\n>>> tokenizer = RnaTokenizer(codon=True)\n>>> tokenizer('uagcuuauc')[\"input_ids\"]\n[1, 83, 49, 22, 2]\n>>> tokenizer('uagcuuauca')[\"input_ids\"]\nTraceback (most recent call last):\nValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n
Source code in multimolecule/tokenisers/rna/tokenization_rna.py
Pythonclass RnaTokenizer(Tokenizer):\n \"\"\"\n Tokenizer for RNA sequences.\n\n Args:\n alphabet: alphabet to use for tokenization.\n\n - If is `None`, the standard RNA alphabet will be used.\n - If is a `string`, it should correspond to the name of a predefined alphabet. The options include\n + `standard`\n + `extended`\n + `streamline`\n + `nucleobase`\n - If is an alphabet or a list of characters, that specific alphabet will be used.\n nmers: Size of kmer to tokenize.\n codon: Whether to tokenize into codons.\n replace_T_with_U: Whether to replace T with U.\n do_upper_case: Whether to convert input to uppercase.\n\n Examples:\n >>> from multimolecule import RnaTokenizer\n >>> tokenizer = RnaTokenizer()\n >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')[\"input_ids\"]\n [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]\n >>> tokenizer('acgu')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 9, 2]\n >>> tokenizer = RnaTokenizer(replace_T_with_U=False)\n >>> tokenizer('acgt')[\"input_ids\"]\n [1, 6, 7, 8, 3, 2]\n >>> tokenizer = RnaTokenizer(nmers=3)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 17, 64, 49, 96, 84, 22, 2]\n >>> tokenizer = RnaTokenizer(codon=True)\n >>> tokenizer('uagcuuauc')[\"input_ids\"]\n [1, 83, 49, 22, 2]\n >>> tokenizer('uagcuuauca')[\"input_ids\"]\n Traceback (most recent call last):\n ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10\n \"\"\"\n\n model_input_names = [\"input_ids\", \"attention_mask\"]\n\n def __init__(\n self,\n alphabet: Alphabet | str | List[str] | None = None,\n nmers: int = 1,\n codon: bool = False,\n replace_T_with_U: bool = True,\n do_upper_case: bool = True,\n additional_special_tokens: List | Tuple | None = None,\n **kwargs,\n ):\n if codon and (nmers > 1 and nmers != 3):\n raise ValueError(\"Codon and nmers cannot be used together.\")\n if codon:\n nmers = 3 # set to 3 to get correct vocab\n if not isinstance(alphabet, Alphabet):\n alphabet = get_alphabet(alphabet, nmers=nmers)\n super().__init__(\n alphabet=alphabet,\n nmers=nmers,\n codon=codon,\n replace_T_with_U=replace_T_with_U,\n do_upper_case=do_upper_case,\n additional_special_tokens=additional_special_tokens,\n **kwargs,\n )\n self.replace_T_with_U = replace_T_with_U\n self.nmers = nmers\n self.codon = codon\n\n def _tokenize(self, text: str, **kwargs):\n if self.do_upper_case:\n text = text.upper()\n if self.replace_T_with_U:\n text = text.replace(\"T\", \"U\")\n if self.codon:\n if len(text) % 3 != 0:\n raise ValueError(\n f\"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}\"\n )\n return [text[i : i + 3] for i in range(0, len(text), 3)]\n if self.nmers > 1:\n return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)] # noqa: E203\n return list(text)\n
"},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(alphabet)","title":"alphabet
","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(nmers)","title":"nmers
","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(codon)","title":"codon
","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(replace_T_with_U)","title":"replace_T_with_U
","text":""},{"location":"tokenisers/rna/#multimolecule.tokenisers.RnaTokenizer(do_upper_case)","title":"do_upper_case
","text":""},{"location":"tokenisers/rna/#standard-alphabet","title":"Standard Alphabet","text":"The standard alphabet is an extended version of the IUPAC alphabet. This extension includes three additional symbols to the IUPAC alphabet, I
, X
and *
.
I
: Inosine; is a post-transcriptional modification that is not a standard RNA base. Inosine is the result of a deamination reaction of adenines that is catalyzed by adenosine deaminases acting on tRNAs (ADATs)X
: Any base; is slightly different from N
which represents Unknown base. In automatic word embedding conversion, the X
will be initialized as the mean of A
, C
, G
, and U
, while N
will not be further processed.*
: is not used in MultiMolecule and is reserved for future use.gap
Note that we use .
to represent a gap in the sequence.
While -
exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.
IUPAC nucleotide code is a standard nucleotide code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent RNA sequences.
It consists of 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap in addition to the streamline alphabet.
Code Represents A Adenine C Cytosine G Guanine U Uracil R A or G Y C or U S G or C W A or U K G or U M A or C B C, G, or U D A, G, or U H A, C, or U V A, C, or G N A, C, G, or U . GapNote that we use .
to represent a gap in the sequence.
The streamline alphabet includes one additional symbol to the nucleobase alphabet, N
to represent unknown nucleobase.
The nucleobase alphabet is a minimal version of the RNA alphabet that includes only the four canonical nucleotides A
, C
, G
, and U
.
\u200b\u4f7f\u7528\u200b\u673a\u5668\u200b\u5b66\u4e60\u200b\u52a0\u901f\u200b\u5206\u5b50\u751f\u7269\u5b66\u200b\u7814\u7a76\u200b
"},{"location":"zh/#_1","title":"\u4ecb\u7ecd","text":"
\u200b\u6b22\u8fce\u200b\u6765\u5230\u200b MultiMolecule (\u200b\u6d66\u539f\u200b)\uff0c\u200b\u8fd9\u662f\u200b\u4e00\u6b3e\u200b\u57fa\u7840\u200b\u5e93\u200b\uff0c\u200b\u65e8\u5728\u200b\u901a\u8fc7\u200b\u673a\u5668\u200b\u5b66\u4e60\u200b\u52a0\u901f\u200b\u5206\u5b50\u751f\u7269\u5b66\u200b\u7684\u200b\u79d1\u7814\u200b\u8fdb\u5c55\u200b\u3002 MultiMolecule \u200b\u63d0\u4f9b\u200b\u4e86\u200b\u4e00\u5957\u200b\u5168\u9762\u200b\u4e14\u200b\u7075\u6d3b\u200b\u7684\u200b\u5de5\u5177\u200b\uff0c\u200b\u5e2e\u52a9\u200b\u7814\u7a76\u200b\u4eba\u5458\u200b\u8f7b\u677e\u200b\u5229\u7528\u200b AI\uff0c\u200b\u4e3b\u8981\u200b\u805a\u7126\u200b\u4e8e\u200b\u751f\u7269\u200b\u5206\u5b50\u200b\u6570\u636e\u200b\uff08RNA\u3001DNA \u200b\u548c\u200b\u86cb\u767d\u8d28\u200b\uff09\u3002
"},{"location":"zh/#_2","title":"\u6982\u89c8","text":"MultiMolecule \u200b\u4ee5\u200b\u7075\u6d3b\u6027\u200b\u548c\u200b\u6613\u7528\u6027\u200b\u4e3a\u200b\u8bbe\u8ba1\u200b\u6838\u5fc3\u200b\u3002 \u200b\u5176\u200b\u6a21\u5757\u5316\u200b\u8bbe\u8ba1\u200b\u5141\u8bb8\u200b\u60a8\u200b\u6839\u636e\u200b\u9700\u8981\u200b\u4ec5\u200b\u4f7f\u7528\u200b\u6240\u200b\u9700\u200b\u7684\u200b\u7ec4\u4ef6\u200b\uff0c\u200b\u5e76\u200b\u80fd\u200b\u65e0\u7f1d\u200b\u96c6\u6210\u200b\u5230\u200b\u73b0\u6709\u200b\u7684\u200b\u5de5\u4f5c\u200b\u6d41\u7a0b\u200b\u4e2d\u200b\uff0c\u200b\u800c\u200b\u4e0d\u4f1a\u200b\u589e\u52a0\u200b\u4e0d\u5fc5\u8981\u200b\u7684\u200b\u590d\u6742\u6027\u200b\u3002
data
\uff1a\u200b\u667a\u80fd\u200b\u7684\u200b Dataset
\uff0c\u200b\u80fd\u591f\u200b\u81ea\u52a8\u200b\u63a8\u65ad\u200b\u4efb\u52a1\u200b\uff0c\u200b\u5305\u62ec\u200b\u4efb\u52a1\u200b\u7684\u200b\u5c42\u7ea7\u200b\uff08\u200b\u5e8f\u5217\u200b\u3001\u200b\u4ee4\u724c\u200b\u3001\u200b\u63a5\u89e6\u200b\uff09\u200b\u548c\u200b\u7c7b\u578b\u200b\uff08\u200b\u5206\u7c7b\u200b\u3001\u200b\u56de\u5f52\u200b\uff09\u3002\u200b\u8fd8\u200b\u63d0\u4f9b\u200b\u591a\u4efb\u52a1\u200b\u6570\u636e\u200b\u96c6\u200b\u548c\u200b\u91c7\u6837\u5668\u200b\uff0c\u200b\u7b80\u5316\u200b\u591a\u4efb\u52a1\u200b\u5b66\u4e60\u200b\uff0c\u200b\u65e0\u9700\u200b\u989d\u5916\u200b\u914d\u7f6e\u200b\u3002datasets
\uff1a\u200b\u5e7f\u6cdb\u200b\u4f7f\u7528\u200b\u7684\u200b\u751f\u7269\u200b\u5206\u5b50\u200b\u6570\u636e\u200b\u96c6\u200b\u96c6\u5408\u200b\u3002module
\uff1a\u200b\u6a21\u5757\u5316\u200b\u795e\u7ecf\u7f51\u7edc\u200b\u6784\u5efa\u200b\u5757\u200b\uff0c\u200b\u5305\u62ec\u200b\u5d4c\u5165\u200b\u5c42\u200b\u3001\u200b\u9884\u6d4b\u200b\u5934\u200b\u548c\u200b\u635f\u5931\u200b\u51fd\u6570\u200b\uff0c\u200b\u7528\u4e8e\u200b\u6784\u5efa\u200b\u81ea\u5b9a\u4e49\u200b\u6a21\u578b\u200b\u3002models
\uff1a\u200b\u5206\u5b50\u751f\u7269\u5b66\u200b\u9886\u57df\u200b\u7684\u200b\u6700\u200b\u5148\u8fdb\u200b\u9884\u200b\u8bad\u7ec3\u200b\u6a21\u578b\u200b\u5b9e\u73b0\u200b\u3002tokenisers
\uff1a\u200b\u7528\u4e8e\u200b\u5c06\u200b DNA\u3001RNA\u3001\u200b\u86cb\u767d\u8d28\u200b\u53ca\u5176\u200b\u4ed6\u200b\u5e8f\u5217\u200b\u8f6c\u6362\u200b\u4e3a\u200b\u72ec\u70ed\u200b\u7f16\u7801\u200b\u7684\u200b\u5206\u8bcd\u5668\u200b\u3002\u200b\u4ece\u200b PyPI \u200b\u5b89\u88c5\u200b\u6700\u65b0\u200b\u7684\u200b\u7a33\u5b9a\u200b\u7248\u672c\u200b\uff1a
Bashpip install multimolecule\n
\u200b\u4ece\u200b\u6e90\u4ee3\u7801\u200b\u5b89\u88c5\u200b\u6700\u65b0\u200b\u7248\u672c\u200b\uff1a
Bashpip install git+https://github.com/DLS5-Omics/MultiMolecule\n
"},{"location":"zh/#_4","title":"\u5f15\u7528","text":"\u200b\u5982\u679c\u200b\u60a8\u200b\u5728\u200b\u7814\u7a76\u200b\u4e2d\u200b\u4f7f\u7528\u200b MultiMolecule\uff0c\u200b\u8bf7\u200b\u6309\u7167\u200b\u4ee5\u4e0b\u200b\u65b9\u5f0f\u200b\u5f15\u7528\u200b\u6211\u4eec\u200b\uff1a
BibTeX@software{chen_2024_12638419,\n author = {Chen, Zhiyuan and Zhu, Sophia Y.},\n title = {MultiMolecule},\n doi = {10.5281/zenodo.12638419},\n publisher = {Zenodo},\n url = {https://doi.org/10.5281/zenodo.12638419},\n year = 2024,\n month = may,\n day = 4\n}\n
"},{"location":"zh/#_5","title":"\u8bb8\u53ef\u8bc1","text":"\u200b\u6211\u4eec\u200b\u76f8\u4fe1\u200b\u5f00\u653e\u200b\u662f\u200b\u7814\u7a76\u200b\u7684\u200b\u57fa\u7840\u200b\u3002
MultiMolecule \u200b\u5728\u200b GNU Affero \u200b\u901a\u7528\u200b\u516c\u5171\u200b\u8bb8\u53ef\u8bc1\u200b \u200b\u4e0b\u200b\u6388\u6743\u200b\u3002
\u200b\u8bf7\u200b\u52a0\u5165\u200b\u6211\u4eec\u200b\uff0c\u200b\u5171\u540c\u200b\u5efa\u7acb\u200b\u4e00\u4e2a\u200b\u5f00\u653e\u200b\u7684\u200b\u7814\u7a76\u200b\u793e\u533a\u200b\u3002
SPDX-License-Identifier: AGPL-3.0-or-later
\u200b\u7531\u4e39\u7075\u200b\u5728\u200b\u5730\u7403\u200b\u5f00\u53d1\u200b
\u200b\u6211\u4eec\u200b\u662f\u200b\u4e00\u4e2a\u200b\u7531\u200b\u5f00\u53d1\u8005\u200b\u3001\u200b\u8bbe\u8ba1\u200b\u4eba\u5458\u200b\u548c\u200b\u5176\u4ed6\u200b\u4eba\u5458\u200b\u7ec4\u6210\u200b\u7684\u200b\u793e\u533a\u200b\uff0c\u200b\u81f4\u529b\u4e8e\u200b\u8ba9\u200b\u6df1\u5ea6\u200b\u5b66\u4e60\u200b\u6280\u672f\u200b\u66f4\u52a0\u200b\u5f00\u653e\u200b\u3002
\u200b\u6211\u4eec\u200b\u662f\u200b\u4e00\u4e2a\u200b\u7531\u200b\u4e2a\u4f53\u200b\u7ec4\u6210\u200b\u7684\u200b\u793e\u533a\u200b\uff0c\u200b\u81f4\u529b\u4e8e\u200b\u63a8\u52a8\u200b\u6df1\u5ea6\u200b\u5b66\u4e60\u200b\u7684\u200b\u53ef\u80fd\u6027\u200b\u8fb9\u754c\u200b\u3002
\u200b\u6211\u4eec\u200b\u5bf9\u200b\u6df1\u5ea6\u200b\u5b66\u4e60\u200b\u53ca\u5176\u200b\u7528\u6237\u200b\u5145\u6ee1\u200b\u6fc0\u60c5\u200b\u3002
\u200b\u6211\u4eec\u200b\u662f\u200b\u4e39\u7075\u200b\u3002
"},{"location":"zh/about/license-faq/","title":"License FAQ","text":"\u200b\u7ffb\u8bd1\u200b
\u200b\u672c\u6587\u200b\u5185\u5bb9\u200b\u4e3a\u200b\u7ffb\u8bd1\u200b\u7248\u672c\u200b\uff0c\u200b\u65e8\u5728\u200b\u4e3a\u200b\u7528\u6237\u200b\u63d0\u4f9b\u65b9\u4fbf\u200b\u3002 \u200b\u6211\u4eec\u200b\u5df2\u7ecf\u200b\u5c3d\u529b\u200b\u786e\u4fdd\u200b\u7ffb\u8bd1\u200b\u7684\u200b\u51c6\u786e\u6027\u200b\u3002 \u200b\u4f46\u200b\u8bf7\u200b\u6ce8\u610f\u200b\uff0c\u200b\u7ffb\u8bd1\u200b\u5185\u5bb9\u200b\u53ef\u80fd\u200b\u5305\u542b\u200b\u9519\u8bef\u200b\uff0c\u200b\u4ec5\u4f9b\u53c2\u8003\u200b\u3002 \u200b\u8bf7\u4ee5\u200b\u82f1\u6587\u200b\u539f\u6587\u200b\u4e3a\u51c6\u200b\u3002
\u200b\u4e3a\u200b\u6ee1\u8db3\u200b\u5408\u89c4\u6027\u200b\u4e0e\u200b\u6267\u6cd5\u200b\u8981\u6c42\u200b\uff0c\u200b\u7ffb\u8bd1\u200b\u6587\u6863\u200b\u4e2d\u200b\u7684\u200b\u4efb\u4f55\u200b\u4e0d\u200b\u51c6\u786e\u200b\u6216\u200b\u6b67\u4e49\u200b\u4e4b\u5904\u200b\u5747\u200b\u4e0d\u200b\u5177\u6709\u200b\u7ea6\u675f\u529b\u200b\uff0c\u200b\u4e5f\u200b\u4e0d\u200b\u5177\u5907\u200b\u6cd5\u5f8b\u6548\u529b\u200b\u3002
"},{"location":"zh/about/license-faq/#_1","title":"\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54","text":"\u200b\u672c\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\u89e3\u91ca\u200b\u4e86\u200b\u60a8\u200b\u53ef\u4ee5\u200b\u5728\u200b\u4f55\u79cd\u200b\u6761\u4ef6\u200b\u4e0b\u200b\u4f7f\u7528\u200b\u7531\u4e39\u7075\u200b\u56e2\u961f\u200b\uff08\u200b\u4e5f\u200b\u79f0\u4e3a\u200b\u4e39\u7075\u200b\uff09\uff08\u201c\u200b\u6211\u4eec\u200b\u201d\u200b\u6216\u200b\u201c\u200b\u6211\u4eec\u200b\u7684\u200b\u201d\uff09\u200b\u63d0\u4f9b\u200b\u7684\u200b\u6570\u636e\u200b\u3001\u200b\u6a21\u578b\u200b\u3001\u200b\u4ee3\u7801\u200b\u3001\u200b\u914d\u7f6e\u200b\u3001\u200b\u6587\u6863\u200b\u548c\u200b\u6743\u91cd\u200b\u3002 \u200b\u5b83\u200b\u4f5c\u4e3a\u200b\u6211\u4eec\u200b\u7684\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7684\u200b\u9644\u52a0\u6587\u4ef6\u200b\u3002
"},{"location":"zh/about/license-faq/#0","title":"0. \u200b\u5173\u952e\u70b9\u200b\u603b\u7ed3","text":"\u200b\u672c\u200b\u603b\u7ed3\u200b\u63d0\u4f9b\u200b\u4e86\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\u7684\u200b\u5173\u952e\u70b9\u200b\uff0c\u200b\u4f46\u200b\u60a8\u200b\u53ef\u4ee5\u200b\u901a\u8fc7\u200b\u70b9\u51fb\u200b\u6bcf\u4e2a\u200b\u5173\u952e\u70b9\u200b\u540e\u200b\u7684\u200b\u94fe\u63a5\u200b\u6216\u200b\u4f7f\u7528\u200b\u76ee\u5f55\u200b\u6765\u200b\u627e\u5230\u200b\u60a8\u200b\u6240\u200b\u67e5\u627e\u200b\u7684\u200b\u90e8\u5206\u200b\u4ee5\u200b\u4e86\u89e3\u200b\u66f4\u200b\u591a\u200b\u8be6\u60c5\u200b\u3002
\u200b\u5728\u200b MultiMolecule \u200b\u4e2d\u200b\uff0c\u200b\u4ec0\u4e48\u200b\u6784\u6210\u200b\u4e86\u200b\u201c\u200b\u6e90\u4ee3\u7801\u200b\u201d\uff1f
\u200b\u6211\u4eec\u200b\u8ba4\u4e3a\u200b\u6211\u4eec\u200b\u5b58\u50a8\u200b\u5e93\u4e2d\u200b\u7684\u200b\u6240\u6709\u200b\u5185\u5bb9\u200b\u90fd\u200b\u662f\u200b\u6e90\u4ee3\u7801\u200b\uff0c\u200b\u5305\u62ec\u200b\u6570\u636e\u200b\u3001\u200b\u6a21\u578b\u200b\u3001\u200b\u4ee3\u7801\u200b\u3001\u200b\u914d\u7f6e\u200b\u548c\u200b\u6587\u6863\u200b\u3002
\u200b\u5728\u200bMultiMolecule\u200b\u4e2d\u200b\uff0c\u200b\u4ec0\u4e48\u200b\u6784\u6210\u200b\u4e86\u200b\u201c\u200b\u6e90\u4ee3\u7801\u200b\u201d\uff1f
\u200b\u6211\u200b\u53ef\u4ee5\u200b\u4f7f\u7528\u200b MultiMolecule \u200b\u53d1\u8868\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u5417\u200b\uff1f
\u200b\u89c6\u200b\u60c5\u51b5\u200b\u800c\u5b9a\u200b\u3002
\u200b\u60a8\u200b\u53ef\u4ee5\u200b\u6309\u7167\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7684\u200b\u6761\u6b3e\u200b\u5728\u200b\u5b8c\u5168\u200b\u5f00\u653e\u200b\u83b7\u53d6\u200b\u7684\u200b\u671f\u520a\u200b\u548c\u200b\u4f1a\u8bae\u200b\u6216\u9884\u200b\u5370\u672c\u200b\u670d\u52a1\u5668\u200b\u4e0a\u200b\u53d1\u8868\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u3002
\u200b\u8981\u200b\u5728\u200b\u5c01\u95ed\u200b\u83b7\u53d6\u200b\u7684\u200b\u671f\u520a\u200b\u548c\u200b\u4f1a\u8bae\u200b\u4e0a\u200b\u53d1\u8868\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\uff0c\u200b\u60a8\u200b\u5fc5\u987b\u200b\u4ece\u200b\u6211\u4eec\u200b\u8fd9\u91cc\u200b\u83b7\u5f97\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u3002
\u200b\u6211\u200b\u53ef\u4ee5\u200b\u4f7f\u7528\u200bMultiMolecule\u200b\u53d1\u8868\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u5417\u200b\uff1f
\u200b\u6211\u200b\u53ef\u4ee5\u200b\u5c06\u200b MultiMolecule \u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\u5417\u200b\uff1f
\u200b\u662f\u200b\u7684\u200b\uff0c\u200b\u60a8\u200b\u53ef\u4ee5\u200b\u6839\u636e\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7684\u200b\u6761\u6b3e\u200b\u5c06\u200bMultiMolecule\u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\u3002
\u200b\u6211\u200b\u53ef\u4ee5\u200b\u5c06\u200bMultiMolecule\u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\u5417\u200b\uff1f
\u200b\u4e0e\u200b\u67d0\u4e9b\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\u7684\u200b\u4eba\u200b\u662f\u5426\u200b\u6709\u200b\u7279\u5b9a\u200b\u7684\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\uff1f
\u200b\u662f\u200b\u7684\u200b\uff0c\u200b\u4e0e\u200b\u67d0\u4e9b\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\u7684\u200b\u4eba\u200b\u6709\u200b\u7279\u5b9a\u200b\u7684\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\u3002
\u200b\u4e0e\u200b\u67d0\u4e9b\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\u7684\u200b\u4eba\u200b\u662f\u5426\u200b\u6709\u200b\u7279\u5b9a\u200b\u7684\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\uff1f
"},{"location":"zh/about/license-faq/#1-multimolecule","title":"1. \u200b\u5728\u200b MultiMolecule \u200b\u4e2d\u200b\uff0c\u200b\u4ec0\u4e48\u200b\u6784\u6210\u200b\u4e86\u200b\u201c\u200b\u6e90\u4ee3\u7801\u200b\u201d\uff1f","text":"\u200b\u6211\u4eec\u200b\u8ba4\u4e3a\u200b\u6211\u4eec\u200b\u5b58\u50a8\u200b\u5e93\u4e2d\u200b\u7684\u200b\u6240\u6709\u200b\u5185\u5bb9\u200b\u90fd\u200b\u662f\u200b\u6e90\u4ee3\u7801\u200b\u3002
\u200b\u673a\u5668\u200b\u5b66\u4e60\u200b\u6a21\u578b\u200b\u7684\u200b\u8bad\u7ec3\u200b\u8fc7\u7a0b\u200b\u88ab\u200b\u89c6\u4f5c\u200b\u7c7b\u4f3c\u200b\u4e8e\u200b\u4f20\u7edf\u200b\u8f6f\u4ef6\u200b\u7684\u200b\u7f16\u8bd1\u200b\u8fc7\u7a0b\u200b\u3002\u200b\u56e0\u6b64\u200b\uff0c\u200b\u6a21\u578b\u200b\u3001\u200b\u4ee3\u7801\u200b\u3001\u200b\u914d\u7f6e\u200b\u3001\u200b\u6587\u6863\u200b\u548c\u200b\u7528\u4e8e\u200b\u8bad\u7ec3\u200b\u7684\u200b\u6570\u636e\u200b\u90fd\u200b\u88ab\u200b\u89c6\u4e3a\u200b\u6e90\u4ee3\u7801\u200b\u7684\u200b\u4e00\u90e8\u5206\u200b\uff0c\u200b\u800c\u200b\u8bad\u7ec3\u200b\u51fa\u200b\u7684\u200b\u6a21\u578b\u200b\u6743\u91cd\u200b\u5219\u200b\u88ab\u200b\u89c6\u4e3a\u200b\u76ee\u6807\u200b\u4ee3\u7801\u200b\u7684\u200b\u4e00\u90e8\u5206\u200b\u3002
\u200b\u6211\u4eec\u200b\u8fd8\u200b\u5c06\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u548c\u200b\u624b\u7a3f\u200b\u89c6\u4e3a\u200b\u4e00\u79cd\u200b\u7279\u6b8a\u200b\u7684\u200b\u6587\u6863\u200b\u5f62\u5f0f\u200b\uff0c\u200b\u5b83\u4eec\u200b\u4e5f\u200b\u662f\u200b\u6e90\u4ee3\u7801\u200b\u7684\u200b\u4e00\u90e8\u5206\u200b\u3002
"},{"location":"zh/about/license-faq/#2-multimolecule","title":"2 \u200b\u6211\u200b\u53ef\u4ee5\u200b\u4f7f\u7528\u200b MultiMolecule \u200b\u53d1\u8868\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u5417\u200b\uff1f","text":"\u200b\u7531\u4e8e\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u88ab\u200b\u89c6\u4e3a\u200b\u6e90\u4ee3\u7801\u200b\u7684\u200b\u4e00\u79cd\u200b\u5f62\u5f0f\u200b\uff0c\u200b\u5982\u679c\u200b\u53d1\u8868\u200b\u4f7f\u7528\u200b MultiMolecule \u200b\u7684\u200b\u8bba\u6587\u200b\uff0c\u200b\u51fa\u7248\u5546\u200b\u5fc5\u987b\u200b\u5f00\u6e90\u200b\u5176\u200b\u670d\u52a1\u5668\u200b\u4e0a\u200b\u7684\u200b\u6240\u6709\u200b\u6750\u6599\u200b\uff0c\u200b\u4ee5\u200b\u7b26\u5408\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7684\u200b\u8981\u6c42\u200b\u3002\u200b\u5bf9\u4e8e\u200b\u5927\u591a\u6570\u200b\u51fa\u7248\u5546\u200b\u6765\u8bf4\u200b\uff0c\u200b\u8fd9\u662f\u200b\u4e0d\u5207\u5b9e\u9645\u200b\u7684\u200b\u3002
\u200b\u4f5c\u4e3a\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7b2c\u200b 7 \u200b\u6761\u200b\u7684\u200b\u7279\u522b\u200b\u8c41\u514d\u200b\uff0c\u200b\u6211\u4eec\u200b\u5141\u8bb8\u200b\u5728\u200b\u4e0d\u200b\u5411\u200b\u4f5c\u8005\u200b\u6536\u53d6\u200b\u4efb\u4f55\u200b\u8d39\u7528\u200b\u7684\u200b\u5b8c\u5168\u200b\u5f00\u653e\u200b\u83b7\u53d6\u200b\u7684\u200b\u671f\u520a\u200b\u3001\u200b\u4f1a\u8bae\u200b\u6216\u9884\u200b\u5370\u672c\u200b\u670d\u52a1\u5668\u200b\u4e0a\u200b\u53d1\u8868\u200b\u4f7f\u7528\u200b MultiMolecule \u200b\u7684\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\uff0c\u200b\u524d\u63d0\u200b\u662f\u200b\u6240\u6709\u200b\u53d1\u8868\u200b\u7684\u200b\u624b\u7a3f\u200b\u90fd\u200b\u5e94\u200b\u6309\u7167\u200b\u5141\u8bb8\u200b\u5171\u4eab\u200b\u624b\u7a3f\u200b\u7684\u200bGNU \u200b\u81ea\u7531\u200b\u6587\u6863\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\uff08GFDL\uff09\u200b\u6216\u200b\u77e5\u8bc6\u200b\u5171\u4eab\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u6216\u200bOSI \u200b\u6279\u51c6\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u63d0\u4f9b\u200b\u3002
\u200b\u4f5c\u4e3a\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7b2c\u200b 7 \u200b\u6761\u200b\u7684\u200b\u7279\u522b\u200b\u8c41\u514d\u200b\uff0c\u200b\u6211\u4eec\u200b\u5141\u8bb8\u200b\u5728\u200b\u90e8\u5206\u200b\u975e\u76c8\u5229\u6027\u200b\u7684\u200b\u6742\u5fd7\u200b\u3001\u200b\u4f1a\u8bae\u200b\u6216\u9884\u200b\u5370\u672c\u200b\u670d\u52a1\u5668\u200b\u4e0a\u200b\u53d1\u8868\u200b\u4f7f\u7528\u200b MultiMolecule \u200b\u7684\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u3002\u200b\u76ee\u524d\u200b\uff0c\u200b\u6211\u4eec\u200b\u5141\u8bb8\u200b\u7684\u200b\u975e\u76c8\u5229\u6027\u200b\u6742\u5fd7\u200b\u3001\u200b\u4f1a\u8bae\u200b\u6216\u9884\u200b\u5370\u672c\u200b\u670d\u52a1\u5668\u200b\u5305\u62ec\u200b\uff1a
\u200b\u8981\u200b\u5728\u200b\u5c01\u95ed\u200b\u83b7\u53d6\u200b\u7684\u200b\u671f\u520a\u200b\u6216\u200b\u4f1a\u8bae\u200b\u4e0a\u200b\u53d1\u8868\u200b\u8bba\u6587\u200b\uff0c\u200b\u60a8\u200b\u5fc5\u987b\u200b\u4ece\u200b\u6211\u4eec\u200b\u8fd9\u91cc\u200b\u83b7\u5f97\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u3002\u200b\u8fd9\u200b\u901a\u5e38\u200b\u5305\u62ec\u200b\u5171\u540c\u200b\u7f72\u540d\u200b\u3001\u200b\u652f\u6301\u200b\u9879\u76ee\u200b\u7684\u200b\u8d39\u7528\u200b\u6216\u200b\u4e24\u8005\u200b\u517c\u800c\u6709\u4e4b\u200b\u3002\u200b\u8bf7\u200b\u901a\u8fc7\u200b multimolecule@zyc.ai \u200b\u4e0e\u200b\u6211\u4eec\u200b\u8054\u7cfb\u200b\u4ee5\u200b\u83b7\u53d6\u200b\u66f4\u200b\u591a\u200b\u4fe1\u606f\u200b\u3002
\u200b\u867d\u7136\u200b\u4e0d\u662f\u200b\u5f3a\u5236\u6027\u200b\u7684\u200b\uff0c\u200b\u4f46\u200b\u6211\u4eec\u200b\u5efa\u8bae\u200b\u5728\u200b\u7814\u7a76\u200b\u8bba\u6587\u200b\u4e2d\u200b\u5f15\u7528\u200b MultiMolecule \u200b\u9879\u76ee\u200b\u3002
"},{"location":"zh/about/license-faq/#3-multimolecule","title":"3. \u200b\u6211\u200b\u53ef\u4ee5\u200b\u5c06\u200b MultiMolecule \u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\u5417\u200b\uff1f","text":"\u200b\u662f\u200b\u7684\u200b\uff0c\u200b\u60a8\u200b\u53ef\u4ee5\u200b\u6839\u636e\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u5c06\u200b MultiMolecule \u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\u3002\u200b\u4f46\u662f\u200b\uff0c\u200b\u60a8\u200b\u5fc5\u987b\u200b\u5f00\u6e90\u200b\u5bf9\u200b\u6e90\u4ee3\u7801\u200b\u7684\u200b\u4efb\u4f55\u200b\u4fee\u6539\u200b\uff0c\u200b\u5e76\u200b\u4f7f\u200b\u5176\u200b\u5728\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u4e0b\u200b\u53ef\u7528\u200b\u3002
\u200b\u5982\u679c\u200b\u60a8\u200b\u5e0c\u671b\u200b\u5728\u200b\u4e0d\u200b\u5f00\u6e90\u200b\u4fee\u6539\u200b\u5185\u5bb9\u200b\u7684\u200b\u60c5\u51b5\u200b\u4e0b\u200b\u5c06\u200b MultiMolecule \u200b\u7528\u4e8e\u200b\u5546\u4e1a\u7528\u9014\u200b\uff0c\u200b\u5219\u200b\u5fc5\u987b\u200b\u4ece\u200b\u6211\u4eec\u200b\u8fd9\u91cc\u200b\u83b7\u5f97\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u3002\u200b\u8fd9\u200b\u901a\u5e38\u200b\u6d89\u53ca\u200b\u652f\u6301\u200b\u9879\u76ee\u200b\u7684\u200b\u8d39\u7528\u200b\u3002\u200b\u8bf7\u200b\u901a\u8fc7\u200b multimolecule@zyc.ai \u200b\u4e0e\u200b\u6211\u4eec\u200b\u8054\u7cfb\u200b\u4ee5\u200b\u83b7\u53d6\u200b\u66f4\u200b\u591a\u200b\u8be6\u7ec6\u4fe1\u606f\u200b\u3002
"},{"location":"zh/about/license-faq/#4","title":"4. \u200b\u4e0e\u200b\u67d0\u4e9b\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\u7684\u200b\u4eba\u200b\u662f\u5426\u200b\u6709\u200b\u7279\u5b9a\u200b\u7684\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\uff1f","text":"\u200b\u662f\u200b\u7684\u200b\uff01
\u200b\u5982\u679c\u200b\u60a8\u200b\u4e0e\u200b\u4e00\u4e2a\u200b\u4e0e\u200b\u6211\u4eec\u200b\u6709\u200b\u5355\u72ec\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u7684\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\uff0c\u200b\u60a8\u200b\u53ef\u80fd\u200b\u4f1a\u200b\u53d7\u5230\u200b\u4e0d\u540c\u200b\u7684\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\u7684\u200b\u7ea6\u675f\u200b\u3002\u200b\u8bf7\u200b\u54a8\u8be2\u200b\u60a8\u200b\u7ec4\u7ec7\u200b\u7684\u200b\u6cd5\u5f8b\u200b\u90e8\u95e8\u200b\uff0c\u200b\u4ee5\u200b\u786e\u5b9a\u200b\u60a8\u200b\u662f\u5426\u200b\u53d7\u5236\u4e8e\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u3002
\u200b\u4ee5\u4e0b\u200b\u7ec4\u7ec7\u200b\u7684\u200b\u6210\u5458\u200b\u81ea\u52a8\u200b\u83b7\u5f97\u200b\u4e00\u4e2a\u200b\u4e0d\u53ef\u200b\u8f6c\u8ba9\u200b\u3001\u200b\u4e0d\u53ef\u200b\u518d\u200b\u8bb8\u53ef\u200b\u3001\u200b\u4e0d\u53ef\u200b\u5206\u53d1\u200b\u7684\u200b MIT \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u6765\u200b\u4f7f\u7528\u200b MultiMolecule\uff1a
\u200b\u6b64\u200b\u7279\u522b\u200b\u8bb8\u53ef\u200b\u88ab\u200b\u89c6\u4e3a\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7b2c\u200b 7 \u200b\u6761\u4e2d\u200b\u7684\u200b\u9644\u52a0\u200b\u6761\u6b3e\u200b\u3002 \u200b\u5b83\u200b\u4e0d\u53ef\u200b\u518d\u200b\u5206\u53d1\u200b\uff0c\u200b\u5e76\u4e14\u200b\u60a8\u200b\u88ab\u200b\u7981\u6b62\u200b\u521b\u5efa\u200b\u4efb\u4f55\u200b\u72ec\u7acb\u200b\u7684\u200b\u884d\u751f\u200b\u4f5c\u54c1\u200b\u3002 \u200b\u57fa\u4e8e\u200b\u6b64\u200b\u8bb8\u53ef\u200b\u7684\u200b\u4efb\u4f55\u200b\u4fee\u6539\u200b\u6216\u200b\u884d\u751f\u200b\u4f5c\u54c1\u200b\u5c06\u200b\u81ea\u52a8\u200b\u88ab\u200b\u89c6\u4e3a\u200b MultiMolecule \u200b\u7684\u200b\u884d\u751f\u200b\u4f5c\u54c1\u200b\uff0c\u200b\u5fc5\u987b\u200b\u9075\u5b88\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7684\u200b\u6240\u6709\u200b\u6761\u6b3e\u200b\u3002 \u200b\u8fd9\u200b\u786e\u4fdd\u200b\u4e86\u200b\u7b2c\u4e09\u65b9\u200b\u65e0\u6cd5\u200b\u7ed5\u8fc7\u200b\u8bb8\u53ef\u200b\u6761\u6b3e\u200b\u6216\u200b\u4ece\u200b\u884d\u751f\u200b\u4f5c\u54c1\u200b\u4e2d\u200b\u521b\u5efa\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u3002
"},{"location":"zh/about/license-faq/#5-agpl-multimolecule","title":"5. \u200b\u5982\u679c\u200b\u6211\u200b\u7684\u200b\u7ec4\u7ec7\u200b\u7981\u6b62\u200b\u4f7f\u7528\u200b AGPL \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u4e0b\u200b\u7684\u200b\u4ee3\u7801\u200b\uff0c\u200b\u6211\u8be5\u200b\u5982\u4f55\u200b\u4f7f\u7528\u200b MultiMolecule\uff1f","text":"\u200b\u4e00\u4e9b\u200b\u7ec4\u7ec7\u200b\uff08\u200b\u5982\u200bGoogle\uff09\u200b\u6709\u200b\u7981\u6b62\u200b\u4f7f\u7528\u200b AGPL \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u4e0b\u200b\u4ee3\u7801\u200b\u7684\u200b\u653f\u7b56\u200b\u3002
\u200b\u5982\u679c\u200b\u60a8\u200b\u4e0e\u200b\u7981\u6b62\u200b\u4f7f\u7528\u200b AGPL \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u4ee3\u7801\u200b\u7684\u200b\u7ec4\u7ec7\u200b\u6709\u200b\u5173\u7cfb\u200b\uff0c\u200b\u60a8\u200b\u5fc5\u987b\u200b\u4ece\u200b\u6211\u4eec\u200b\u8fd9\u91cc\u200b\u83b7\u5f97\u200b\u5355\u72ec\u200b\u7684\u200b\u8bb8\u53ef\u200b\u3002\u200b\u8bf7\u200b\u901a\u8fc7\u200b multimolecule@zyc.ai \u200b\u4e0e\u200b\u6211\u4eec\u200b\u8054\u7cfb\u200b\u4ee5\u200b\u83b7\u53d6\u200b\u66f4\u200b\u591a\u200b\u8be6\u7ec6\u4fe1\u606f\u200b\u3002
"},{"location":"zh/about/license-faq/#6-multimolecule","title":"6. \u200b\u5982\u679c\u200b\u6211\u200b\u662f\u200b\u7f8e\u56fd\u8054\u90a6\u653f\u5e9c\u200b\u7684\u200b\u96c7\u5458\u200b\uff0c\u200b\u6211\u200b\u53ef\u4ee5\u200b\u4f7f\u7528\u200b MultiMolecule \u200b\u5417\u200b\uff1f","text":"\u200b\u4e0d\u80fd\u200b\u3002
\u200b\u6839\u636e\u200b17 U.S. Code \u00a7 105\uff0c\u200b\u7f8e\u56fd\u8054\u90a6\u653f\u5e9c\u200b\u96c7\u5458\u200b\u64b0\u5199\u200b\u7684\u200b\u4ee3\u7801\u200b\u4e0d\u200b\u53d7\u200b\u7248\u6743\u4fdd\u62a4\u200b\u3002
\u200b\u56e0\u6b64\u200b\uff0c\u200b\u7f8e\u56fd\u8054\u90a6\u653f\u5e9c\u200b\u96c7\u5458\u200b\u65e0\u6cd5\u200b\u9075\u5b88\u200b \u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b \u200b\u7684\u200b\u6761\u6b3e\u200b\u3002
"},{"location":"zh/about/license-faq/#7","title":"7. \u200b\u6211\u4eec\u200b\u4f1a\u200b\u66f4\u65b0\u200b\u6b64\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\u5417\u200b\uff1f","text":"\u200b\u7b80\u800c\u8a00\u4e4b\u200b
\u200b\u662f\u200b\u7684\u200b\uff0c\u200b\u6211\u4eec\u200b\u5c06\u200b\u6839\u636e\u200b\u9700\u8981\u200b\u66f4\u65b0\u200b\u6b64\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\u4ee5\u200b\u4fdd\u6301\u200b\u4e0e\u200b\u76f8\u5173\u200b\u6cd5\u5f8b\u200b\u7684\u200b\u4e00\u81f4\u200b\u3002
\u200b\u6211\u4eec\u200b\u53ef\u80fd\u200b\u4f1a\u200b\u4e0d\u65f6\u200b\u66f4\u65b0\u200b\u6b64\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\u3002 \u200b\u66f4\u65b0\u200b\u540e\u200b\u7684\u200b\u7248\u672c\u200b\u5c06\u200b\u901a\u8fc7\u200b\u66f4\u65b0\u200b\u672c\u200b\u9875\u9762\u200b\u5e95\u90e8\u200b\u7684\u200b\u201c\u200b\u6700\u540e\u200b\u4fee\u8ba2\u200b\u65f6\u95f4\u200b\u201d\u200b\u6765\u200b\u8868\u793a\u200b\u3002 \u200b\u5982\u679c\u200b\u6211\u4eec\u200b\u8fdb\u884c\u200b\u4efb\u4f55\u200b\u91cd\u5927\u200b\u66f4\u6539\u200b\uff0c\u200b\u6211\u4eec\u200b\u5c06\u200b\u901a\u8fc7\u200b\u5728\u200b\u672c\u9875\u200b\u53d1\u5e03\u200b\u65b0\u200b\u7684\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\u6765\u200b\u901a\u77e5\u200b\u60a8\u200b\u3002 \u200b\u7531\u4e8e\u200b\u6211\u4eec\u200b\u4e0d\u200b\u6536\u96c6\u200b\u60a8\u200b\u7684\u200b\u4efb\u4f55\u200b\u8054\u7cfb\u200b\u4fe1\u606f\u200b\uff0c\u200b\u6211\u4eec\u200b\u65e0\u6cd5\u200b\u76f4\u63a5\u200b\u901a\u77e5\u200b\u60a8\u200b\u3002 \u200b\u6211\u4eec\u200b\u9f13\u52b1\u200b\u60a8\u200b\u7ecf\u5e38\u200b\u67e5\u770b\u200b\u672c\u200b\u8bb8\u53ef\u200b\u534f\u8bae\u200b\u5e38\u89c1\u200b\u95ee\u9898\u89e3\u7b54\u200b\uff0c\u200b\u4ee5\u200b\u4e86\u89e3\u200b\u60a8\u200b\u53ef\u4ee5\u200b\u5982\u4f55\u200b\u4f7f\u7528\u200b\u6211\u4eec\u200b\u7684\u200b\u6570\u636e\u200b\u3001\u200b\u6a21\u578b\u200b\u3001\u200b\u4ee3\u7801\u200b\u3001\u200b\u914d\u7f6e\u200b\u3001\u200b\u6587\u6863\u200b\u548c\u200b\u6743\u91cd\u200b\u3002
"},{"location":"zh/data/","title":"data","text":"data
provides a collection of utilities for processing data.
Although datasets
is a powerful library for managing datasets, it is a general-purpose tool and may not cover all the specific needs of scientific applications.
The data
package aims to complement datasets
by providing data processing utilities commonly used in scientific tasks.
from multimolecule.data import Dataset\n\ndata = Dataset(\"data/rna/5utr.csv\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"zh/data/#datasets","title":"\u4ece\u200b datasets
\u200b\u52a0\u8f7d","text":"Pythonfrom multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"zh/datasets/","title":"datasets","text":"datasets
provides a collection of widely used datasets.
from multimolecule.data import Dataset\n\ndata = Dataset(\"multimolecule/bprna-spot\", split=\"train\", pretrained=\"multimolecule/rna\")\n
"},{"location":"zh/models/","title":"models","text":"models
provides a collection of pre-trained models.
In the transformers
library, the names of the model classes can sometimes be misleading. Although these classes support both regression and classification tasks, their names typically contain xxxForSequenceClassification
, which may imply that they can only be used for classification.
To avoid this ambiguity, MultiMolecule provides a set of model classes whose names are clear and intuitive, reflecting their intended use:
multimolecule.AutoModelForSequencePrediction
: Sequence Prediction multimolecule.AutoModelForTokenPrediction
: Token Prediction multimolecule.AutoModelForContactPrediction
: Contact Prediction Each model supports both regression and classification tasks, offering flexibility and precision for a wide range of applications.
"},{"location":"zh/models/#_2","title":"\u63a5\u89e6\u200b\u9884\u6d4b","text":"\u200b\u63a5\u89e6\u200b\u9884\u6d4b\u200b\u4e3a\u200b\u5e8f\u5217\u200b\u4e2d\u200b\u7684\u200b\u6bcf\u200b\u4e00\u5bf9\u200b\u4ee4\u724c\u200b\u5206\u914d\u200b\u4e00\u4e2a\u200b\u6807\u7b7e\u200b\u3002 \u200b\u6700\u200b\u5e38\u89c1\u200b\u7684\u200b\u63a5\u89e6\u200b\u9884\u6d4b\u200b\u4efb\u52a1\u200b\u4e4b\u4e00\u200b\u662f\u200b\u86cb\u767d\u8d28\u200b\u8ddd\u79bb\u200b\u56fe\u200b\u9884\u6d4b\u200b\u3002 \u200b\u86cb\u767d\u8d28\u200b\u8ddd\u79bb\u200b\u56fe\u200b\u9884\u6d4b\u200b\u8bd5\u56fe\u200b\u627e\u5230\u200b\u4e09\u7ef4\u200b\u86cb\u767d\u8d28\u200b\u7ed3\u6784\u200b\u4e2d\u200b\u6240\u6709\u200b\u53ef\u80fd\u200b\u7684\u200b\u6c28\u57fa\u9178\u200b\u6b8b\u57fa\u200b\u5bf9\u200b\u4e4b\u95f4\u200b\u7684\u200b\u8ddd\u79bb\u200b
"},{"location":"zh/models/#_3","title":"\u6838\u82f7\u9178\u200b\u9884\u6d4b","text":"\u200b\u4e0e\u200b Token Classification \u200b\u7c7b\u4f3c\u200b\uff0c\u200b\u4f46\u200b\u5982\u679c\u200b\u6a21\u578b\u200b\u914d\u7f6e\u200b\u4e2d\u200b\u5b9a\u4e49\u200b\u4e86\u200b <bos>
\u200b\u6216\u200b <eos>
\u200b\u4ee4\u724c\u200b\uff0c\u200b\u5219\u200b\u5c06\u200b\u5176\u200b\u79fb\u9664\u200b\u3002
<bos>
\u200b\u548c\u200b <eos>
\u200b\u4ee4\u724c\u200b
\u200b\u5728\u200b MultiMolecule \u200b\u63d0\u4f9b\u200b\u7684\u200b\u5206\u8bcd\u5668\u200b\u4e2d\u200b\uff0c<bos>
\u200b\u4ee4\u724c\u200b\u6307\u5411\u200b <cls>
\u200b\u4ee4\u724c\u200b\uff0c<sep>
\u200b\u4ee4\u724c\u200b\u6307\u5411\u200b <eos>
\u200b\u4ee4\u724c\u200b\u3002
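To see how these aliases are wired up, the sketch below inspects the special-token attributes of a tokenizer; it assumes the multimolecule/rnafm tokenizer exposes the standard special-token attributes inherited from the Transformers tokenizer base class.
Pythonfrom multimolecule.models import RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\n# in MultiMolecule tokenizers the <bos> token is expected to alias <cls>,\n# and <sep> is expected to alias <eos>\nprint(tokenizer.bos_token, tokenizer.cls_token)\nprint(tokenizer.sep_token, tokenizer.eos_token)\n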
Build with multimolecule.AutoModel
","text":"Pythonfrom transformers import AutoTokenizer\n\nfrom multimolecule import AutoModelForSequencePrediction\n\nmodel = AutoModelForSequencePrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"zh/models/#_5","title":"\u76f4\u63a5\u200b\u8bbf\u95ee","text":"\u200b\u6240\u6709\u200b\u6a21\u578b\u200b\u53ef\u4ee5\u200b\u901a\u8fc7\u200b from_pretrained
\u200b\u65b9\u6cd5\u200b\u76f4\u63a5\u200b\u52a0\u8f7d\u200b\u3002
from multimolecule.models import RnaFmForTokenPrediction, RnaTokenizer\n\nmodel = RnaFmForTokenPrediction.from_pretrained(\"multimolecule/rnafm\")\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"zh/models/#transformersautomodel","title":"\u4f7f\u7528\u200b transformers.AutoModel
\u200b\u6784\u5efa","text":"\u200b\u867d\u7136\u200b\u6211\u4eec\u200b\u4e3a\u200b\u6a21\u578b\u200b\u7c7b\u200b\u4f7f\u7528\u200b\u4e86\u200b\u4e0d\u540c\u200b\u7684\u200b\u547d\u540d\u200b\u7ea6\u5b9a\u200b\uff0c\u200b\u4f46\u200b\u6a21\u578b\u200b\u4ecd\u7136\u200b\u6ce8\u518c\u200b\u5230\u200b\u76f8\u5e94\u200b\u7684\u200b transformers.AutoModel
\u200b\u4e2d\u200b\u3002
from transformers import AutoModelForSequenceClassification, AutoTokenizer\n\nimport multimolecule # noqa: F401\n\nmodel = AutoModelForSequenceClassification.from_pretrained(\"multimolecule/mrnafm\")\ntokenizer = AutoTokenizer.from_pretrained(\"multimolecule/mrnafm\")\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
import multimolecule before use
Note that you must import multimolecule
before building models with transformers.AutoModel
. The model registration is done in the multimolecule
package; the models are not available in the transformers
package.
If multimolecule
is not imported before using transformers.AutoModel
, the following error will be raised:
ValueError: The checkpoint you are trying to load has model type `rnafm` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.\n
"},{"location":"zh/models/#_6","title":"\u521d\u59cb\u5316\u200b\u4e00\u4e2a\u200b\u9999\u8349\u200b\u6a21\u578b","text":"\u200b\u4f60\u200b\u4e5f\u200b\u53ef\u4ee5\u200b\u4f7f\u7528\u200b\u6a21\u578b\u200b\u7c7b\u200b\u521d\u59cb\u5316\u200b\u4e00\u4e2a\u200b\u57fa\u7840\u200b\u6a21\u578b\u200b\u3002
Pythonfrom multimolecule.models import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer\n\nconfig = RnaFmConfig()\nmodel = RnaFmForTokenPrediction(config)\ntokenizer = RnaTokenizer()\n\nsequence = \"UAGCGUAUCAGACUGAUGUUG\"\noutput = model(**tokenizer(sequence, return_tensors=\"pt\"))\n
"},{"location":"zh/models/#_7","title":"\u53ef\u7528\u200b\u6a21\u578b","text":""},{"location":"zh/models/#dna","title":"\u8131\u6c27\u6838\u7cd6\u6838\u9178\u200b\uff08DNA\uff09","text":"module
\u200b\u63d0\u4f9b\u200b\u4e86\u200b\u4e00\u7cfb\u5217\u200b\u9884\u5b9a\u200b\u4e49\u200b\u6a21\u5757\u200b\uff0c\u200b\u4f9b\u200b\u7528\u6237\u200b\u5b9e\u73b0\u200b\u81ea\u5df1\u200b\u7684\u200b\u67b6\u6784\u200b\u3002
MultiMolecule \u200b\u5efa\u7acb\u200b\u5728\u200b \u200b\u751f\u6001\u7cfb\u7edf\u200b\u4e4b\u4e0a\u200b\uff0c\u200b\u62e5\u62b1\u200b\u7c7b\u4f3c\u200b\u7684\u200b\u8bbe\u8ba1\u200b\u7406\u5ff5\u200b\uff1a\u200b\u4e0d\u8981\u200b \u200b\u91cd\u590d\u200b\u81ea\u5df1\u200b\u3002 \u200b\u6211\u4eec\u200b\u9075\u5faa\u200b \u200b\u5355\u4e00\u200b\u6a21\u578b\u200b\u6587\u4ef6\u200b\u7b56\u7565\u200b
\uff0c\u200b\u5176\u4e2d\u200b models
\u200b\u5305\u4e2d\u200b\u7684\u200b\u6bcf\u4e2a\u200b\u6a21\u578b\u200b\u90fd\u200b\u5305\u542b\u200b\u4e00\u4e2a\u200b\u4e14\u200b\u4ec5\u200b\u6709\u200b\u4e00\u4e2a\u200b\u63cf\u8ff0\u200b\u7f51\u7edc\u200b\u8bbe\u8ba1\u200b\u7684\u200b modeling.py
\u200b\u6587\u4ef6\u200b\u3002
module
package aims to provide simple, reusable modules that are kept consistent across multiple models. This approach minimizes code duplication and promotes clean, maintainable code.
module
includes components that are commonly used across different models, such as SequencePredictionHead
. This reduces redundancy and simplifies the development process.module
package focuses on simpler components, leaving complex, model-specific variations to be defined in each model's modeling.py
.SequencePredictionHead
, TokenPredictionHead
, and ContactPredictionHead
.SinusoidalEmbedding
and RotaryEmbedding
.embeddings
provides a collection of predefined positional encodings.
heads
provides a collection of model prediction heads for handling different tasks.
heads
accept a ModelOutput
, dict
, or tuple
as input. They automatically locate the model output required for the prediction and process it accordingly.
Some prediction heads may require additional information, such as the attention_mask
or the input_ids
, for example ContactPredictionHead
. These additional arguments can be passed in as positional or keyword arguments.
Note that heads
follow the same ModelOutput
convention as Transformers. If the model output is a tuple
, we treat the first element as the pooler_output
, the second element as the last_hidden_state
, and the last element as the attention_map
. Users are responsible for ensuring that the model output is in the correct format.
If the model output is a ModelOutput
or a dict
, heads
will look up the HeadConfig.output_name
from the model output. You can specify the output_name
in the HeadConfig
to ensure that heads
can correctly locate the required tensor.
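As an illustration of this convention, the following hypothetical sketch wires a standalone prediction head to a backbone output and selects the tensor via output_name. The import paths and constructor signatures used here (RnaFmModel, HeadConfig, and SequencePredictionHead taking the model config plus a head config) are assumptions rather than the documented API; check the module reference before relying on them.
Pythonfrom multimolecule.models import RnaFmConfig, RnaFmModel, RnaTokenizer\nfrom multimolecule.module import HeadConfig, SequencePredictionHead  # assumed import path\n\nconfig = RnaFmConfig()\nbackbone = RnaFmModel(config)\n# output_name tells the head which tensor to pick out of the backbone's ModelOutput\nhead = SequencePredictionHead(config, HeadConfig(num_labels=2, output_name=\"pooler_output\"))  # assumed signature\n\ntokenizer = RnaTokenizer()\ninputs = tokenizer(\"UAGCGUAUCAGACUGAUGUUG\", return_tensors=\"pt\")\noutputs = backbone(**inputs)\n\n# the head looks up HeadConfig.output_name in the ModelOutput and processes that tensor\nprediction = head(outputs)\n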
tokenisers
provides a collection of predefined tokenizers.
A tokenizer is a class that converts a nucleotide or amino acid sequence into a sequence of indices. It is used to preprocess input sequences before they are fed to a model.
See Tokenizer for more details.
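As a quick illustration, the sketch below converts an RNA sequence into token indices; it reuses the multimolecule/rnafm checkpoint from the models page.
Pythonfrom multimolecule.models import RnaTokenizer\n\ntokenizer = RnaTokenizer.from_pretrained(\"multimolecule/rnafm\")\n\n# convert a nucleotide sequence into a sequence of indices\nencoded = tokenizer(\"UAGCGUAUCAGACUGAUGUUG\", return_tensors=\"pt\")\nprint(encoded[\"input_ids\"])\n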
"},{"location":"zh/tokenisers/#_1","title":"\u53ef\u7528\u200b\u4ee4\u724c\u200b\u5668","text":"我们相信开放是研究的基础。
-MultiMolecule 在GNU Affero 通用公共许可证下授权。
+MultiMolecule 在 GNU Affero 通用公共许可证 下授权。
请加入我们,共同建立一个开放的研究社区。
SPDX-License-Identifier: AGPL-3.0-or-later