License + docs + docstrings for ProteinFr, WtrCls, ScanWtr

valence-labs · Jul 12, 2024 · dac9308
1 parent bbe9f3a
Showing 12 changed files with 418 additions and 307 deletions.
diff --git a/LICENSE b/LICENSE
diff --git a/docs/API/regressor.md b/docs/API/regressor.md
@@ -0,0 +1 @@
+::: openqdc.utils.regressor
diff --git a/docs/_overrides/main.html b/docs/_overrides/main.html
diff --git a/docs/assets/qdc_logo.png b/docs/assets/qdc_logo.png
diff --git a/docs/css/custom-openqdc.css b/docs/css/custom-openqdc.css
@@ -1,20 +1,20 @@
 :root {
-    --datamol-primary: #ad069c;
-    --datamol-secondary: #343a40;
+    --openqdc-primary: #201342;
+    --openqdc-secondary: #4A1E7E;
 
     /* Primary color shades */
-    --md-primary-fg-color: var(--datamol-primary);
-    --md-primary-fg-color--light: var(--datamol-primary);
-    --md-primary-fg-color--dark: var(--datamol-primary);
-    --md-primary-bg-color: var(--datamol-secondary);
-    --md-primary-bg-color--light: var(--datamol-secondary);
-    --md-text-link-color: var(--datamol-secondary);
+    --md-primary-fg-color: var(--openqdc-primary);
+    --md-primary-fg-color--light: var(--openqdc-primary);
+    --md-primary-fg-color--dark: var(--openqdc-primary);
+    --md-primary-bg-color: var(--openqdc-secondary);
+    --md-primary-bg-color--light: var(--openqdc-secondary);
+    --md-text-link-color: var(--openqdc-secondary);
 
     /* Accent color shades */
-    --md-accent-fg-color: var(--datamol-secondary);
-    --md-accent-fg-color--transparent: var(--datamol-secondary);
-    --md-accent-bg-color: var(--datamol-secondary);
-    --md-accent-bg-color--light: var(--datamol-secondary);
+    --md-accent-fg-color: var(--openqdc-secondary);
+    --md-accent-fg-color--transparent: var(--openqdc-secondary);
+    --md-accent-bg-color: var(--openqdc-secondary);
+    --md-accent-bg-color--light: var(--openqdc-secondary);
   }
 
   :root>* {
@@ -23,24 +23,24 @@
     --md-code-fg-color: hsla(200, 18%, 26%, 1);
 
     /* Footer */
-    --md-footer-bg-color: var(--datamol-primary);
+    --md-footer-bg-color: var(--openqdc-primary);
     /* --md-footer-bg-color--dark: hsla(0, 0%, 0%, 0.32); */
-    --md-footer-fg-color: var(--datamol-secondary);
-    --md-footer-fg-color--light: var(--datamol-secondary);
-    --md-footer-fg-color--lighter: var(--datamol-secondary);
+    --md-footer-fg-color: var(--openqdc-secondary);
+    --md-footer-fg-color--light: var(--openqdc-secondary);
+    --md-footer-fg-color--lighter: var(--openqdc-secondary);
 
   }
 
   .md-header {
-    background-image: linear-gradient(to right, #ad069c, #470b41);
+    background-image: linear-gradient(to right, #131036, #4A1E7E);
   }
 
   .md-footer {
-    background-image: linear-gradient(to right, #ad069c, #470b41);
+    background-image: linear-gradient(to right, #131036, #4A1E7E);
   }
 
   .md-tabs {
-    background-image: linear-gradient(to right, #F4F6F9, #e9bde4);
+    background-image: linear-gradient(to right, #F4F6F9, #b39bce);
   }
 
   .md-header__topic {

diff --git a/docs/e0s_and_qm.md b/docs/e0s_and_qm.md
@@ -0,0 +1,32 @@
+# Overview of QM Methods and Normalization
+
+OpenQDC supports 250+ QM methods and provides a way to standardize and categorize
+the levels of theory used for quantum mechanics single-point calculations,
+adding value and information to the datasets.
+
+## Level of Theory
+
+To avoid inconsistencies, levels of theory are standardized and categorized into Python Enums
+consisting of a functional, a basis set, and a correction method.
+OpenQDC covers more than 106 functionals, 20 basis sets, and 11
+correction methods.
+OpenQDC provides the computed isolated atom energies `e0` for each QM method.
+
+
+## Normalization
+
+
+We support normalizing energies through "physical" and "regression" normalization to conserve the size extensivity of chemical systems.
+Through physical normalization, OpenQDC transforms the potential energy into an atomization energy by subtracting the isolated atom energies `e0`,
+a physically interpretable and extensivity-conserving normalization method. Alternatively, we
+precompute the average contribution of each atom species to the potential energy via linear or ridge
+regression, centering the distribution at 0 and providing uncertainty estimation for the computed
+values. Predicted atomic energies can also be scaled to approximate a standard normal distribution.
+
+### Physical Normalization
+
+
+
+### Regression Normalization
+
+
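A minimal sketch of the two normalization schemes described in `docs/e0s_and_qm.md`, assuming plain NumPy rather than the `openqdc.utils.regressor` API. The function names, the `E0` table, and all energy values below are hypothetical and chosen only for illustration. Physical normalization subtracts the isolated atom energies; regression normalization fits one energy contribution per atom species by linear or ridge least squares.

```python
import numpy as np

# Hypothetical isolated atom energies e0 keyed by atomic number.
# Illustrative values only, not taken from OpenQDC.
E0 = {1: -0.5, 6: -37.8, 8: -75.0}


def physical_normalization(energy, atomic_numbers, e0=E0):
    """Turn a potential energy into an atomization energy by subtracting e0."""
    return energy - sum(e0[z] for z in atomic_numbers)


def composition_matrix(molecules, species):
    """Count how many atoms of each species every molecule contains."""
    return np.array([[mol.count(z) for z in species] for mol in molecules], dtype=float)


def regression_e0(molecules, energies, species, ridge=0.0):
    """Fit per-species energy contributions; ridge=0.0 is ordinary least squares."""
    X = composition_matrix(molecules, species)
    y = np.asarray(energies, dtype=float)
    # Normal equations with an optional ridge penalty on the coefficients.
    A = X.T @ X + ridge * np.eye(len(species))
    return np.linalg.solve(A, X.T @ y)


species = [1, 6, 8]
# Toy compositions: H2O, CH4, CO, CO2, H2 (atomic numbers only).
molecules = [[8, 1, 1], [6, 1, 1, 1, 1], [6, 8], [6, 8, 8], [1, 1]]
# Toy energies built as exact sums of E0, so the fit should recover E0.
energies = [sum(E0[z] for z in mol) for mol in molecules]

fitted = regression_e0(molecules, energies, species)
residuals = np.asarray(energies) - composition_matrix(molecules, species) @ fitted
atomization = physical_normalization(-76.4, [8, 1, 1])
```

Because the toy energies are exact sums of `E0`, the residuals are centered at 0. With real data the residuals carry the size-independent part of the energy, and a nonzero `ridge` trades a small bias for more stable coefficients when some species are rare.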
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -17,10 +17,12 @@ nav:
   - Usage: usage.md
   - CLI: cli.md
   - Available Datasets: datasets.md
+  - QM methods: e0s_and_qm.md
   - Tutorials:
     - Really hard example: tutorials/usage.ipynb
   - API:
     - e0 and QM methods: API/methods.md
+    - e0 regression: API/regressor.md
     - Datasets:
       - Potential Energy:
         - Alchemy : API/datasets/alchemy.md
@@ -63,13 +65,12 @@ nav:
 
 theme:
   name: material
-  custom_dir: docs/_overrides
-  #palette:
-  #  primary: purple
-  #  accent: purple
+  #custom_dir: docs/_overrides
   features:
     - navigation.tabs
     #- navigation.expand
+  #favicon: assets/qdc_logo.png
+  logo: assets/qdc_logo.png
 
 
 extra_css:

diff --git a/openqdc/datasets/potential/proteinfragments.py b/openqdc/datasets/potential/proteinfragments.py
@@ -88,12 +88,37 @@ def _unpack_data_tuple(self, data):
         return q, s, z, r, e, f, d
 
 
-# graphs is smiles
 class ProteinFragments(BaseDataset):
-    """https://www.science.org/doi/10.1126/sciadv.adn4397"""
+    """
+    ProteinFragments is a dataset of protein fragments whose
+    data was generated with a top-down and a bottom-up approach:
+
+    Top-down:
+        Fragments are generated by cutting out a spherical
+        region around an atom (including solvent molecules)
+        and saturating all dangling bonds.
+        Sampling was done with molecular dynamics (MD) using a
+        conventional force field (FF) at room temperature.
+
+    Bottom-up:
+        Fragments are generated by constructing chemical graphs
+        of one to eight nonhydrogen atoms.
+        Sampling of multiple conformers per fragment was done with
+        MD simulations at high temperatures or normal mode sampling.
+
+
+    Usage:
+    ```python
+    from openqdc.datasets import ProteinFragments
+    dataset = ProteinFragments()
+    ```
+
+    References:
+        https://www.science.org/doi/10.1126/sciadv.adn4397
+    """
 
     __name__ = "proteinfragments"
-
+    # PBE0/def2-TZVPP+MBD
     __energy_methods__ = [
         PotentialMethod.WB97X_6_31G_D,  # "wb97x/6-31g(d)"
     ]
@@ -135,7 +160,28 @@ def read_raw_entries(self):
 
 class MDDataset(ProteinFragments):
     """
-    Part of the proteinfragments dataset that is generated from the molecular dynamics with their model.
+    MDDataset is a subset of the proteinfragments dataset that was
+    generated from molecular dynamics simulations with the authors' model.
+    The sampling was done with molecular dynamics
+    at room temperature (300 K) in the gas phase or in aqueous solution:
+
+    Subsets:
+        Polyalanine:
+            All polyalanine peptides are sampled in the gas phase. AceAla15Lys is
+            a polyalanine peptide capped with an N-terminal acetyl group
+            and a protonated lysine residue at the C-terminus;
+            AceAla15Nme is a polyalanine peptide capped with an N-terminal acetyl group
+            and a C-terminal N-methyl amide group.\n
+        Crambin: 46-residue protein crambin in aqueous solution (25,257 atoms)
+
+    Usage:
+    ```python
+    from openqdc.datasets import MDDataset
+    dataset = MDDataset()
+    ```
+
+    References:
+        https://www.science.org/doi/10.1126/sciadv.adn4397
     """
 
     __name__ = "mddataset"

diff --git a/openqdc/datasets/potential/qmx.py b/openqdc/datasets/potential/qmx.py
@@ -77,6 +77,26 @@ def read_raw_entries(self):
 # ['smiles', 'E1-CC2', 'E2-CC2', 'f1-CC2', 'f2-CC2', 'E1-PBE0', 'E2-PBE0', 'f1-PBE0', 'f2-PBE0',
 # 'E1-PBE0.1', 'E2-PBE0.1', 'f1-PBE0.1', 'f2-PBE0.1', 'E1-CAM', 'E2-CAM', 'f1-CAM', 'f2-CAM']
 class QM7(QMX):
+    """
+    QM7 is a dataset constructed from subsets of the GDB-13 database
+    (stable and synthetically accessible organic molecules),
+    containing up to seven “heavy” atoms.
+    The molecular conformations are optimized using DFT at the
+    PBE0/def2-TZVP level of theory.
+
+    Chemical species:
+        [C, N, O, S, H]
+
+    Usage:
+    ```python
+    from openqdc.datasets import QM7
+    dataset = QM7()
+    ```
+
+    References:
+        https://arxiv.org/pdf/1703.00564
+    """
+
     __links__ = {"qm7.hdf5.gz": "https://zenodo.org/record/3588337/files/150.hdf5.gz?download=1"}
     __name__ = "qm7"
 
@@ -167,6 +187,26 @@ class QM7(QMX):
 
 
 class QM7b(QMX):
+    """
+    QM7b is a dataset constructed from subsets of the GDB-13 database
+    (stable and synthetically accessible organic molecules),
+    containing up to seven “heavy” atoms.
+    The molecular conformations are optimized using DFT at the
+    PBE0/def2-TZVP level of theory.
+
+    Chemical species:
+        [C, N, O, S, Cl, H]
+
+    Usage:
+    ```python
+    from openqdc.datasets import QM7b
+    dataset = QM7b()
+    ```
+
+    References:
+        https://arxiv.org/pdf/1703.00564
+    """
+
     __links__ = {"qm7b.hdf5.gz": "https://zenodo.org/record/3588335/files/200.hdf5.gz?download=1"}
     __name__ = "qm7b"
     energy_target_names = [
@@ -251,16 +291,25 @@ class QM7b(QMX):
 
 
 class QM8(QMX):
-    """QM8 is the dataset used in a study on modeling quantum
+    """QM8 is the subset of QM9 used in a study on modeling quantum
     mechanical calculations of electronic spectra and excited
-    state energy (ka increase of energy from the ground states) of small molecules. Multiple methods, including
+    state energy (an increase in energy from the ground state) of small molecules
+    up to eight heavy atoms.
+    Multiple methods were used, including
     time-dependent density functional theories (TDDFT) and
-    second-order approximate coupled-cluster (CC2)
-    - Column 1: Molecule ID (gdb9 index) mapping to the .sdf file
-    - Columns 2-5: RI-CC2/def2TZVP
-    - Columns 6-9: LR-TDPBE0/def2SVP
-    - Columns 10-13: LR-TDPBE0/def2TZVP
-    - Columns 14-17: LR-TDCAM-B3LYP/def2TZVP
+    second-order approximate coupled-cluster (CC2).
+    The molecular conformations are relaxed geometries computed using
+    DFT at the B3LYP/6-31G(2df,p) level of theory.
+    For more information about the sampling, see the QM9 dataset.
+
+    Usage:
+    ```python
+    from openqdc.datasets import QM8
+    dataset = QM8()
+    ```
+
+    References:
+        https://arxiv.org/pdf/1504.01966
 
     """
 

diff --git a/openqdc/datasets/potential/waterclusters.py b/openqdc/datasets/potential/waterclusters.py
@@ -101,7 +101,31 @@ def format_geometry_and_entries(geometries, energies, subset):
 
 
 class SCANWaterClusters(BaseDataset):
-    """https://chemrxiv.org/engage/chemrxiv/article-details/662aaff021291e5d1db7d8ec"""
+    """
+    The SCAN Water Clusters dataset contains conformations of
+    neutral water clusters containing up to 20 monomers, charged water clusters,
+    and alkali- and halide-water clusters. This dataset consists of four data sets of water clusters:
+    the benchmark energy and geometry database (BEGDB) neutral water cluster subset; the WATER27 set of 14
+    neutral, 5 protonated, 7 deprotonated, and one auto-ionized water cluster; and two sets of
+    ion-water clusters M...(H2O)n, where M = Li+, Na+, K+, F−, Cl−, or Br−.
+    Water clusters were obtained from 10 nanosecond gas-phase molecular dynamics
+    simulations using AMBER 9 and optimized; the
+    lowest energy isomers were determined using MP2/aug-cc-pVDZ//MP2/6-31G* Gibbs free energies.
+
+
+    Chemical Species:
+        [H, O, Li, Na, K, F, Cl, Br]
+
+    Usage:
+    ```python
+    from openqdc.datasets import SCANWaterClusters
+    dataset = SCANWaterClusters()
+    ```
+
+    References:
+        https://chemrxiv.org/engage/chemrxiv/article-details/662aaff021291e5d1db7d8ec\n
+        https://github.com/esoteric-ephemera/water_cluster_density_errors
+    """
 
     __name__ = "scanwaterclusters"
 

diff --git a/openqdc/datasets/potential/waterclusters3_30.py b/openqdc/datasets/potential/waterclusters3_30.py
@@ -53,6 +53,10 @@ class WaterClusters(BaseDataset):
     clusters of sizes n = 3 - 30. The cluster structures are derived and labeled with
     the TTM2.1-F ab-initio based interaction potential for water.
     It contains approximately 4.5 mil. structures.
+    Sampling was done with the Monte Carlo Temperature Basin Paving (MCTBP) method.
+
+    Chemical Species:
+        [H, O]
 
     Usage:
     ```python
@@ -61,8 +65,8 @@ class WaterClusters(BaseDataset):
     ```
 
     References:
-    - https://doi.org/10.1063/1.5128378
-    - https://sites.uw.edu/wdbase/database-of-water-clusters/
+        https://doi.org/10.1063/1.5128378\n
+        https://sites.uw.edu/wdbase/database-of-water-clusters/\n
     """
 
     __name__ = "waterclusters3_30"