License + docs + docstrings for ProteinFr, WtrCls, ScanWtr
FNTwin committed Jul 12, 2024
1 parent bbe9f3a commit dac9308
Showing 12 changed files with 418 additions and 307 deletions.
360 changes: 159 additions & 201 deletions LICENSE

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/API/regressor.md
@@ -0,0 +1 @@
::: openqdc.utils.regressor
46 changes: 0 additions & 46 deletions docs/_overrides/main.html

This file was deleted.

Binary file added docs/assets/qdc_logo.png
38 changes: 19 additions & 19 deletions docs/css/custom-openqdc.css
@@ -1,20 +1,20 @@
:root {
--datamol-primary: #ad069c;
--datamol-secondary: #343a40;
--openqdc-primary: #201342;
--openqdc-secondary: #4A1E7E;

/* Primary color shades */
--md-primary-fg-color: var(--datamol-primary);
--md-primary-fg-color--light: var(--datamol-primary);
--md-primary-fg-color--dark: var(--datamol-primary);
--md-primary-bg-color: var(--datamol-secondary);
--md-primary-bg-color--light: var(--datamol-secondary);
--md-text-link-color: var(--datamol-secondary);
--md-primary-fg-color: var(--openqdc-primary);
--md-primary-fg-color--light: var(--openqdc-primary);
--md-primary-fg-color--dark: var(--openqdc-primary);
--md-primary-bg-color: var(--openqdc-secondary);
--md-primary-bg-color--light: var(--openqdc-secondary);
--md-text-link-color: var(--openqdc-secondary);

/* Accent color shades */
--md-accent-fg-color: var(--datamol-secondary);
--md-accent-fg-color--transparent: var(--datamol-secondary);
--md-accent-bg-color: var(--datamol-secondary);
--md-accent-bg-color--light: var(--datamol-secondary);
--md-accent-fg-color: var(--openqdc-secondary);
--md-accent-fg-color--transparent: var(--openqdc-secondary);
--md-accent-bg-color: var(--openqdc-secondary);
--md-accent-bg-color--light: var(--openqdc-secondary);
}

:root>* {
@@ -23,24 +23,24 @@
--md-code-fg-color: hsla(200, 18%, 26%, 1);

/* Footer */
--md-footer-bg-color: var(--datamol-primary);
--md-footer-bg-color: var(--openqdc-primary);
/* --md-footer-bg-color--dark: hsla(0, 0%, 0%, 0.32); */
--md-footer-fg-color: var(--datamol-secondary);
--md-footer-fg-color--light: var(--datamol-secondary);
--md-footer-fg-color--lighter: var(--datamol-secondary);
--md-footer-fg-color: var(--openqdc-secondary);
--md-footer-fg-color--light: var(--openqdc-secondary);
--md-footer-fg-color--lighter: var(--openqdc-secondary);

}

.md-header {
background-image: linear-gradient(to right, #ad069c, #470b41);
background-image: linear-gradient(to right, #131036, #4A1E7E);
}

.md-footer {
background-image: linear-gradient(to right, #ad069c, #470b41);
background-image: linear-gradient(to right, #131036, #4A1E7E);
}

.md-tabs {
background-image: linear-gradient(to right, #F4F6F9, #e9bde4);
background-image: linear-gradient(to right, #F4F6F9, #b39bce);
}

.md-header__topic {
32 changes: 32 additions & 0 deletions docs/e0s_and_qm.md
@@ -0,0 +1,32 @@
# Overview of QM Methods and Normalization

OpenQDC supports more than 250 QM methods and provides a way to standardize and categorize
the different levels of theory used for quantum mechanics single-point calculations,
adding value and information to the datasets.

## Level of Theory

To avoid inconsistencies, levels of theory are standardized and categorized into Python Enums
consisting of a functional, a basis set, and a correction method.
OpenQDC covers more than 106 functionals, 20 basis sets, and 11
correction methods.
OpenQDC provides the computed isolated atom energies `e0` for each QM method.
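
As a rough illustration of how a level of theory can be encoded as an Enum built from a functional, a basis set, and a correction method, consider the sketch below. The class and member names (`Functional`, `BasisSet`, `Correction`, `LevelOfTheory`) are hypothetical and only a tiny subset is shown; this is not OpenQDC's actual API.

```python
from enum import Enum


class Functional(Enum):
    # Hypothetical subset of functionals, for illustration only
    WB97X = "wb97x"
    PBE0 = "pbe0"


class BasisSet(Enum):
    # Hypothetical subset of basis sets
    B6_31G_D = "6-31g(d)"
    DEF2_TZVP = "def2-tzvp"


class Correction(Enum):
    # Hypothetical subset of correction methods
    NONE = ""
    D3 = "d3"


class LevelOfTheory(Enum):
    # Each member is a (functional, basis set, correction) triple
    WB97X_6_31G_D = (Functional.WB97X, BasisSet.B6_31G_D, Correction.NONE)
    PBE0_DEF2_TZVP_D3 = (Functional.PBE0, BasisSet.DEF2_TZVP, Correction.D3)

    def __init__(self, functional, basis_set, correction):
        # Enum unpacks the tuple value into these arguments
        self.functional = functional
        self.basis_set = basis_set
        self.correction = correction

    def __str__(self):
        # Render as "functional/basis" or "functional-correction/basis"
        name = self.functional.value
        if self.correction is not Correction.NONE:
            name = f"{name}-{self.correction.value}"
        return f"{name}/{self.basis_set.value}"
```

Keeping the three components as separate Enums means a typo in a functional or basis set name fails at import time instead of silently creating a new, inconsistent label.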


## Normalization


We provide "physical" and "regression" normalization of energies to conserve the size extensivity of chemical systems.
Through this normalization, OpenQDC provides a way to transform the potential energy into the atomization energy by subtracting the isolated atom energies `e0`,
a physically interpretable and extensivity-conserving normalization method. Alternatively, we
pre-compute the average contribution of each atom species to the potential energy via linear or ridge
regression, centering the distribution at 0 and providing uncertainty estimation for the computed
values. Predicted atomic energies can also be scaled to approximate a standard normal distribution.
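
The two normalizations can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not OpenQDC's implementation: the `E0` values are made-up numbers in arbitrary units, all function names are hypothetical, and the uncertainty estimation and scaling steps are left out.

```python
# Hypothetical isolated atom energies e0 (arbitrary units), keyed by atomic number.
E0 = {1: -0.5, 6: -37.8, 8: -75.0}


def atomization_energy(atomic_numbers, potential_energy, e0=E0):
    """Physical normalization: subtract the isolated atom energies."""
    return potential_energy - sum(e0[z] for z in atomic_numbers)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_atomic_energies(molecules, energies, species, ridge=0.0):
    """Regression normalization: fit one energy per atom species.

    Solves the normal equations (X^T X + ridge * I) w = X^T y, where
    X[i][j] counts the atoms of species j in molecule i.
    """
    X = [[list(m).count(z) for z in species] for m in molecules]
    k = len(species)
    A = [[sum(row[a] * row[b] for row in X) + (ridge if a == b else 0.0)
          for b in range(k)] for a in range(k)]
    rhs = [sum(row[a] * y for row, y in zip(X, energies)) for a in range(k)]
    return solve(A, rhs)


def residual_energy(atomic_numbers, potential_energy, species, coeffs):
    """Energy left after removing the fitted per-species contributions."""
    per_species = dict(zip(species, coeffs))
    return potential_energy - sum(per_species[z] for z in atomic_numbers)


# Toy data: H2, H2O, CH4, CO with made-up energies.
molecules = [[1, 1], [1, 1, 8], [6, 1, 1, 1, 1], [6, 8]]
energies = [-1.2, -76.3, -40.4, -113.1]
coeffs = fit_atomic_energies(molecules, energies, species=[1, 6, 8])
```

Both transformations subtract a sum over atoms, so the normalized energy still scales with system size. With `ridge=0.0` the fit is ordinary least squares; a positive `ridge` adds the usual L2 penalty.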

### Physical Normalization



### Regression Normalization


9 changes: 5 additions & 4 deletions mkdocs.yml
@@ -17,10 +17,12 @@ nav:
- Usage: usage.md
- CLI: cli.md
- Available Datasets: datasets.md
- QM methods: e0s_and_qm.md
- Tutorials:
- Really hard example: tutorials/usage.ipynb
- API:
- e0 and QM methods: API/methods.md
- e0 regression: API/regressor.md
- Datasets:
- Potential Energy:
- Alchemy : API/datasets/alchemy.md
@@ -63,13 +65,12 @@ nav:

theme:
name: material
custom_dir: docs/_overrides
#palette:
# primary: purple
# accent: purple
#custom_dir: docs/_overrides
features:
- navigation.tabs
#- navigation.expand
#favicon: assets/qdc_logo.png
logo: assets/qdc_logo.png


extra_css:
54 changes: 50 additions & 4 deletions openqdc/datasets/potential/proteinfragments.py
@@ -88,12 +88,37 @@ def _unpack_data_tuple(self, data):
return q, s, z, r, e, f, d


# graphs is smiles
class ProteinFragments(BaseDataset):
"""https://www.science.org/doi/10.1126/sciadv.adn4397"""
"""
ProteinFragments is a dataset of protein fragments.
The data was generated with a top-down and a bottom-up approach:
Top-down:
Fragments are generated by cutting out a spherical
region around an atom (including solvent molecules)
and saturating all dangling bonds.
Sampling was done with Molecular Dynamics (MD) using a
conventional force field (FF) at room temperature.
Bottom-up:
Fragments are generated by constructing chemical graphs
of one to eight nonhydrogen atoms.
Sampling of multiple conformers per fragment was done with
MD simulations at high temperatures or normal mode sampling.
Usage:
```python
from openqdc.datasets import ProteinFragments
dataset = ProteinFragments()
```
References:
https://www.science.org/doi/10.1126/sciadv.adn4397
"""

__name__ = "proteinfragments"

# PBE0/def2-TZVPP+MBD
__energy_methods__ = [
PotentialMethod.WB97X_6_31G_D, # "wb97x/6-31g(d)"
]
@@ -135,7 +160,28 @@ def read_raw_entries(self):

class MDDataset(ProteinFragments):
"""
Part of the proteinfragments dataset that is generated from the molecular dynamics with their model.
MDDataset is a subset of the proteinfragments dataset that
was generated from molecular dynamics with the model of the referenced paper.
The sampling was done with Molecular Dynamics
at room temperature (300 K) in various solvent phases:
Subsets:
Polyalanine:
All the polyalanine peptides are sampled in gas phase. AceAla15Lys is
a polyalanine peptide capped with an N-terminal acetyl group
and a protonated lysine residue at the C-terminus.
AceAla15Nme is a polyalanine peptide capped with an N-terminal acetyl group
and a C-terminal N-methyl amide group.\n
Crambin: 46-residue protein crambin in aqueous solution (25,257 atoms)
Usage:
```python
from openqdc.datasets import MDDataset
dataset = MDDataset()
```
References:
https://www.science.org/doi/10.1126/sciadv.adn4397
"""

__name__ = "mddataset"
65 changes: 57 additions & 8 deletions openqdc/datasets/potential/qmx.py
@@ -77,6 +77,26 @@ def read_raw_entries(self):
# ['smiles', 'E1-CC2', 'E2-CC2', 'f1-CC2', 'f2-CC2', 'E1-PBE0', 'E2-PBE0', 'f1-PBE0', 'f2-PBE0',
# 'E1-PBE0.1', 'E2-PBE0.1', 'f1-PBE0.1', 'f2-PBE0.1', 'E1-CAM', 'E2-CAM', 'f1-CAM', 'f2-CAM']
class QM7(QMX):
"""
QM7 is a dataset constructed from subsets of the GDB-13 database
(stable and synthetically accessible organic molecules),
containing molecules with up to seven “heavy” atoms.
The molecular conformations are optimized using DFT at the
PBE0/def2-TZVP level of theory.
Chemical species:
[C, N, O, S, H]
Usage:
```python
from openqdc.datasets import QM7
dataset = QM7()
```
References:
https://arxiv.org/pdf/1703.00564
"""

__links__ = {"qm7.hdf5.gz": "https://zenodo.org/record/3588337/files/150.hdf5.gz?download=1"}
__name__ = "qm7"

@@ -167,6 +187,26 @@ class QM7(QMX):


class QM7b(QMX):
"""
QM7b is a dataset constructed from subsets of the GDB-13 database
(stable and synthetically accessible organic molecules),
containing molecules with up to seven “heavy” atoms.
The molecular conformations are optimized using DFT at the
PBE0/def2-TZVP level of theory.
Chemical species:
[C, N, O, S, Cl, H]
Usage:
```python
from openqdc.datasets import QM7b
dataset = QM7b()
```
References:
https://arxiv.org/pdf/1703.00564
"""

__links__ = {"qm7b.hdf5.gz": "https://zenodo.org/record/3588335/files/200.hdf5.gz?download=1"}
__name__ = "qm7b"
energy_target_names = [
@@ -251,16 +291,25 @@ class QM7b(QMX):


class QM8(QMX):
"""QM8 is the dataset used in a study on modeling quantum
"""QM8 is the subset of QM9 used in a study on modeling quantum
mechanical calculations of electronic spectra and excited
state energy (ka increase of energy from the ground states) of small molecules. Multiple methods, including
state energy (an increase in energy over the ground state) of small molecules
up to eight heavy atoms.
Multiple methods were used, including
time-dependent density functional theories (TDDFT) and
second-order approximate coupled-cluster (CC2)
- Column 1: Molecule ID (gdb9 index) mapping to the .sdf file
- Columns 2-5: RI-CC2/def2TZVP
- Columns 6-9: LR-TDPBE0/def2SVP
- Columns 10-13: LR-TDPBE0/def2TZVP
- Columns 14-17: LR-TDCAM-B3LYP/def2TZVP
second-order approximate coupled-cluster (CC2).
The molecular conformations are relaxed geometries computed with
DFT at the B3LYP/6-31G(2df,p) level of theory.
For more information about the sampling, see the QM9 dataset.
Usage:
```python
from openqdc.datasets import QM8
dataset = QM8()
```
References:
https://arxiv.org/pdf/1504.01966
"""

26 changes: 25 additions & 1 deletion openqdc/datasets/potential/waterclusters.py
@@ -101,7 +101,31 @@ def format_geometry_and_entries(geometries, energies, subset):


class SCANWaterClusters(BaseDataset):
"""https://chemrxiv.org/engage/chemrxiv/article-details/662aaff021291e5d1db7d8ec"""
"""
The SCAN Water Clusters dataset contains conformations of
neutral water clusters containing up to 20 monomers, charged water clusters,
and alkali- and halide-water clusters. This dataset consists of four data sets of water clusters:
the benchmark energy and geometry database (BEGDB) neutral water cluster subset; the WATER27 set of 14
neutral, 5 protonated, 7 deprotonated, and one auto-ionized water cluster; and two sets of
ion-water clusters M...(H2O)n, where M = Li+, Na+, K+, F−, Cl−, or Br−.
Water clusters were obtained from 10 nanosecond gas-phase molecular dynamics
simulations using AMBER 9 and then optimized; the
lowest-energy isomers were determined using MP2/aug-cc-pVDZ//MP2/6-31G* Gibbs free energies.
Chemical Species:
[H, O, Li, Na, K, F, Cl, Br]
Usage:
```python
from openqdc.datasets import SCANWaterClusters
dataset = SCANWaterClusters()
```
References:
https://chemrxiv.org/engage/chemrxiv/article-details/662aaff021291e5d1db7d8ec\n
https://github.com/esoteric-ephemera/water_cluster_density_errors
"""

__name__ = "scanwaterclusters"

8 changes: 6 additions & 2 deletions openqdc/datasets/potential/waterclusters3_30.py
@@ -53,6 +53,10 @@ class WaterClusters(BaseDataset):
clusters of sizes n = 3 - 30. The cluster structures are derived and labeled with
the TTM2.1-F ab-initio based interaction potential for water.
It contains approximately 4.5 million structures.
Sampling was done with the Monte Carlo Temperature Basin Paving (MCTBP) method.
Chemical Species:
["H", "O"]
Usage:
```python
@@ -61,8 +65,8 @@ class WaterClusters(BaseDataset):
```
References:
- https://doi.org/10.1063/1.5128378
- https://sites.uw.edu/wdbase/database-of-water-clusters/
https://doi.org/10.1063/1.5128378\n
https://sites.uw.edu/wdbase/database-of-water-clusters/\n
"""

__name__ = "waterclusters3_30"