Deploying to gh-pages from @ a73d08a 🚀
iantbeck committed Aug 15, 2024
1 parent 2d929c2 commit 971dd0a
Showing 184 changed files with 22,089 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: f00de67f5842cd54602bae746b2d95e8
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added _images/plot.png
9 changes: 9 additions & 0 deletions _sources/community/community.rst
@@ -0,0 +1,9 @@

PES-Learn Community
===================

.. toctree::

   Support <support>
   Contribute <contribute>

5 changes: 5 additions & 0 deletions _sources/community/contribute.rst
@@ -0,0 +1,5 @@

Contribute to PES-Learn!
========================

We welcome contributions and ideas for PES-Learn! To get started, check out the `developer documentation <../develop/dev_docs.html>`_.
5 changes: 5 additions & 0 deletions _sources/community/support.rst
@@ -0,0 +1,5 @@

PES-Learn Support
=================

WIP
3 changes: 3 additions & 0 deletions _sources/develop/dev_docs.rst
@@ -0,0 +1,3 @@

Developer Documentation coming soon!
====================================
396 changes: 396 additions & 0 deletions _sources/guides/api.rst


955 changes: 955 additions & 0 deletions _sources/guides/cli.rst


13 changes: 13 additions & 0 deletions _sources/guides/data_gen.rst
@@ -0,0 +1,13 @@
########################
Data Generation Examples
########################

***************************
**Generating with Schemas**
***************************



*****************************
**Generating with Templates**
*****************************
18 changes: 18 additions & 0 deletions _sources/guides/examples.rst
@@ -0,0 +1,18 @@
########
Examples
########

The following examples cover some specifics for each machine learning model type (GP, NN, KRR),
along with some other useful examples. Each example also covers some different keywords and methods.

Before checking out the examples, it is recommended to take a look at the
`Tutorials <tutorials.html>`_ page to get an understanding of how PES-Learn works and the
different ways to use it.

.. toctree::
   :maxdepth: 1

   Gaussian Process (GP) <gp_ex>
   Neural Network (NN) <nn_ex>
   Kernel Ridge Regression (KRR) <krr_ex>
   Data Generation <data_gen>
203 changes: 203 additions & 0 deletions _sources/guides/ext_data.rst
@@ -0,0 +1,203 @@
######################################################
Training Models with Datasets Not Created by PES-Learn
######################################################

PES-Learn supports building machine learning (ML) models from user-supplied datasets in many flexible formats.
This tutorial covers all of the different kinds of datasets that can be loaded in and used.

***************************
**Supported Dataset Types**
***************************

Cartesian Coordinates
#####################

When PES-Learn imports Cartesian coordinate files, it re-orders the atoms to its standard ordering scheme.
This was found to be necessary in order to enable the use of permutation invariant polynomials with externally
supplied datasets. PES-Learn's standard atom order sorts elements by most common occurrence, with an alphabetical
tiebreaker. For example, if the Cartesian coordinates of acetate (:math:`C_2H_3O_2`) were given in the order
C,C,H,H,H,O,O, they would be automatically re-ordered to :math:`H_3C_2O_2`.
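
As an illustration, the ordering rule above can be sketched in a few lines of Python (a hypothetical helper, not part of the PES-Learn API):

```python
from collections import Counter

def standard_atom_order(atoms):
    # Sort element labels by descending frequency, breaking ties
    # alphabetically -- a sketch of the rule described above.
    counts = Counter(atoms)
    return sorted(atoms, key=lambda a: (-counts[a], a))

# H sorts first (3 occurrences), then C and O (2 each, alphabetical tiebreaker)
print(standard_atom_order(["C", "C", "H", "H", "H", "O", "O"]))
# ['H', 'H', 'H', 'C', 'C', 'O', 'O']
```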

PES-Learn uses the set of interatomic distances for the geometries, which are defined to be the row-wise order
of the interatomic distance matrix in standard order:

.. code-block::

       H    H    H    C    C    O    O
   H
   H   r0
   H   r1   r2
   C   r3   r4   r5
   C   r6   r7   r8   r9
   O   r10  r11  r12  r13  r14
   O   r15  r16  r17  r18  r19  r20

Thus, in all the following water examples, the HOH atom order is internally reordered to HHO.
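
The row-wise distance ordering above can be sketched as follows (illustrative only; PES-Learn performs this transformation internally):

```python
import numpy as np

def interatomic_distances(coords):
    # Distances in row-wise lower-triangle order:
    # r0 = d(1,0), r1 = d(2,0), r2 = d(2,1), ...
    coords = np.asarray(coords, dtype=float)
    dists = []
    for i in range(1, len(coords)):
        for j in range(i):
            dists.append(np.linalg.norm(coords[i] - coords[j]))
    return dists

# Water geometry from the first example below, re-ordered HOH -> HHO
geom = [[0.0, -0.671751442127,  0.596572464600],   # H
        [0.0,  0.671751442127,  0.596572464600],   # H
        [0.0,  0.000000000000, -0.075178977527]]   # O
print(interatomic_distances(geom))  # [r(H,H), r(O,H), r(O,H)]
```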

The "standard" way to express geometry/energy pairs with Cartesian coordinates is the following:

.. code-block::

   3
   -76.02075832627291
   H 0.000000000000 -0.671751442127 0.596572464600
   O -0.000000000000 0.000000000000 -0.075178977527
   H -0.000000000000 0.671751442127 0.596572464600
   3
   -76.0264333762269331
   H 0.000000000000 -0.727742220982 0.542307610016
   O -0.000000000000 0.000000000000 -0.068340619196
   H -0.000000000000 0.727742220982 0.542307610016
   3
   -76.0261926533675592
   H 0.000000000000 -0.778194442078 0.483915467021
   O -0.000000000000 0.000000000000 -0.060982147482
   H -0.000000000000 0.778194442078 0.483915467021

Here, there is a number indicating the number of atoms, an energy on its own line in Hartrees,
and Cartesian coordinates in Angstroms.
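
A minimal parser for this block format might look like the following sketch (hypothetical; PES-Learn's own loader handles this and more). It assumes energy lines are lone numbers containing a decimal point, atom-count lines are lone integers, and atom lines are ``El x y z``:

```python
def parse_xyz_energy_blocks(text):
    # Parse "energy + Cartesian geometry" blocks like the example above.
    data = []          # list of (energy, [(element, (x, y, z)), ...])
    energy, atoms = None, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                     # blank lines are optional
        if len(fields) == 1:
            if "." in fields[0]:         # an energy line starts a new block
                if energy is not None:
                    data.append((energy, atoms))
                energy, atoms = float(fields[0]), []
            # otherwise a lone integer: the optional atom-count line
        else:
            atoms.append((fields[0], tuple(map(float, fields[1:4]))))
    if energy is not None:
        data.append((energy, atoms))
    return data
```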

Flexibility of Cartesian Coordinate input
-----------------------------------------

* The **atom number** is optional

  .. code-block::

     -76.02075832627291
     H 0.000000000000 -0.671751442127 0.596572464600
     O -0.000000000000 0.000000000000 -0.075178977527
     H -0.000000000000 0.671751442127 0.596572464600
     -76.0264333762269331
     H 0.000000000000 -0.727742220982 0.542307610016
     O -0.000000000000 0.000000000000 -0.068340619196
     H -0.000000000000 0.727742220982 0.542307610016
     -76.0261926533675592
     H 0.000000000000 -0.778194442078 0.483915467021
     O -0.000000000000 0.000000000000 -0.060982147482
     H -0.000000000000 0.778194442078 0.483915467021

* **Blank lines** between each datablock are optional

  .. code-block::

     -76.02075832627291
     H 0.000000000000 -0.671751442127 0.596572464600
     O -0.000000000000 0.000000000000 -0.075178977527
     H -0.000000000000 0.671751442127 0.596572464600

     -76.0264333762269331
     H 0.000000000000 -0.727742220982 0.542307610016
     O -0.000000000000 0.000000000000 -0.068340619196
     H -0.000000000000 0.727742220982 0.542307610016

     -76.0261926533675592
     H 0.000000000000 -0.778194442078 0.483915467021
     O -0.000000000000 0.000000000000 -0.060982147482
     H -0.000000000000 0.778194442078 0.483915467021

* Your **whitespace delimiters** do not matter at all, and can be completely erratic, if you're into that:

  .. code-block::

        -76.02075832627291
     H      0.000000000000     -0.671751442127  0.596572464600
       O -0.000000000000   0.000000000000        -0.075178977527
     H   -0.000000000000       0.671751442127 0.596572464600
     -76.0264333762269331
          H 0.000000000000   -0.727742220982      0.542307610016
     O     -0.000000000000 0.000000000000 -0.068340619196
     H -0.000000000000    0.727742220982     0.542307610016
        -76.0261926533675592
     H 0.000000000000 -0.778194442078          0.483915467021
     O      -0.000000000000   0.000000000000 -0.060982147482
     H -0.000000000000 0.778194442078   0.483915467021

* You can use Bohr instead of Angstroms (just remember that the model is then trained in terms of Bohr when you use it in the future!), and you can use whatever energy unit you want (though keep in mind that PES-Learn assumes Hartrees when converting units to wavenumbers (cm :math:`^{-1}`)).

Note that you don't need the ``units=bohr`` keyword when training an ML model on this dataset; that keyword is only
for using Bohr units when generating schemas.
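
For reference, converting between these units relies only on standard physical constants (CODATA values, not PES-Learn-specific settings):

```python
# Standard conversion factors useful when mixing units:
HARTREE_TO_WAVENUMBER = 219474.6313632  # 1 Hartree in cm^-1
BOHR_TO_ANGSTROM = 0.529177210903       # 1 bohr in Angstrom

delta_e = 0.01  # an energy difference in Hartrees
print(delta_e * HARTREE_TO_WAVENUMBER)  # ~2194.7 cm^-1
```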

Arbitrary Internal Coordinates
##############################

.. note::

   The keyword option ``use_pips`` should be set to ``false`` when using your own internal coordinates,
   unless the coordinates correspond to the standard order PES-Learn uses for interatomic distances, described above.

For internal coordinates, the first line must contain a series of geometry parameter labels, with the last column
labeled ``E`` for the energies. Internal coordinate files can use comma or whitespace delimiters. A few examples:

.. code-block::

   a1,r1,r2,E
   104.5,0.95,0.95,-76.026433
   123.0,0.95,0.95,-76.026193
   95.0,0.95,0.95,-76.021038

.. code-block::

   a1 r1 r2 E
   104.5 0.95 0.95 -76.026433
   123.0 0.95 0.95 -76.026193
   95.0 0.95 0.95 -76.021038

.. code-block::

   r0 r1 r2 E
   1.4554844420 0.9500000000 0.9500000000 -76.0264333762
   1.5563888842 0.9500000000 0.9500000000 -76.0261926534
   1.6454482672 0.9500000000 0.9500000000 -76.0210378425

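
Reading such a file reduces to splitting on commas or whitespace. A hypothetical sketch (``read_internal_coords`` is an illustrative name, not a PES-Learn function):

```python
import re

def read_internal_coords(text):
    # First line: parameter labels with the energy column labeled 'E';
    # remaining lines: comma- or whitespace-delimited values.
    split = lambda line: [f for f in re.split(r"[,\s]+", line.strip()) if f]
    lines = [l for l in text.splitlines() if l.strip()]
    labels = split(lines[0])
    e_col = labels.index("E")
    geoms, energies = [], []
    for line in lines[1:]:
        values = [float(v) for v in split(line)]
        energies.append(values[e_col])
        geoms.append([v for i, v in enumerate(values) if i != e_col])
    return labels, geoms, energies
```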
****************************************
**Creating ML models with the datasets**
****************************************

Using an external dataset called ``dataset_name`` is the same whether it is a Cartesian coordinate or
internal coordinate file.

With the Python API:

.. code-block:: python

   import peslearn

   input_string = ("""
   use_pips = false
   hp_maxit = 15
   training_points = 500
   sampling = structure_based
   """)
   input_obj = peslearn.InputProcessor(input_string)
   gp = peslearn.ml.GaussianProcess("dataset_name", input_obj)
   gp.optimize_model()

Using a Neural Network:

.. code-block:: python

   nn = peslearn.ml.NeuralNetwork("dataset_name", input_obj)
   nn.optimize_model()

Using the command line interface:

.. code-block::

   use_pips = false
   hp_maxit = 15
   training_points = 1000
   sampling = smart_random
   ml_model = gp
   pes_name = 'dataset_name'

Using the Python API, one can even partition and supply their own training, validation, and testing datasets:

.. code-block:: python

   nn = peslearn.ml.NeuralNetwork('full_dataset_name', input_obj,
                                  train_path='my_training_set',
                                  valid_path='my_validation_set',
                                  test_path='my_test_set')
   nn.optimize_model()

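
A simple random partition could be produced like this (an illustrative sketch; the function name and random strategy are assumptions, and PES-Learn's own sampling keywords such as ``structure_based`` are generally preferable):

```python
import random

def partition_dataset(rows, n_train, n_valid, seed=0):
    # Shuffle the dataset rows (header excluded) and slice into
    # training / validation / test subsets.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]
    return train, valid, test

# e.g. write each split (plus the original header line) to its own file,
# then pass those paths as train_path / valid_path / test_path.
```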
59 changes: 59 additions & 0 deletions _sources/guides/faq.rst
@@ -0,0 +1,59 @@

Frequently Asked Questions
==========================

#. **How do I install PES-Learn?**

* Check out the installation guide `here <../started/instalation.html>`_ for information on several different ways to install PES-Learn.

#. **How do I use PES-Learn?**

* The code can be used in two formats: either with an input file ``input.dat`` or with the Python API. See the Tutorials section for examples. If an input file is created, one just needs to run ``python path/to/PES-Learn/peslearn/driver.py`` while in the directory containing the input file. To use the Python API, create a Python file that imports peslearn (``import peslearn``). If you have compiled from source, this may require adding the package to your Python path: ``export PYTHONPATH="absolute/path/to/directory/containing/peslearn"``. This can be executed on the command line or added to your shell initializer (e.g. ``.bashrc``) for more permanent access.

#. **Why is data generation so slow?**

* First off, the data generation code's performance was improved 100-fold in `this pull request <https://github.com/CCQC/PES-Learn/pull/20>`_ (July 17th, 2019); update to this version if data generation is slow. Also, if one is generating a lot of points (1-10 million), one can expect slow performance when using ``grid_reduction = x`` for large values of x (20,000-100,000). Multiplying the grid increments together gives the total number of points, so if there are 6 geometry parameters with 10 increments each, that's 10^6 internal coordinate configurations. If you have disabled redundancy removal (``remove_redundancy=false``) and are not reducing the grid size to some value (e.g. ``grid_reduction=1000``), it is recommended to only generate tens of thousands of points at a time, because writing many directories/files can be quite expensive. If you are removing redundancies and/or filtering geometries, it is not recommended to generate more than a few million internal coordinate configurations. Finally, the algorithm behind ``remember_redundancies=true`` and ``grid_reduction = 10000`` can be slow in some circumstances.

#. **Why is my Machine learning model so bad?**

* 95% of the time it means the dataset is bad. Open the dataset and look at the energies. If it is a PES-Learn generated dataset, the energies are in increasing order by default (this can be disabled with ``sort_pes=false``). Scrolling through the dataset, the energies should be smoothly increasing; if there are large jumps in the energy values (typically towards the end of the file), those points are probably best deleted. If the dataset looks good, the ML algorithm probably just needs more training points in order to model the dimensionality and features of the surface. Either that, or PES-Learn's automated ML model optimization routines are just not working for your use case.

#. **Why is training machine learning models so slow?**

* Machine learning can be slow sometimes, especially when working with very large datasets. However, there are a few things you can do to speed up the process:

* Train over fewer hyperparameter iterations.

* Ensure multiple cores/threads are being used by your CPU. This can be done by checking which BLAS library NumPy is using:
* Open an interactive Python session with ``python``, then ``import numpy as np`` followed by ``np.show_config()``. If this displays a bunch of references to ``mkl``, NumPy is using Intel MKL; if it displays a bunch of references to ``openblas``, NumPy is using OpenBLAS. If NumPy is using MKL, you can control CPU usage with the environment variable ``MKL_NUM_THREADS=4``, or however many physical cores your CPU has (this is recommended by Intel; do not use hyperthreading). If NumPy is using OpenBLAS, you can control CPU usage with ``OMP_NUM_THREADS=8``, or however many threads are available. In bash, environment variables can be set by typing ``export OMP_NUM_THREADS=8`` into the command line. Note that instabilities such as memory leaks due to thread over-allocation can occur if *both* of these environment variables are set, depending on your configuration (i.e., if one is set to 4 or 8 or whatever, make sure to set the other to 1).
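
The steps above can be condensed into a small Python preamble (the thread counts are illustrative; pick values matching your hardware and your BLAS backend):

```python
import os

# Cap BLAS threading *before* NumPy is first imported. Pick the variable
# matching the backend np.show_config() reports, and pin the other to 1
# to avoid thread over-allocation.
os.environ["MKL_NUM_THREADS"] = "4"   # if NumPy links against Intel MKL
os.environ["OMP_NUM_THREADS"] = "1"   # pin the other backend's variable

import numpy as np
np.show_config()  # prints the linked BLAS/LAPACK configuration
```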

* Use an appropriate number of training points for the ML algorithm.
* Gaussian processes scale poorly with the number of training points: any more than 1,000-2,000 is unreasonable on a personal computer, while anything under 5,000 or so is reasonable when submitting to an external computing resource. Use neural networks or kernel ridge regression for large training sets. If training is still far too slow, you can try constraining the neural networks to use the Adam optimizer instead of the BFGS optimizer.

#. **How do I use this machine learning model?**

* When a model is finished training, PES-Learn exports a folder ``model1_data`` which contains, among other things, a Python script ``compute_energy.py`` with a convenience function ``pes()`` for evaluating energies with the ML model. Directions for use are written directly into the ``compute_energy.py`` file. The convenience function can be imported into other Python codes in the same directory with ``from compute_energy import pes``. This is in principle also accessible from codes written in other programming languages such as C or C++ through their respective Python APIs, though these can be tricky to use.

#. **What are all these hyperparameters?**

* ``scale_X`` is how each individual input (geometry parameter) is scaled. ``scale_y`` is how the energies (outputs) are scaled.
* ``std`` is standard scaling, each column of data is scaled to a mean of 0 and variance of 1.

* ``mm01`` is minmax scaling, each column of data is scaled such that it runs from 0 to 1

* ``mm11`` is minmax scaling with a range -1 to 1

* ``morse`` controls whether interatomic distances are transformed into Morse variables, :math:`r_1 \rightarrow e^{-r_1/\alpha}`

* ``pip`` stands for permutation invariant polynomials; i.e. the geometries are being transformed into a permutation invariant representation using the fundamental invariants library.

* ``degree_reduce`` controls whether each fundamental invariant polynomial value is taken to the :math:`1/n` power, where :math:`n` is the degree of the polynomial.

* ``layers`` is a list of the number of nodes in each hidden layer of the neural network.
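
A minimal sketch of these scalings (the Morse :math:`\alpha` value of 1.3 below is purely illustrative; in practice it is a tuned hyperparameter):

```python
import numpy as np

def scale(column, method):
    # Column-wise scalings corresponding to the options above (sketch).
    x = np.asarray(column, dtype=float)
    if method == "std":    # zero mean, unit variance
        return (x - x.mean()) / x.std()
    if method == "mm01":   # min-max into [0, 1]
        return (x - x.min()) / (x.max() - x.min())
    if method == "mm11":   # min-max into [-1, 1]
        return 2 * (x - x.min()) / (x.max() - x.min()) - 1
    if method == "morse":  # Morse variables; alpha = 1.3 is illustrative
        return np.exp(-x / 1.3)
    raise ValueError(f"unknown scaling: {method}")

r = np.array([0.95, 1.00, 1.05])
print(scale(r, "mm01"))   # [0.  0.5 1. ]
print(scale(r, "mm11"))   # [-1.  0.  1.]
```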

#. **How many points do I need to generate?**

* It's very hard to say what size of training set is required for a given target accuracy; it depends on a lot of things. First, the application: if you are doing a variational computation of the vibrational energy levels and only want the fundamentals, you might be able to get away with fewer points, because you really just need a good description of the surface around the minimum. If one wants high-lying vibrational states with VCI, the surface needs a lot more coverage, and therefore more points. If the application involves a reactive potential energy surface across several stationary points, even more points are needed. The structure of the surface itself can also influence the number of points needed. You don't know until you try. For a given system, one should try out a few internal coordinate grids, reduce them to some size with ``grid_reduction``, compute the points at a low level of theory, and see how well the models perform. This process can be automated with the Python API.

#. **How big can the molecular system be?**

* No more than 5-6 atoms for 'full' PESs. Any larger than that, and generating data by displacing in internal coordinates is impractical (if you have 6 atoms and want 5 increments along each internal coordinate, that's already ~240 million points). This is just an unfortunate reality of high-dimensional spaces: ample coverage over each coordinate and all possible coordinate displacement couplings requires an impossibly large grid of points for meaningful results. One can still treat large systems by scanning over only some of the coordinates. For example, you can do relaxed scans across the surface, fixing just a few internal coordinates and relaxing all others through geometry optimization at each point; creating a model of this 'sub-manifold' of the surface is no problem (i.e., train on the fixed coordinate parameters and 'learn' the relaxed energies). This is useful for inspecting reaction coordinates and reaction entrance channels, for example. Future releases will support including gradient information when training the model, which may allow for slightly larger systems and smaller dataset sizes; in theory, gradients give the models more indication of the curvature of the surface with fewer points.
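
The grid arithmetic behind that estimate:

```python
# A nonlinear 6-atom system has 3N - 6 = 12 internal coordinates, so even
# a modest 5 increments per coordinate multiplies out to an enormous grid.
n_atoms = 6
n_internal = 3 * n_atoms - 6        # 12 internal coordinates
increments = 5
total = increments ** n_internal
print(total)                        # 244140625, i.e. ~240 million points
```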

4 changes: 4 additions & 0 deletions _sources/guides/gp_ex.rst
@@ -0,0 +1,4 @@
###########################
Gaussian Process Regression
###########################

12 changes: 12 additions & 0 deletions _sources/guides/guide.rst
@@ -0,0 +1,12 @@


User Guides
===========

.. toctree::
   :maxdepth: 2

   Tutorials <tutorials>
   Frequently Asked Questions <faq>
   Examples <examples>

