misc improvements
Carlos Hernandez committed Jun 22, 2016
1 parent 53c6459 commit 11319c4
Showing 4 changed files with 139 additions and 12 deletions.
2 changes: 1 addition & 1 deletion docs/config_file.rst
@@ -47,7 +47,7 @@ Search Space
The search space describes the space of hyperparameters to search over
to find the best model. It is specified as the product space of
bounded intervals for different variables, which can either be of type
``int``, ``float``, or ``enum``. Variables of type ``float`` can also
``int``, ``float``, ``jump``, or ``enum``. Variables of type ``float`` can also
be warped into log-space, which means that the optimization will be
performed on the log of the parameter instead of the parameter itself.
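
As a rough illustration of what warping a ``float`` variable into log-space means, the sketch below draws values uniformly in the logarithm of the interval and maps them back. It is not Osprey's implementation; ``sample_float`` and the bounds ``1e-3`` and ``10.0`` are hypothetical names and values chosen for the example.

```python
import math
import random


def sample_float(low, high, warp=None, rng=random):
    """Draw one value from [low, high], optionally uniform in log-space.

    Hypothetical helper for illustration only, not part of Osprey.
    """
    if warp == "log":
        # Sample uniformly between log(low) and log(high), then map back.
        log_value = rng.uniform(math.log(low), math.log(high))
        return math.exp(log_value)
    return rng.uniform(low, high)


rng = random.Random(0)
samples = [sample_float(1e-3, 10.0, warp="log", rng=rng) for _ in range(1000)]

# With log warping, roughly a quarter of the draws fall in each decade
# between 1e-3 and 10, so most samples land below 1. A plain uniform draw
# over the same interval would put about 90% of the samples above 1.
below_one = sum(s < 1.0 for s in samples)
```

Log warping is usually worth considering for parameters whose plausible values span several orders of magnitude, since each order of magnitude then receives a similar share of the search effort.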

145 changes: 136 additions & 9 deletions docs/getting_started.rst
@@ -2,7 +2,10 @@
Getting Started
===============

Getting started with Osprey is as easy as setting up a single ``YAML``
Introduction
------------

Getting started with Osprey is as simple as setting up a single ``YAML``
configuration file. This configuration file will contain your model
estimators (``estimator``), hyperparameter search strategy
(``strategy``), hyperparameter search space (``search_space``), dataset
@@ -12,8 +15,11 @@ over how to set up a basic Osprey toy project and then a more realistic
example for a `molecular
dynamics <https://en.wikipedia.org/wiki/Molecular_dynamics>`__ dataset.

First, we'll begin with a simple C-Support Vector Classification example
using ``sklearn`` to introduce the basic ``YAML`` fields for Osprey. To
``scikit-learn`` Example
------------------------

First, we'll begin with a basic C-Support Vector Classification example
using ``scikit-learn`` to introduce the basic ``YAML`` fields for Osprey. To
tell Osprey that we want to use ``sklearn``'s ``SVC`` as our estimator,
we can type:

@@ -22,16 +28,13 @@ we can type:
    estimator:
      entry_point: sklearn.svm.SVC

If we want to use `gaussian process
prediction <https://en.wikipedia.org/wiki/Gaussian_process#Gaussian_process_prediction.2C_or_kriging>`__
to decide where to search in hyperparameter space, we can add:
If we want to use random search to decide where to search next in
hyperparameter space, we can add:

.. code:: yaml
strategy:
name: gp
params:
seeds: 5
name: random
The search space can be defined for any hyperparameter available in the
``estimator`` class. Here we can adjust the value range of the ``C`` and
@@ -53,18 +56,142 @@ The search space can be defined for any hyperparameter available in the
warp: log
type: float
To perform 5-fold cross validation, we add:

.. code:: yaml

    cv: 5

To load the digits classification example dataset from ``scikit-learn``,
we write:

.. code:: yaml

    dataset_loader:
      name: sklearn_dataset
      params:
        method: load_digits

And finally we need to list the SQL database where our cross-validation
results will be saved:

.. code:: yaml

    trials:
      uri: sqlite:///osprey-trials.db

Once all of this has been written to a ``YAML`` file (e.g. ``config.yaml``),
we can start an Osprey job from the command line by invoking:

.. code:: bash

    $ osprey worker config.yaml

``msmbuilder`` Example
----------------------

Now that we understand the basics, we can move on to a more practical example.
This section will go over how to set up an Osprey configuration for
cross-validating Markov state models from protein simulations. Our model will
be constructed by first calculating torsion angles, then performing
dimensionality reduction using tICA, clustering using mini-batch k-means, and,
finally, fitting a maximum-likelihood estimated Markov state model.

We begin by defining a ``Pipeline`` which will construct our desired model:

.. code:: yaml

    estimator:
      eval: |
        Pipeline([
            ('featurizer', DihedralFeaturizer()),
            ('tica', tICA()),
            ('cluster', MiniBatchKMeans()),
            ('msm', MarkovStateModel(n_timescales=5, verbose=False)),
        ])
      eval_scope: msmbuilder

Notice that we can easily set default parameters (e.g. ``msm.n_timescales``)
in our ``Pipeline`` even if we don't plan on optimizing them.

If we wish to use `Gaussian process
prediction <https://en.wikipedia.org/wiki/Gaussian_process#Gaussian_process_prediction.2C_or_kriging>`__
to decide where to search in hyperparameter space, we can add:

.. code:: yaml

    strategy:
      name: gp
      params:
        seeds: 50

In this example, we'll be optimizing the type of featurization,
the number of cluster centers and the number of independent components:

.. code:: yaml

    search_space:
      featurizer__types:
        choices:
          - ['phi', 'psi']
          - ['phi', 'psi', 'chi1']
        type: enum
      tica__n_components:
        min: 2
        max: 5
        type: int
      cluster__n_clusters:
        min: 10
        max: 100
        type: int

As seen in the previous example, we'll set ``tica__n_components`` and
``cluster__n_clusters`` as integers with a set range. Notice that we can
change which torsion angles to use in our featurization by creating an ``enum``
which contains a list of different dihedral angle types.
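
The search space keys above follow scikit-learn's ``<step>__<parameter>`` naming convention for nested parameters, where the part before the double underscore names a ``Pipeline`` step. The sketch below only illustrates how such keys map onto pipeline steps; ``split_param`` is a hypothetical helper written for this example, not an Osprey or scikit-learn function.

```python
def split_param(key):
    """Split a 'step__param' key into (step, param).

    Hypothetical helper for illustration; it mirrors the double-underscore
    convention scikit-learn uses for nested parameters.
    """
    step, sep, param = key.partition("__")
    if not sep:
        raise ValueError("expected a key of the form 'step__param': %r" % key)
    return step, param


search_space_keys = [
    "featurizer__types",
    "tica__n_components",
    "cluster__n_clusters",
]

# Group the searched parameters by the pipeline step they belong to.
by_step = {}
for key in search_space_keys:
    step, param = split_param(key)
    by_step.setdefault(step, []).append(param)
```

Here ``by_step`` groups ``types`` under ``featurizer``, ``n_components`` under ``tica``, and ``n_clusters`` under ``cluster``, matching the step names given in the ``Pipeline`` definition.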


In this example, we'll be using 50-50 ``shufflesplit`` cross-validation.
This method is optimal for Markov state model cross-validation, as it maximizes
the amount of unique data available in your training and test sets:

.. code:: yaml

    cv:
      name: shufflesplit
      params:
        n_iter: 5
        test_size: 0.5

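
To make the ``n_iter`` and ``test_size`` settings concrete, here is a minimal standard-library sketch of shuffle-split cross-validation: each iteration reshuffles the sample indices and holds out a fixed fraction as the test set. The ``shuffle_split`` function and the sample count of 10 are assumptions made for illustration; Osprey's own splitter may differ in details such as rounding.

```python
import random


def shuffle_split(n_samples, n_iter=5, test_size=0.5, seed=None):
    """Yield (train, test) index lists from independent random shuffles.

    Hypothetical stand-in for a shuffle-split cross-validator.
    """
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_size))
    for _ in range(n_iter):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled indices form the test set,
        # the remainder forms the training set.
        yield indices[n_test:], indices[:n_test]


splits = list(shuffle_split(n_samples=10, n_iter=5, test_size=0.5, seed=0))
for train, test in splits:
    # Within one iteration the two halves never share a sample.
    assert not set(train) & set(test)
```

Unlike k-fold cross-validation, the test sets of different iterations may overlap, because every iteration draws a fresh shuffle.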
We'll be using MDTraj to load our trajectories. Osprey already includes an
``mdtraj`` dataset loader to make it easy to list your trajectory and topology
files as a glob-string:

.. code:: yaml

    dataset_loader:
      name: mdtraj
      params:
        trajectories: ~/local/msmbuilder/Tutorial/XTC/*/*.xtc
        topology: ~/local/msmbuilder/Tutorial/native.pdb
        stride: 1

And finally we need to list the SQL database where our cross-validation
results will be saved:

.. code:: yaml

    trials:
      uri: sqlite:///osprey-trials.db

Just as before, once all of this has been written to a ``YAML`` file,
we can start an Osprey job from the command line by invoking:

.. code:: bash

    $ osprey worker config.yaml

2 changes: 1 addition & 1 deletion osprey/cli/parser_currentbest.py
@@ -11,7 +11,7 @@ def func(args, parser):


def configure_parser(sub_parsers):
    help = 'Get paramters for the current best model'
    help = 'Get parameters for the current best model'
    p = sub_parsers.add_parser('current_best', description=help, help=help,
                               formatter_class=ArgumentDefaultsHelpFormatter)
    p.add_argument('config', help='Path to worker config file (yaml)')
2 changes: 1 addition & 1 deletion osprey/config.py
@@ -8,7 +8,7 @@
instance of sklearn.base.BaseEstimator.
- search_space: the specification of the hyperparameter search space
- strategy: strategy for adaptive exploration of hyperparameters.
- dataset_lodaer: the specification of the dataset to fit the models with.
- dataset_loader: the specification of the dataset to fit the models with.
- trials: as each hyperparameter setting is explored, the results are
serialized to a database specified in this section.
- cv: specification for cross-validation.