Update docs (#37)
* Add doc sections

* Update doc pytest
reidjohnson authored Feb 25, 2024
1 parent 2d250ad commit 8f34801
Showing 10 changed files with 243 additions and 118 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
@@ -47,7 +47,7 @@ jobs:
      - name: Test with pytest
        run: |
          python -m pytest docs/*rst
          python -m pytest docs/user_guide/*rst
          python -m pytest --pyargs quantile_forest --cov=quantile_forest
- name: Upload coverage reports to Codecov
1 change: 1 addition & 0 deletions docs/conf.py
@@ -156,6 +156,7 @@ def setup(app):
# Custom sidebar templates, maps document names to template names.
html_sidebars = {
    "index": [],
    "releases/changes": [],
    "**": ["sidebar-nav-bs"],
}

42 changes: 9 additions & 33 deletions docs/install.rst → docs/getting_started/developers.rst
@@ -1,28 +1,11 @@
.. _install:
.. _developers:

Getting Started
===============

Prerequisites
-------------

The quantile-forest package requires the following dependencies:

* python (>=3.8)
* numpy (>=1.23)
* scikit-learn (>=1.0)
* scipy (>=1.4)

Install
-------

quantile-forest can be installed using `pip`::

    pip install quantile-forest

Developer Install
Developer's Guide
-----------------

Development Installation
~~~~~~~~~~~~~~~~~~~~~~~~

Building the package from source additionally requires the following dependencies:

* cython (>=3.0a4)
@@ -32,7 +15,7 @@ To manually build and install the package, run::
    pip install --verbose --editable .

Troubleshooting
---------------
~~~~~~~~~~~~~~~

If the build fails because SciPy is not installed, ensure OpenBLAS and LAPACK are available and accessible.
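
For example, on Debian/Ubuntu systems, these libraries can typically be installed with the following command (exact package names vary by distribution)::

    sudo apt-get install libopenblas-dev liblapack-dev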

@@ -43,28 +26,21 @@ On macOS, run::
    export SYSTEM_VERSION_COMPAT=1

Test and Coverage
-----------------
~~~~~~~~~~~~~~~~~

To test the code::

    $ python -m pytest quantile_forest -v

To test the documentation::

    $ python -m pytest docs/*rst
    $ python -m pytest docs/user_guide/*rst

Documentation
-------------
~~~~~~~~~~~~~

To build the documentation, run::

    $ pip install -r ./docs/sphinx_requirements.txt
    $ mkdir -p ./docs/_images
    $ sphinx-build -b html ./docs ./docs/_build

.. toctree::
   :maxdepth: 2
   :caption: Install
   :hidden:

   Getting Started <self>
29 changes: 29 additions & 0 deletions docs/getting_started/installation.rst
@@ -0,0 +1,29 @@
.. _install:

Getting Started
---------------

Prerequisites
~~~~~~~~~~~~~

The quantile-forest package requires the following dependencies:

* python (>=3.8)
* numpy (>=1.23)
* scikit-learn (>=1.0)
* scipy (>=1.4)

Installation
~~~~~~~~~~~~

quantile-forest can be installed using `pip`::

    pip install quantile-forest

.. toctree::
   :maxdepth: 1
   :caption: Installation
   :hidden:

   self
   developers
7 changes: 4 additions & 3 deletions docs/index.rst
@@ -26,7 +26,7 @@ quantile-forest
A guide that provides installation requirements and instructions, as well as procedures for developers.

.. grid-item-card:: User Guide
   :link: user_guide
   :link: user-guide-intro
   :link-type: ref
   :link-alt: User guide

@@ -50,9 +50,10 @@ quantile-forest
   :maxdepth: 1
   :hidden:

   Getting Started <install>
   User Guide <user_guide>
   Getting Started <getting_started/installation>
   User Guide <user_guide/introduction>
   Examples <gallery/index>
   API <api>
   Release Notes <releases/changes>

.. _GitHub: http://github.com/zillow/quantile-forest
115 changes: 115 additions & 0 deletions docs/releases/changes.rst
@@ -0,0 +1,115 @@
:html_theme.sidebar_secondary.remove:

.. _changes:

Release Notes
=============

Version 1.3.4 (released Feb 21, 2024)
-------------------------------------

- Reorder multi-target outputs (#35)
- Add tests for model serialization (#36)
- Update and fix documentation and examples

Version 1.3.3 (released Feb 16, 2024)
-------------------------------------

- Set default value of `weighted_leaves` at prediction time to False (#34)
- Update and fix documentation and examples

Version 1.3.2 (released Feb 15, 2024)
-------------------------------------

- Fix bug in multi-target output when `max_samples_leaf` > 1 (#30)
- Update quantile forest examples (#31)
- Update and fix documentation (#33)

Version 1.3.1 (released Feb 12, 2024)
-------------------------------------

- Fix single-output performance regression (#29)

Version 1.3.0 (released Feb 11, 2024)
-------------------------------------

- Support for multiple-output quantile regression (#26)
- Update conformalized quantile regression example (#28)

Version 1.2.5 (released Feb 10, 2024)
-------------------------------------

- Fix weighted leaf and quantile bug (#27)

Version 1.2.4 (released Jan 16, 2024)
-------------------------------------

- Use base model parameter validation when available
- Resolve Cython 3 deprecation warnings

Version 1.2.3 (released Oct 09, 2023)
-------------------------------------

- Fix bug that could prevent interpolation from being correctly applied (#15)
- Update documentation and docstrings

Version 1.2.2 (released Oct 08, 2023)
-------------------------------------

- Optimize performance for predictions when `max_samples_leaf` = 1 (#13)
- Update documentation and examples (#14)

Version 1.2.1 (released Oct 04, 2023)
-------------------------------------

- More efficient calculation of weighted quantiles (#11)
- Add support for Python version 3.12

Version 1.2.0 (released Aug 01, 2023)
-------------------------------------

- Add optional default_quantiles parameter to the model initialization
- Update documentation

Version 1.1.3 (released Jul 08, 2023)
-------------------------------------

- Fix building from the source distribution
- Minor update to documentation

Version 1.1.2 (released Mar 22, 2023)
-------------------------------------

- Fix for compatibility with development version of scikit-learn
- Update documentation and examples

Version 1.1.1 (released Dec 19, 2022)
-------------------------------------

- Fix for compatibility with scikit-learn 1.2.0
- Fix to documentation
- Update version requirements

Version 1.1.0 (released Nov 07, 2022)
-------------------------------------

- Update default `max_samples_leaf` to 1 (previously None)
- Update documentation and unit tests
- Miscellaneous update for compatibility with scikit-learn >= 1.1.0

This version supports Python versions 3.8 to 3.11. Note that support for 32-bit Python on Windows has been dropped in this release.

Version 1.0.2 (released Mar 28, 2022)
-------------------------------------

- Add sample weighting by leaf size

Version 1.0.1 (released Mar 23, 2022)
-------------------------------------

- Suppresses UserWarning

Version 1.0.0 (released Mar 23, 2022)
-------------------------------------

Initial release.
82 changes: 1 addition & 81 deletions docs/user_guide.rst → docs/user_guide/fit_predict.rst
@@ -1,35 +1,4 @@
.. _user_guide:

User Guide
==========

Introduction
------------

Random forests have proven to be very popular and powerful for regression and classification. For regression, random forests give an accurate approximation of the conditional mean of a response variable. That is, if we let :math:`Y` be a real-valued response variable and :math:`X` a covariate or predictor variable, they estimate :math:`E(Y | X)`, which can be interpreted as the expected value of the output :math:`Y` given the input :math:`X`.

However, random forests provide information about the full conditional distribution of the response variable, not only about the conditional mean. Quantile regression forests, a generalization of random forests, can be used to infer conditional quantiles. That is, they return :math:`y` at :math:`q` for which :math:`F(Y=y|X) = q`, where :math:`q` is the quantile.

The quantiles give more complete information about the distribution of :math:`Y` as a function of the predictor variable :math:`X` than the conditional mean alone. They can be useful, for example, to build prediction intervals or to perform outlier detection in a high-dimensional dataset.

In practice, empirical quantile estimates can be calculated in several ways. In this package, a desired quantile is calculated from the input rank :math:`x` such that :math:`x = (N + 1 - 2C)q + C`, where :math:`q` is the quantile, :math:`N` is the number of samples, and :math:`C` is a constant (degree of freedom); this package uses :math:`C = 1`. The package provides methods that calculate quantiles using samples that are weighted and unweighted. In a weighted quantile, :math:`N` is calculated from the fraction of the total weight instead of the total number of samples.

Quantile Regression Forests
---------------------------

A standard decision tree can be extended in a straightforward way to estimate conditional quantiles. When a decision tree is fit, rather than storing only the sufficient statistics of the response variable at the leaf node, such as the mean and variance, all of the response values can be stored with the leaf node. At prediction time, these values can then be used to calculate empirical quantile estimates.

The quantile-based approach can be extended to random forests. To estimate :math:`F(Y=y|x) = q`, each response value in the training set is given a weight or frequency. Formally, the weight or frequency given to the :math:`j`\th training sample, :math:`y_j`, while estimating the quantile is

.. math::

    \frac{1}{T} \sum_{t=1}^{T} \frac{\mathbb{1}(y_j \in L(x))}{\sum_{i=1}^N \mathbb{1}(y_i \in L(x))},

where :math:`L(x)` denotes the leaf that :math:`x` falls into.

Informally, this means that given a new unknown sample, we first find the leaf that it falls into for each tree in the ensemble. Each training sample :math:`y_j` that falls into the same leaf as the new sample is given a weight that equals the fraction of samples in the leaf. Each :math:`y_j` that does not fall into the same leaf as the new sample is given a weight or frequency of zero. The weights or frequencies for each :math:`y_j` are then summed or aggregated across all of the trees in the ensemble. This information can then be used to calculate the empirical quantile estimates.

This approach was first proposed by :cite:t:`2006:meinshausen`.
.. _user-guide-fit-predict:

Fitting and Predicting
----------------------
@@ -171,52 +140,3 @@ The predictions of a standard random forest can also be recovered from a quantil
    >>> y_pred_qrf = qrf.predict(X_test, **kwargs)
    >>> np.allclose(y_pred_rf, y_pred_qrf)
    True

Quantile Ranks
--------------

The quantile rank is the fraction of scores in a frequency distribution that are less than (or equal to) that score. For a quantile forest, the frequency distribution is the set of training sample response values that are used to construct the empirical quantile estimates. The quantile rank of each sample is calculated by aggregating the response values from all of the training samples that share the same leaf node across all of the trees. The output quantile rank will be a value in the range [0, 1] for each test sample::

    >>> from sklearn import datasets
    >>> from sklearn.model_selection import train_test_split
    >>> from quantile_forest import RandomForestQuantileRegressor
    >>> X, y = datasets.load_diabetes(return_X_y=True)
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    >>> reg = RandomForestQuantileRegressor().fit(X_train, y_train)
    >>> y_ranks = reg.quantile_ranks(X_test, y_test)  # quantile ranks of y_test

Out-of-bag (OOB) quantile ranks can be returned by specifying `oob_score=True`::

    >>> y_ranks_oob = reg.quantile_ranks(X_train, y_train, oob_score=True)

Proximity Counts
----------------

Proximity counts are counts of the number of times that two samples share a leaf node. When a test set is present, the proximity counts of each sample in the test set with each sample in the training set can be computed::

    >>> from sklearn import datasets
    >>> from sklearn.model_selection import train_test_split
    >>> from quantile_forest import RandomForestQuantileRegressor
    >>> X, y = datasets.load_diabetes(return_X_y=True)
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    >>> reg = RandomForestQuantileRegressor().fit(X_train, y_train)
    >>> proximities = reg.proximity_counts(X_test)  # proximity counts for X_test

For each test sample, the method outputs a list of tuples of the training index and proximity count, listed in descending order by proximity count. For example, a test sample with an output of [(1, 5), (0, 3), (3, 1)] means that the test sample shared 5, 3, and 1 leaf nodes with the training samples that were (zero-)indexed as 1, 0, and 3 during model fitting, respectively.

The maximum number of proximity counts output per test sample can be limited by specifying `max_proximities`::

    >>> import numpy as np
    >>> proximities = reg.proximity_counts(X_test, max_proximities=10)
    >>> np.all([len(prox) <= 10 for prox in proximities])
    True

Out-of-bag (OOB) proximity counts can be returned by specifying `oob_score=True`::

    >>> proximities = reg.proximity_counts(X_train, oob_score=True)

.. toctree::
   :maxdepth: 2
   :caption: User Guide
   :hidden:

   User Guide <self>
39 changes: 39 additions & 0 deletions docs/user_guide/introduction.rst
@@ -0,0 +1,39 @@
.. _user-guide-intro:

Introduction
------------

Random forests have proven to be very popular and powerful for regression and classification. For regression, random forests give an accurate approximation of the conditional mean of a response variable. That is, if we let :math:`Y` be a real-valued response variable and :math:`X` a covariate or predictor variable, they estimate :math:`E(Y | X)`, which can be interpreted as the expected value of the output :math:`Y` given the input :math:`X`.

However, random forests provide information about the full conditional distribution of the response variable, not only about the conditional mean. Quantile regression forests, a generalization of random forests, can be used to infer conditional quantiles. That is, they return :math:`y` at :math:`q` for which :math:`F(Y=y|X) = q`, where :math:`q` is the quantile.

The quantiles give more complete information about the distribution of :math:`Y` as a function of the predictor variable :math:`X` than the conditional mean alone. They can be useful, for example, to build prediction intervals or to perform outlier detection in a high-dimensional dataset.

In practice, empirical quantile estimates can be calculated in several ways. In this package, a desired quantile is calculated from the input rank :math:`x` such that :math:`x = (N + 1 - 2C)q + C`, where :math:`q` is the quantile, :math:`N` is the number of samples, and :math:`C` is a constant (degree of freedom); this package uses :math:`C = 1`. The package provides methods that calculate quantiles using samples that are weighted and unweighted. In a weighted quantile, :math:`N` is calculated from the fraction of the total weight instead of the total number of samples.
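
As a rough illustration, the unweighted rank-based calculation can be sketched with NumPy as follows (the function below is illustrative only and is not part of the package API)::

    import numpy as np

    def quantile_from_rank(y_sorted, q, C=1):
        """Estimate the q-th quantile via the rank x = (N + 1 - 2C)q + C."""
        N = len(y_sorted)
        x = (N + 1 - 2 * C) * q + C  # 1-based fractional rank
        lo, hi = int(np.floor(x)) - 1, int(np.ceil(x)) - 1  # 0-based indices
        frac = x - np.floor(x)  # fraction used for linear interpolation
        return (1 - frac) * y_sorted[lo] + frac * y_sorted[hi]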

Quantile Regression Forests
---------------------------

A standard decision tree can be extended in a straightforward way to estimate conditional quantiles. When a decision tree is fit, rather than storing only the sufficient statistics of the response variable at the leaf node, such as the mean and variance, all of the response values can be stored with the leaf node. At prediction time, these values can then be used to calculate empirical quantile estimates.

The quantile-based approach can be extended to random forests. To estimate :math:`F(Y=y|x) = q`, each response value in the training set is given a weight or frequency. Formally, the weight or frequency given to the :math:`j`\th training sample, :math:`y_j`, while estimating the quantile is

.. math::

    \frac{1}{T} \sum_{t=1}^{T} \frac{\mathbb{1}(y_j \in L(x))}{\sum_{i=1}^N \mathbb{1}(y_i \in L(x))},

where :math:`L(x)` denotes the leaf that :math:`x` falls into.

Informally, this means that given a new unknown sample, we first find the leaf that it falls into for each tree in the ensemble. Each training sample :math:`y_j` that falls into the same leaf as the new sample is given a weight that equals the fraction of samples in the leaf. Each :math:`y_j` that does not fall into the same leaf as the new sample is given a weight or frequency of zero. The weights or frequencies for each :math:`y_j` are then summed or aggregated across all of the trees in the ensemble. This information can then be used to calculate the empirical quantile estimates.

This approach was first proposed by :cite:t:`2006:meinshausen`.
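
As a rough sketch of this weighting scheme using a standard scikit-learn forest (a simplified illustration that ignores per-tree bootstrap sampling, not the package's actual implementation)::

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    X, y = load_diabetes(return_X_y=True)
    rf = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

    x_new = X[:1]  # a single new sample
    train_leaves = rf.apply(X)  # (n_train, n_trees) leaf indices
    new_leaves = rf.apply(x_new)  # (1, n_trees) leaf indices

    weights = np.zeros(len(X))
    for t in range(train_leaves.shape[1]):
        in_leaf = train_leaves[:, t] == new_leaves[0, t]  # 1(y_j in L(x))
        weights += in_leaf / in_leaf.sum()  # fraction of samples in the leaf
    weights /= train_leaves.shape[1]  # average over the T trees

The resulting `weights` sum to one and can then be used to compute empirical quantile estimates over the training responses :math:`y_j`.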

.. toctree::
   :maxdepth: 1
   :caption: User Guide
   :hidden:

   self
   fit_predict
   quantile_ranks
   proximities