diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 5c6ac05..0d14d9f 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -47,7 +47,7 @@ jobs:

       - name: Test with pytest
         run: |
-          python -m pytest docs/*rst
+          python -m pytest docs/user_guide/*rst
           python -m pytest --pyargs quantile_forest --cov=quantile_forest

       - name: Upload coverage reports to Codecov
diff --git a/docs/conf.py b/docs/conf.py
index c63a92d..11000d6 100755
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -156,6 +156,7 @@ def setup(app):
 # Custom sidebar templates, maps document names to template names.
 html_sidebars = {
     "index": [],
+    "releases/changes": [],
     "**": ["sidebar-nav-bs"],
 }

diff --git a/docs/install.rst b/docs/getting_started/developers.rst
similarity index 59%
rename from docs/install.rst
rename to docs/getting_started/developers.rst
index f2be6fa..ba9783f 100755
--- a/docs/install.rst
+++ b/docs/getting_started/developers.rst
@@ -1,28 +1,11 @@
-.. _install:
+.. _developers:

-Getting Started
-===============
-
-Prerequisites
--------------
-
-The quantile-forest package requires the following dependencies:
-
-* python (>=3.8)
-* numpy (>=1.23)
-* scikit-learn (>=1.0)
-* scipy (>=1.4)
-
-Install
--------
-
-quantile-forest can be installed using `pip`::
-
-  pip install quantile-forest
-
-Developer Install
+Developer's Guide
 -----------------

+Development Installation
+~~~~~~~~~~~~~~~~~~~~~~~~
+
 Building the package from source additionally requires the following dependencies:

 * cython (>=3.0a4)
@@ -32,7 +15,7 @@ To manually build and install the package, run::
   pip install --verbose --editable .

 Troubleshooting
----------------
+~~~~~~~~~~~~~~~

 If the build fails because SciPy is not installed, ensure OpenBLAS and LAPACK are available and accessible.

@@ -43,7 +26,7 @@ On macOS, run::
   export SYSTEM_VERSION_COMPAT=1

 Test and Coverage
------------------
+~~~~~~~~~~~~~~~~~

 To test the code::

@@ -51,20 +34,13 @@ To test the code::

 To test the documentation::

-  $ python -m pytest docs/*rst
+  $ python -m pytest docs/user_guide/*rst

 Documentation
--------------
+~~~~~~~~~~~~~

 To build the documentation, run::

   $ pip install -r ./docs/sphinx_requirements.txt
   $ mkdir -p ./docs/_images
   $ sphinx-build -b html ./docs ./docs/_build
-
-.. toctree::
-  :maxdepth: 2
-  :caption: Install
-  :hidden:
-
-  Getting Started
diff --git a/docs/getting_started/installation.rst b/docs/getting_started/installation.rst
new file mode 100755
index 0000000..3168621
--- /dev/null
+++ b/docs/getting_started/installation.rst
@@ -0,0 +1,29 @@
+.. _install:
+
+Getting Started
+---------------
+
+Prerequisites
+~~~~~~~~~~~~~
+
+The quantile-forest package requires the following dependencies:
+
+* python (>=3.8)
+* numpy (>=1.23)
+* scikit-learn (>=1.0)
+* scipy (>=1.4)
+
+Installation
+~~~~~~~~~~~~
+
+quantile-forest can be installed using `pip`::
+
+  pip install quantile-forest
+
+.. toctree::
+  :maxdepth: 1
+  :caption: Installation
+  :hidden:
+
+  self
+  developers
diff --git a/docs/index.rst b/docs/index.rst
index 97d7f63..5170d93 100755
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -26,7 +26,7 @@ quantile-forest
     A guide that provides installation requirements and instructions, as well as procedures for developers.

   .. grid-item-card:: User Guide
-    :link: user_guide
+    :link: user-guide-intro
     :link-type: ref
     :link-alt: User guide

@@ -50,9 +50,10 @@ quantile-forest
   :maxdepth: 1
   :hidden:

-  Getting Started
-  User Guide
+  Getting Started
+  User Guide
   Examples
   API
+  Release Notes

 .. _GitHub: http://github.com/zillow/quantile-forest
diff --git a/docs/releases/changes.rst b/docs/releases/changes.rst
new file mode 100755
index 0000000..32871f4
--- /dev/null
+++ b/docs/releases/changes.rst
@@ -0,0 +1,115 @@
+:html_theme.sidebar_secondary.remove:
+
+.. _changes:
+
+Release Notes
+=============
+
+Version 1.3.4 (released Feb 21, 2024)
+-------------------------------------
+
+- Reorder multi-target outputs (#35)
+- Add tests for model serialization (#36)
+- Update and fix documentation and examples
+
+Version 1.3.3 (released Feb 16, 2024)
+-------------------------------------
+
+- Set default value of `weighted_leaves` at prediction time to False (#34)
+- Update and fix documentation and examples
+
+Version 1.3.2 (released Feb 15, 2024)
+-------------------------------------
+
+- Fix bug in multi-target output when `max_samples_leaf` > 1 (#30)
+- Update quantile forest examples (#31)
+- Update and fix documentation (#33)
+
+Version 1.3.1 (released Feb 12, 2024)
+-------------------------------------
+
+- Fix single-output performance regression (#29)
+
+Version 1.3.0 (released Feb 11, 2024)
+-------------------------------------
+
+- Support for multiple-output quantile regression (#26)
+- Update conformalized quantile regression example (#28)
+
+Version 1.2.5 (released Feb 10, 2024)
+-------------------------------------
+
+- Fix weighted leaf and quantile bug (#27)
+
+Version 1.2.4 (released Jan 16, 2024)
+-------------------------------------
+
+- Use base model parameter validation when available
+- Resolve Cython 3 deprecation warnings
+
+Version 1.2.3 (released Oct 09, 2023)
+-------------------------------------
+
+- Fix bug that could prevent interpolation from being correctly applied (#15)
+- Update documentation and docstrings
+
+Version 1.2.2 (released Oct 08, 2023)
+-------------------------------------
+
+- Optimize performance for predictions when `max_samples_leaf` = 1 (#13)
+- Update documentation and examples (#14)
+
+Version 1.2.1 (released Oct 04, 2023)
+-------------------------------------
+
+- More efficient calculation of weighted quantiles (#11)
+- Add support for Python version 3.12
+
+Version 1.2.0 (released Aug 01, 2023)
+-------------------------------------
+
+- Add optional default_quantiles parameter to the model initialization
+- Update documentation
+
+Version 1.1.3 (released Jul 08, 2023)
+-------------------------------------
+
+- Fix building from the source distribution
+- Minor update to documentation
+
+Version 1.1.2 (released Mar 22, 2023)
+-------------------------------------
+
+- Fix for compatibility with development version of scikit-learn
+- Update documentation and examples
+
+Version 1.1.1 (released Dec 19, 2022)
+-------------------------------------
+
+- Fix for compatibility with scikit-learn 1.2.0
+- Fix to documentation
+- Update version requirements
+
+Version 1.1.0 (released Nov 07, 2022)
+-------------------------------------
+
+- Update default `max_samples_leaf` to 1 (previously None)
+- Update documentation and unit tests
+- Miscellaneous update for compatibility with scikit-learn >= 1.1.0
+
+This version supports Python versions 3.8 to 3.11. Note that support for 32-bit Python on Windows has been dropped in this release.
+
+Version 1.0.2 (released Mar 28, 2022)
+-------------------------------------
+
+- Add sample weighting by leaf size
+
+Version 1.0.1 (released Mar 23, 2022)
+-------------------------------------
+
+- Suppresses UserWarning
+
+Version 1.0.0 (released Mar 23, 2022)
+-------------------------------------
+
+Initial release.
diff --git a/docs/user_guide.rst b/docs/user_guide/fit_predict.rst
similarity index 58%
rename from docs/user_guide.rst
rename to docs/user_guide/fit_predict.rst
index c6afd18..bbaed97 100755
--- a/docs/user_guide.rst
+++ b/docs/user_guide/fit_predict.rst
@@ -1,35 +1,4 @@
-.. _user_guide:
-
-User Guide
-==========
-
-Introduction
-------------
-
-Random forests have proven to be very popular and powerful for regression and classification. For regression, random forests give an accurate approximation of the conditional mean of a response variable. That is, if we let :math:`Y` be a real-valued response variable and :math:`X` a covariate or predictor variable, they estimate :math:`E(Y | X)`, which can be interpreted as the expected value of the output :math:`Y` given the input :math:`X`.
-
-However random forests provide information about the full conditional distribution of the response variable, not only about the conditional mean. Quantile regression forests, a generalization of random forests, can be used to infer conditional quantiles. That is, they return :math:`y` at :math:`q` for which :math:`F(Y=y|X) = q`, where :math:`q` is the quantile.
-
-The quantiles give more complete information about the distribution of :math:`Y` as a function of the predictor variable :math:`X` than the conditional mean alone. They can be useful, for example, to build prediction intervals or to perform outlier detection in a high-dimensional dataset.
-
-In practice, the empirical estimation of quantiles can be calculated in several ways. In this package, a desired quantile is calculated from the input rank :math:`x` such that :math:`x = (N + 1 - 2C)q + C`, where :math:`q` is the quantile, :math:`N` is the number of samples, and :math:`C` is a constant (degree of freedom). In this package, :math:`C = 1`. This package provides methods that calculate quantiles using samples that are weighted and unweighted. In a weighted quantile, :math:`N` is calculated from the fraction of the total weight instead of the total number of samples.
-
-Quantile Regression Forests
----------------------------
-
-A standard decision tree can be extended in a straightforward way to estimate conditional quantiles. When a decision tree is fit, rather than storing only the sufficient statistics of the response variable at the leaf node, such as the mean and variance, all of the response values can be stored with the leaf node. At prediction time, these values can then be used to calculate empirical quantile estimates.
-
-The quantile-based approach can be extended to random forests. To estimate :math:`F(Y=y|x) = q`, each response value in the training set is given a weight or frequency. Formally, the weight or frequency given to the :math:`j`\th training sample, :math:`y_j`, while estimating the quantile is
-
-.. math::
-
-  \frac{1}{T} \sum_{t=1}^{T} \frac{\mathbb{1}(y_j \in L(x))}{\sum_{i=1}^N \mathbb{1}(y_i \in L(x))},
-
-where :math:`L(x)` denotes the leaf that :math:`x` falls into.
-
-Informally, this means that given a new unknown sample, we first find the leaf that it falls into for each tree in the ensemble. Each training sample :math:`y_j` that falls into the same leaf as the new sample is given a weight that equals the fraction of samples in the leaf. Each :math:`y_j` that does not fall into the same leaf as the new sample is given a weight or frequency of zero. The weights or frequencies for each :math:`y_j` are then summed or aggregated across all of the trees in the ensemble. This information can then be used to calculate the empirical quantile estimates.
-
-This approach was first proposed by :cite:t:`2006:meinshausen`.
+.. _user-guide-fit-predict:

 Fitting and Predicting
 ----------------------
@@ -171,52 +140,3 @@ The predictions of a standard random forest can also be recovered from a quantil
   >>> y_pred_qrf = qrf.predict(X_test, **kwargs)
   >>> np.allclose(y_pred_rf, y_pred_qrf)
   True
-
-Quantile Ranks
---------------
-
-The quantile rank is the fraction of scores in a frequency distribution that are less than (or equal to) that score. For a quantile forest, the frequency distribution is the set of training sample response values that are used to construct the empirical quantile estimates. The quantile rank of each sample is calculated by aggregating the response values from all of the training samples that share the same leaf node across all of the trees. The output quantile rank will be a value in the range [0, 1] for each test sample::
-
-  >>> from sklearn import datasets
-  >>> from sklearn.model_selection import train_test_split
-  >>> from quantile_forest import RandomForestQuantileRegressor
-  >>> X, y = datasets.load_diabetes(return_X_y=True)
-  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
-  >>> reg = RandomForestQuantileRegressor().fit(X_train, y_train)
-  >>> y_ranks = reg.quantile_ranks(X_test, y_test) # quantile ranks of y_test
-
-Out-of-bag (OOB) quantile ranks can be returned by specifying `oob_score = True`::
-
-  >>> y_ranks_oob = reg.quantile_ranks(X_train, y_train, oob_score=True)
-
-Proximity Counts
-----------------
-
-Proximity counts are counts of the number of times that two samples share a leaf node. When a test set is present, the proximity counts of each sample in the test set with each sample in the training set can be computed::
-
-  >>> from sklearn import datasets
-  >>> from sklearn.model_selection import train_test_split
-  >>> from quantile_forest import RandomForestQuantileRegressor
-  >>> X, y = datasets.load_diabetes(return_X_y=True)
-  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
-  >>> reg = RandomForestQuantileRegressor().fit(X_train, y_train)
-  >>> proximities = reg.proximity_counts(X_test) # proximity counts for X_test
-
-For each test sample, the method outputs a list of tuples of the training index and proximity count, listed in descending order by proximity count. For example, a test sample with an output of [(1, 5), (0, 3), (3, 1)], means that the test sample shared 5, 3, and 1 leaf nodes with the training samples that were (zero-)indexed as 1, 0, and 3 during model fitting, respectively.
-
-The maximum number of proximity counts output per test sample can be limited by specifying `max_proximities`::
-
-  >>> proximities = reg.proximity_counts(X_test, max_proximities=10)
-  >>> np.all([len(prox) <= 10 for prox in proximities])
-  True
-
-Out-of-bag (OOB) proximity counts can be returned by specifying `oob_score = True`::
-
-  >>> proximities = reg.proximity_counts(X_train, oob_score=True)
-
-.. toctree::
-  :maxdepth: 2
-  :caption: User Guide
-  :hidden:
-
-  User Guide
diff --git a/docs/user_guide/introduction.rst b/docs/user_guide/introduction.rst
new file mode 100755
index 0000000..8899b81
--- /dev/null
+++ b/docs/user_guide/introduction.rst
@@ -0,0 +1,39 @@
+.. _user-guide-intro:
+
+Introduction
+------------
+
+Random forests have proven to be very popular and powerful for regression and classification. For regression, random forests give an accurate approximation of the conditional mean of a response variable. That is, if we let :math:`Y` be a real-valued response variable and :math:`X` a covariate or predictor variable, they estimate :math:`E(Y | X)`, which can be interpreted as the expected value of the output :math:`Y` given the input :math:`X`.
+
+However random forests provide information about the full conditional distribution of the response variable, not only about the conditional mean. Quantile regression forests, a generalization of random forests, can be used to infer conditional quantiles. That is, they return :math:`y` at :math:`q` for which :math:`F(Y=y|X) = q`, where :math:`q` is the quantile.
+
+The quantiles give more complete information about the distribution of :math:`Y` as a function of the predictor variable :math:`X` than the conditional mean alone. They can be useful, for example, to build prediction intervals or to perform outlier detection in a high-dimensional dataset.
+
+In practice, the empirical estimation of quantiles can be calculated in several ways. In this package, a desired quantile is calculated from the input rank :math:`x` such that :math:`x = (N + 1 - 2C)q + C`, where :math:`q` is the quantile, :math:`N` is the number of samples, and :math:`C` is a constant (degree of freedom). In this package, :math:`C = 1`. This package provides methods that calculate quantiles using samples that are weighted and unweighted. In a weighted quantile, :math:`N` is calculated from the fraction of the total weight instead of the total number of samples.
+
+Quantile Regression Forests
+---------------------------
+
+A standard decision tree can be extended in a straightforward way to estimate conditional quantiles. When a decision tree is fit, rather than storing only the sufficient statistics of the response variable at the leaf node, such as the mean and variance, all of the response values can be stored with the leaf node. At prediction time, these values can then be used to calculate empirical quantile estimates.
+
+The quantile-based approach can be extended to random forests. To estimate :math:`F(Y=y|x) = q`, each response value in the training set is given a weight or frequency. Formally, the weight or frequency given to the :math:`j`\th training sample, :math:`y_j`, while estimating the quantile is
+
+.. math::
+
+  \frac{1}{T} \sum_{t=1}^{T} \frac{\mathbb{1}(y_j \in L(x))}{\sum_{i=1}^N \mathbb{1}(y_i \in L(x))},
+
+where :math:`L(x)` denotes the leaf that :math:`x` falls into.
+
+Informally, this means that given a new unknown sample, we first find the leaf that it falls into for each tree in the ensemble. Each training sample :math:`y_j` that falls into the same leaf as the new sample is given a weight that equals the fraction of samples in the leaf. Each :math:`y_j` that does not fall into the same leaf as the new sample is given a weight or frequency of zero. The weights or frequencies for each :math:`y_j` are then summed or aggregated across all of the trees in the ensemble. This information can then be used to calculate the empirical quantile estimates.
+
+This approach was first proposed by :cite:t:`2006:meinshausen`.
+
+.. toctree::
+  :maxdepth: 1
+  :caption: User Guide
+  :hidden:
+
+  self
+  fit_predict
+  quantile_ranks
+  proximities
diff --git a/docs/user_guide/proximities.rst b/docs/user_guide/proximities.rst
new file mode 100755
index 0000000..731fbe2
--- /dev/null
+++ b/docs/user_guide/proximities.rst
@@ -0,0 +1,26 @@
+.. _user-guide-proximities:
+
+Proximity Counts
+----------------
+
+Proximity counts are counts of the number of times that two samples share a leaf node. When a test set is present, the proximity counts of each sample in the test set with each sample in the training set can be computed::
+
+  >>> from sklearn import datasets
+  >>> from sklearn.model_selection import train_test_split
+  >>> from quantile_forest import RandomForestQuantileRegressor
+  >>> X, y = datasets.load_diabetes(return_X_y=True)
+  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
+  >>> reg = RandomForestQuantileRegressor().fit(X_train, y_train)
+  >>> proximities = reg.proximity_counts(X_test) # proximity counts for X_test
+
+For each test sample, the method outputs a list of tuples of the training index and proximity count, listed in descending order by proximity count. For example, a test sample with an output of [(1, 5), (0, 3), (3, 1)], means that the test sample shared 5, 3, and 1 leaf nodes with the training samples that were (zero-)indexed as 1, 0, and 3 during model fitting, respectively.
+
+The maximum number of proximity counts output per test sample can be limited by specifying `max_proximities`::
+
+  >>> proximities = reg.proximity_counts(X_test, max_proximities=10)
+  >>> all([len(prox) <= 10 for prox in proximities])
+  True
+
+Out-of-bag (OOB) proximity counts can be returned by specifying `oob_score = True`::
+
+  >>> proximities = reg.proximity_counts(X_train, oob_score=True)
diff --git a/docs/user_guide/quantile_ranks.rst b/docs/user_guide/quantile_ranks.rst
new file mode 100755
index 0000000..60b76bb
--- /dev/null
+++ b/docs/user_guide/quantile_ranks.rst
@@ -0,0 +1,18 @@
+.. _user-guide-quantile-ranks:
+
+Quantile Ranks
+--------------
+
+The quantile rank is the fraction of scores in a frequency distribution that are less than (or equal to) that score. For a quantile forest, the frequency distribution is the set of training sample response values that are used to construct the empirical quantile estimates. The quantile rank of each sample is calculated by aggregating the response values from all of the training samples that share the same leaf node across all of the trees. The output quantile rank will be a value in the range [0, 1] for each test sample::
+
+  >>> from sklearn import datasets
+  >>> from sklearn.model_selection import train_test_split
+  >>> from quantile_forest import RandomForestQuantileRegressor
+  >>> X, y = datasets.load_diabetes(return_X_y=True)
+  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
+  >>> reg = RandomForestQuantileRegressor().fit(X_train, y_train)
+  >>> y_ranks = reg.quantile_ranks(X_test, y_test) # quantile ranks of y_test
+
+Out-of-bag (OOB) quantile ranks can be returned by specifying `oob_score = True`::
+
+  >>> y_ranks_oob = reg.quantile_ranks(X_train, y_train, oob_score=True)
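
Note on the quantile-rank definition used in docs/user_guide/quantile_ranks.rst: the sentence "the fraction of scores in a frequency distribution that are less than (or equal to) that score" can be illustrated with a minimal standard-library sketch. This shows only the textbook definition. It is not the implementation behind `quantile_ranks`, which may weight samples or break ties differently. The helper name `quantile_rank` and the values in `leaf_values` are hypothetical and chosen for illustration.

```python
from bisect import bisect_right


def quantile_rank(train_values, score):
    """Return the fraction of training values that are <= score.

    Hypothetical helper illustrating the textbook definition only;
    it is not part of the quantile_forest API.
    """
    ordered = sorted(train_values)
    if not ordered:
        raise ValueError("train_values must not be empty")
    # bisect_right counts how many sorted values are <= score.
    return bisect_right(ordered, score) / len(ordered)


# Made-up response values standing in for the training samples that
# share leaf nodes with a test sample.
leaf_values = [10.0, 12.0, 15.0, 15.0, 20.0]
ranks = [quantile_rank(leaf_values, s) for s in (9.0, 15.0, 25.0)]
print(ranks)  # [0.0, 0.8, 1.0]
```

Every rank falls in [0, 1], matching the range stated in the added documentation: a score below all stored values ranks 0.0, and a score above all of them ranks 1.0.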