Commit

deploy: 9fccd85
reidjohnson committed Aug 29, 2024
1 parent e977f1e commit 9acae39
Showing 9 changed files with 80 additions and 81 deletions.
8 changes: 4 additions & 4 deletions _sources/gallery/plot_huggingface_model.rst.txt
@@ -114,8 +114,8 @@ of each sample. The model used is available on Hugging Face Hub
<summary> Click to expand </summary>

```python
import pickle
with open(qrf_pkl_filename, 'rb') as file:
import pickle
with open(qrf_pkl_filename, 'rb') as file:
    qrf = pickle.load(file)
```

@@ -332,8 +332,8 @@ of each sample. The model used is available on Hugging Face Hub
<summary> Click to expand </summary>
```python
import pickle
with open(qrf_pkl_filename, 'rb') as file:
import pickle
with open(qrf_pkl_filename, 'rb') as file:
    qrf = pickle.load(file)
```
56 changes: 28 additions & 28 deletions _sources/user_guide/fit_predict.rst.txt
@@ -15,52 +15,52 @@ Let's fit a quantile forest on a simple regression dataset::
>>> from quantile_forest import RandomForestQuantileRegressor
>>> X, y = datasets.load_diabetes(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
>>> reg = RandomForestQuantileRegressor()
>>> reg.fit(X_train, y_train)
>>> qrf = RandomForestQuantileRegressor()
>>> qrf.fit(X_train, y_train)
RandomForestQuantileRegressor(...)

During model initialization, the parameter `max_samples_leaf` can be specified, which determines the maximum number of samples per leaf node to retain. If `max_samples_leaf` is smaller than the number of samples in a given leaf node, then a subset of values are randomly selected. By default, the model retains one randomly selected sample per leaf node (`max_samples_leaf = 1`), which enables the use of optimizations at prediction time that are not available when a variable number of samples may be retained per leaf. All samples can be retained by specifying `max_samples_leaf = None`. Note that the number of retained samples can materially impact the size of the model object.
During model initialization, the parameter `max_samples_leaf` can be specified, which determines the maximum number of samples per leaf node to retain. If `max_samples_leaf` is smaller than the number of samples in a given leaf node, then a subset of values are randomly selected. By default, the model retains one randomly selected sample per leaf node (`max_samples_leaf=1`), which enables the use of optimizations at prediction time that are not available when a variable number of samples may be retained per leaf. All samples can be retained by specifying `max_samples_leaf=None`. Note that the number of retained samples can materially impact the size of the model object.
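
The retention rule described above can be illustrated with a small pure-Python sketch. This is not the library's implementation: `retain_leaf_samples` and the `leaves` mapping (leaf id to training indices) are hypothetical names introduced only for illustration.

```python
import random


def retain_leaf_samples(leaves, max_samples_leaf=1, seed=0):
    """Keep at most `max_samples_leaf` training indices per leaf (hypothetical helper)."""
    rng = random.Random(seed)
    retained = {}
    for leaf_id, indices in leaves.items():
        if max_samples_leaf is None or len(indices) <= max_samples_leaf:
            # The leaf is small enough (or no limit is set): keep every sample.
            retained[leaf_id] = list(indices)
        else:
            # Otherwise keep a random subset of the requested size.
            retained[leaf_id] = rng.sample(list(indices), max_samples_leaf)
    return retained


# Toy mapping of leaf id -> training indices that fell into that leaf.
leaves = {0: [3, 7, 9, 12], 1: [1, 4], 2: [5]}
one_per_leaf = retain_leaf_samples(leaves, max_samples_leaf=1)
all_samples = retain_leaf_samples(leaves, max_samples_leaf=None)
```

With one sample per leaf, the stored structure has a fixed size per leaf, which is the property the paragraph above credits for the prediction-time optimizations. With `None`, storage grows with the training set.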

Making Predictions
~~~~~~~~~~~~~~~~~~

A notable advantage of quantile forests is that they can be fit once, while arbitrary quantiles can be estimated at prediction time. Accordingly, since the quantiles can be specified at prediction time, the model accepts an optional parameter during the call to the `predict` method, which can be a float or list of floats that specify the empirical quantiles to return::

>>> y_pred = reg.predict(X_test, quantiles=[0.25, 0.5, 0.75])
>>> y_pred = qrf.predict(X_test, quantiles=[0.25, 0.5, 0.75])
>>> y_pred.shape[1]
3

If the `predict` method is called without quantiles, the prediction defaults to the empirical median (`quantiles = 0.5`)::
If the `predict` method is called without quantiles, the prediction defaults to the empirical median (`quantiles=0.5`)::

>>> y_pred = reg.predict(X_test) # returns empirical median prediction
>>> y_pred = qrf.predict(X_test) # returns empirical median prediction

If the `predict` method is explicitly called with `quantiles = "mean"`, the prediction returns the empirical mean::
If the `predict` method is explicitly called with `quantiles="mean"`, the prediction returns the empirical mean::

>>> y_pred = reg.predict(X_test, quantiles="mean") # returns mean prediction
>>> y_pred = qrf.predict(X_test, quantiles="mean") # returns mean prediction

Default quantiles can be specified at model initialization using the `default_quantiles` parameter:

>>> reg = RandomForestQuantileRegressor(default_quantiles=[0.25, 0.5, 0.75])
>>> reg.fit(X_train, y_train)
>>> qrf = RandomForestQuantileRegressor(default_quantiles=[0.25, 0.5, 0.75])
>>> qrf.fit(X_train, y_train)
RandomForestQuantileRegressor(default_quantiles=[0.25, 0.5, 0.75])
>>> y_pred = reg.predict(X_test) # predicts using the default quantiles
>>> y_pred = qrf.predict(X_test) # predicts using the default quantiles
>>> y_pred.ndim == 2
True
>>> y_pred.shape[1] == 3
True

The default quantiles can be overwritten at prediction time by specifying a value for `quantiles`:

>>> reg = RandomForestQuantileRegressor(default_quantiles=[0.25, 0.5, 0.75])
>>> reg.fit(X_train, y_train)
>>> qrf = RandomForestQuantileRegressor(default_quantiles=[0.25, 0.5, 0.75])
>>> qrf.fit(X_train, y_train)
RandomForestQuantileRegressor(default_quantiles=[0.25, 0.5, 0.75])
>>> y_pred = reg.predict(X_test, quantiles=0.5) # uses override quantiles
>>> y_pred = qrf.predict(X_test, quantiles=0.5) # uses override quantiles
>>> y_pred.ndim == 1
True

The output of the `predict` method is an array with one column for each specified quantile or a single column if no quantiles are specified. The order of the output columns corresponds to the order of the quantiles, which can be specified in any order (i.e., they do not need to be monotonically ordered)::

>>> y_pred = reg.predict(X_test, quantiles=[0.5, 0.25, 0.75])
>>> y_pred = qrf.predict(X_test, quantiles=[0.5, 0.25, 0.75])
>>> bool((y_pred[:, 0] >= y_pred[:, 1]).all())
True

@@ -71,47 +71,47 @@ Multi-target quantile regression is also supported. If the target values are mul
>>> X, y = datasets.make_regression(n_samples=100, n_targets=2, random_state=0)
>>> y.shape
(100, 2)
>>> reg_multi = RandomForestQuantileRegressor()
>>> reg_multi.fit(X, y)
>>> qrf_multi = RandomForestQuantileRegressor()
>>> qrf_multi.fit(X, y)
RandomForestQuantileRegressor()
>>> quantiles = [0.25, 0.5, 0.75]
>>> y_pred = reg_multi.predict(X, quantiles=quantiles)
>>> y_pred = qrf_multi.predict(X, quantiles=quantiles)
>>> y_pred.ndim == 3
True
>>> y_pred.shape[0] == len(X)
True
>>> y_pred.shape[-1] == len(quantiles)
>>> y_pred.shape[2] == len(quantiles)
True
>>> y_pred.shape[1] == y.shape[1]
>>> y_pred.shape[1] == y.shape[1] # number of targets
True

Quantile Weighting
~~~~~~~~~~~~~~~~~~

By default, the predict method calculates quantiles using a weighted quantile method (`weighted_quantile = True`), which assigns a weight to each sample in the training set based on the number of times that it co-occurs in the same leaves as the test sample. When the number of samples in the training set is larger than the expected size of this list (i.e., :math:`n_{train} \gg n_{trees} \cdot n_{leaves} \cdot n_{leafsamples}`), it can be more efficient to calculate an unweighted quantile (`weighted_quantile = False`), which aggregates the list of training `y` values for each leaf node to which the test sample belongs across all trees. For a given input, both methods can return the same output values::
By default, the predict method calculates quantiles using a weighted quantile method (`weighted_quantile=True`), which assigns a weight to each sample in the training set based on the number of times that it co-occurs in the same leaves as the test sample. When the number of samples in the training set is larger than the expected number of co-occurring samples across all trees, it can be more efficient to calculate an unweighted quantile (`weighted_quantile=False`), which aggregates a list of training `y` values for each leaf node to which the test sample belongs across all trees. For a given input, both methods can return the same output values::

>>> import numpy as np
>>> y_pred_weighted = reg.predict(X_test, weighted_quantile=True)
>>> y_pred_unweighted = reg.predict(X_test, weighted_quantile=False)
>>> y_pred_weighted = qrf.predict(X_test, weighted_quantile=True)
>>> y_pred_unweighted = qrf.predict(X_test, weighted_quantile=False)
>>> np.allclose(y_pred_weighted, y_pred_unweighted)
True

By default, the predict method calculates quantiles by giving each sample in a leaf (including repeated bootstrap samples) equal weight (`weighted_leaves = False`). If `weighted_leaves = True`, each sample will be weighted inversely according to the size of its leaf node. Note that this leaf-based weighting can only be used with weighted quantiles.
By default, the predict method calculates quantiles by giving each sample in a leaf (including repeated bootstrap samples) equal weight (`weighted_leaves=False`). If `weighted_leaves=True`, each sample will be weighted inversely according to the size of its leaf node. Note that this leaf-based weighting can only be used with weighted quantiles.
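
To see why the weighted and unweighted methods can agree, the sketch below compares them on a toy example. It assumes a simple inverse-CDF quantile definition (no interpolation), which may differ from what the library uses. The function names and the toy `leaves_per_tree` data are hypothetical.

```python
import math
from collections import Counter


def unweighted_quantile(values, q):
    """Smallest value with at least a fraction q of the list at or below it."""
    ordered = sorted(values)
    k = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[k]


def weighted_quantile(values, weights, q):
    """Same definition, with each value counted `weight` times."""
    total = sum(weights)
    cumulative = 0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= q * total:
            return value
    return max(values)


y_train = [10.0, 12.0, 15.0, 11.0, 20.0]
# Training indices sharing a leaf with one test sample, one list per tree.
leaves_per_tree = [[0, 3], [0, 1], [3, 4]]

# Unweighted: aggregate the y values from every matching leaf.
pooled = [y_train[i] for leaf in leaves_per_tree for i in leaf]

# Weighted: one weight per training sample, equal to its co-occurrence count.
counts = Counter(i for leaf in leaves_per_tree for i in leaf)
values = [y_train[i] for i in counts]
weights = [counts[i] for i in counts]

results = [
    (unweighted_quantile(pooled, q), weighted_quantile(values, weights, q))
    for q in (0.25, 0.5, 0.75)
]
```

The pooled list has one entry per leaf membership, while the weighted form has one entry per distinct training sample. Which is cheaper therefore depends on how the training set size compares with the total number of retained leaf samples across trees.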

Out-of-Bag Estimation
~~~~~~~~~~~~~~~~~~~~~

Out-of-bag (OOB) predictions can be returned by specifying `oob_score = True`::
Out-of-bag (OOB) predictions can be returned by specifying `oob_score=True`::

>>> y_pred_oob = reg.predict(X_train, quantiles=[0.5], oob_score=True)
>>> y_pred_oob = qrf.predict(X_train, quantiles=0.5, oob_score=True)

By default, when the `predict` method is called with the OOB flag set to True, it assumes that the input samples are the training samples, arranged in the same order as during model fitting. It accepts an optional parameter that can be used to specify the training index of each input sample, with -1 used to specify non-training samples that can in effect be scored in-bag (IB) during the same call::

>>> import numpy as np
>>> X_mixed = np.concatenate([X_train, X_test])
>>> indices = np.concatenate([np.arange(len(X_train)), np.full(len(X_test), -1)])
>>> kwargs = {"oob_score": True, "indices": indices}
>>> y_pred_mix = reg.predict(X_mixed, quantiles=[0.25, 0.5, 0.75], **kwargs)
>>> y_pred_mix = qrf.predict(X_mixed, quantiles=[0.25, 0.5, 0.75], **kwargs)
>>> y_pred_train_oob = y_pred_mix[:len(X_train)] # training predictions are OOB
>>> y_pred_test = y_pred_mix[-len(X_test):] # new test data predictions are IB
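
The role of the `-1` index can be sketched conceptually as follows. This is only an illustration of out-of-bag versus in-bag scoring, not the library's internals. The helper `trees_used_for_scoring` and the toy `bootstrap_indices` are hypothetical.

```python
def trees_used_for_scoring(bootstrap_indices, sample_index):
    """Return the ids of the trees allowed to score a sample (hypothetical helper).

    A training sample (index >= 0) is scored only by trees whose bootstrap
    draw did not include it. An index of -1 marks a non-training sample,
    which every tree may score.
    """
    if sample_index == -1:
        return list(range(len(bootstrap_indices)))
    return [
        tree_id
        for tree_id, drawn in enumerate(bootstrap_indices)
        if sample_index not in drawn
    ]


# Training indices drawn by each of three trees (toy bootstrap samples).
bootstrap_indices = [{0, 2}, {1, 2}, {0, 1}]

n_train, n_test = 3, 2
# Training samples keep their fitting order; test samples are marked with -1.
indices = list(range(n_train)) + [-1] * n_test
trees_per_sample = [trees_used_for_scoring(bootstrap_indices, i) for i in indices]
```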

@@ -120,7 +120,7 @@ This allows all samples, both from the training and test sets, to be scored with
Random Forest Predictions
~~~~~~~~~~~~~~~~~~~~~~~~~

The predictions of a standard random forest can also be recovered from a quantile forest without retraining by passing `quantiles = "mean"` and `aggregate_leaves_first = False`, the latter which specifies a Boolean flag to average the leaf values before aggregating the leaves across trees. This configuration essentially replicates the prediction process used by a standard random forest regressor, which is an averaging of mean leaf values across trees::
The predictions of a standard random forest can also be recovered from a quantile forest without retraining when initialized with `max_samples_leaf=None`. This can be accomplished at inference time by passing `quantiles="mean"` (or `quantiles=0.5` if the model was specifically fitted with `criterion="absolute_error"`) and `aggregate_leaves_first=False`, the latter which specifies a Boolean flag to average the leaf values before aggregating the leaves across trees. This configuration essentially replicates the prediction process used by a standard random forest regressor, which is an averaging of mean (or median) leaf values across trees::

>>> import numpy as np
>>> from sklearn import datasets
10 changes: 5 additions & 5 deletions _sources/user_guide/proximities.rst.txt
@@ -10,17 +10,17 @@ Proximity counts are counts of the number of times that two samples share a leaf
>>> from quantile_forest import RandomForestQuantileRegressor
>>> X, y = datasets.load_diabetes(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
>>> reg = RandomForestQuantileRegressor().fit(X_train, y_train)
>>> proximities = reg.proximity_counts(X_test) # proximity counts for X_test
>>> qrf = RandomForestQuantileRegressor().fit(X_train, y_train)
>>> proximities = qrf.proximity_counts(X_test) # proximity counts for test data

For each test sample, the method outputs a list of tuples of the training index and proximity count, listed in descending order by proximity count. For example, a test sample with an output of [(1, 5), (0, 3), (3, 1)], means that the test sample shared 5, 3, and 1 leaf nodes with the training samples that were (zero-)indexed as 1, 0, and 3 during model fitting, respectively.

The maximum number of proximity counts output per test sample can be limited by specifying `max_proximities`::

>>> proximities = reg.proximity_counts(X_test, max_proximities=10)
>>> proximities = qrf.proximity_counts(X_test, max_proximities=10)
>>> all([len(prox) <= 10 for prox in proximities])
True
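
The output format described above can be reproduced with a short standard-library sketch. It is a conceptual illustration rather than the library's implementation. `proximity_counts_for_sample` and the toy leaf memberships are hypothetical.

```python
from collections import Counter


def proximity_counts_for_sample(leaves_per_tree, max_proximities=None):
    """Count how often each training index shares a leaf with one test sample.

    `leaves_per_tree` holds, for each tree, the training indices found in the
    leaf that the test sample falls into (hypothetical input format).
    """
    counts = Counter(i for leaf in leaves_per_tree for i in leaf)
    # most_common sorts by descending count and truncates when a limit is given.
    return counts.most_common(max_proximities)


# Five trees; training sample 1 shares a leaf in all five, sample 0 in three,
# and sample 3 in one.
leaves_per_tree = [[1, 0], [1, 0], [1, 0, 3], [1], [1]]

proximities = proximity_counts_for_sample(leaves_per_tree)
top_two = proximity_counts_for_sample(leaves_per_tree, max_proximities=2)
```

Here `proximities` matches the example output of `[(1, 5), (0, 3), (3, 1)]` given in the text, and `top_two` keeps only the two highest counts.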

Out-of-bag (OOB) proximity counts can be returned by specifying `oob_score = True`::
Out-of-bag (OOB) proximity counts can be returned by specifying `oob_score=True`::

>>> proximities = reg.proximity_counts(X_train, oob_score=True)
>>> proximities = qrf.proximity_counts(X_train, oob_score=True)
8 changes: 4 additions & 4 deletions _sources/user_guide/quantile_ranks.rst.txt
@@ -10,9 +10,9 @@ The quantile rank is the fraction of scores in a frequency distribution that are
>>> from quantile_forest import RandomForestQuantileRegressor
>>> X, y = datasets.load_diabetes(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
>>> reg = RandomForestQuantileRegressor().fit(X_train, y_train)
>>> y_ranks = reg.quantile_ranks(X_test, y_test) # quantile ranks of y_test
>>> qrf = RandomForestQuantileRegressor().fit(X_train, y_train)
>>> y_ranks = qrf.quantile_ranks(X_test, y_test) # quantile ranks for test data

Out-of-bag (OOB) quantile ranks can be returned by specifying `oob_score = True`::
Out-of-bag (OOB) quantile ranks can be returned by specifying `oob_score=True`::

>>> y_ranks_oob = reg.quantile_ranks(X_train, y_train, oob_score=True)
>>> y_ranks_oob = qrf.quantile_ranks(X_train, y_train, oob_score=True)
2 changes: 1 addition & 1 deletion _static/_image_hashes.json
@@ -1 +1 @@
{"plot_quantile_interpolation.png": "64403bde568aefd4126ce9afd13bdf18", "plot_predict_custom.png": "d93bb87e4412de61511ec04e7cfc57cc", "plot_quantile_extrapolation.png": "df5cd201a56427aecdd2b0fb67383b1e", "plot_quantile_multioutput.png": "a7db7a29994b823fbd5a7a3ea89e31b2", "plot_quantile_example.png": "56f2d452901be0aaa61cae8fdd382677", "plot_quantile_conformalized.png": "25fb11140f72b784df7c81538d28b4bc", "plot_quantile_intervals.png": "31f06cdda63b101d5d4cd7bb5c7242d1", "plot_quantile_vs_standard.png": "a7e09a7c286249020edb212a8c8964e5", "plot_treeshap_example.png": "390c464d8dd7b212f8bfe64e9e5bbf62", "plot_proximity_counts.png": "c3014295e7d995861eb4e1c2653dd9e4", "plot_quantile_ranks.png": "2dc7135b0065af3b72770ab39ce0aa6a", "plot_huggingface_model.png": "e55a6128dcf1aa3b145342f8a347edbd"}
{"plot_quantile_interpolation.png": "64403bde568aefd4126ce9afd13bdf18", "plot_predict_custom.png": "d93bb87e4412de61511ec04e7cfc57cc", "plot_quantile_extrapolation.png": "df5cd201a56427aecdd2b0fb67383b1e", "plot_quantile_multioutput.png": "a7db7a29994b823fbd5a7a3ea89e31b2", "plot_quantile_example.png": "56f2d452901be0aaa61cae8fdd382677", "plot_quantile_conformalized.png": "25fb11140f72b784df7c81538d28b4bc", "plot_quantile_intervals.png": "31f06cdda63b101d5d4cd7bb5c7242d1", "plot_quantile_vs_standard.png": "a7e09a7c286249020edb212a8c8964e5", "plot_treeshap_example.png": "390c464d8dd7b212f8bfe64e9e5bbf62", "plot_proximity_counts.png": "c3014295e7d995861eb4e1c2653dd9e4", "plot_quantile_ranks.png": "2dc7135b0065af3b72770ab39ce0aa6a", "plot_huggingface_model.png": "c87554d2fada2c6debe8c18c118efff8"}
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.
