Review/generic loss questions #60
@@ -65,31 +65,51 @@ def calc_parameters(
float, float
    estimated parameters and their uncertainties
"""
sorting = feature.lex_binned_data.argsort()
sorted_bins = feature.lex_binned_data[sorting]
bins, split_indices = np.unique(sorted_bins, return_index=True)
split_indices = split_indices[1:]
# ! TODO: [Q] Why these operations?
# I probably do not understand the CB algo, a high-level explainer would be great
The difference from the other (older) CB modes is that we do the optimizations here numerically, by explicitly minimizing a loss, rather than analytically. But in principle, it is the same thing.
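As a rough illustration of "numerically rather than analytically" (a toy weighted squared loss, not the actual CB loss): the explicit numerical minimization recovers the same optimum as the closed-form solution, here the weighted mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1.0, 2.0, 4.0])
w = np.array([1.0, 1.0, 2.0])

# toy loss: weighted squared error of a single constant parameter p
loss = lambda p: np.sum(w * (y - p) ** 2)

numerical = minimize_scalar(loss).x    # explicit numerical minimization
analytical = np.average(y, weights=w)  # closed-form optimum (weighted mean)
# both give 2.75
```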
sorting = feature.lex_binned_data.argsort()  # 1. get the element indices that order the data from smallest to greatest
correct
sorted_bins = feature.lex_binned_data[sorting]  # 2. return the bins sorted from smallest to greatest
correct
# do not quite understand how this works,
# as my example with a3 = np.random.rand(3, 10) and a3[a3.argsort()] was returning an IndexError
lex_binned_data is just a vector; fancy indexing with argsort only works like this in one dimension.
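For illustration, the same indexing that fails for the 2-D example works for a vector:

```python
import numpy as np

v = np.array([30, 10, 20])  # 1-D, like lex_binned_data
v[v.argsort()]              # array([10, 20, 30])

a3 = np.random.rand(3, 10)
# a3[a3.argsort()] raises IndexError: the (3, 10) index array contains
# values up to 9, which get applied to the first axis of length 3
```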
bins, split_indices = np.unique(
    sorted_bins, return_index=True
)  # 3. return only the unique bin values, ordered
This returns the unique values (only needed for the special case of empty bins in multi-dimensional features) and their indices. The latter are needed to split all the target and prediction values into the different bins.
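For example:

```python
import numpy as np

sorted_bins = np.array([0, 0, 1, 1, 1, 3])
bins, split_indices = np.unique(sorted_bins, return_index=True)
# bins          -> array([0, 1, 3])  unique bin labels (bin 2 is empty here)
# split_indices -> array([0, 2, 5])  index of the first element of each bin
```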
split_indices = split_indices[1:]  # 5. drop the leading zero index
We are looking for bin boundaries: np.split expects only the interior cut points, and keeping the leading 0 would produce an empty first split.
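Concretely:

```python
import numpy as np

data = np.array([10, 11, 20, 21, 22, 30])
np.split(data, [2, 5])     # [array([10, 11]), array([20, 21, 22]), array([30])]
np.split(data, [0, 2, 5])  # same, plus a useless empty array at the front
```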
y_pred = np.hstack((y[..., np.newaxis], self.unlink_func(pred.predict_link())[..., np.newaxis]))
# 6. joining the values of the target variable with those of the predictions
correct
y_pred = np.hstack((y_pred, self.weights[..., np.newaxis]))
# 7. joining the previous matrix with the weights (of each input variable?)
correct
y_pred_bins = np.split(y_pred[sorting], split_indices)
# 8. sort the predictions according to the bins (of the input variable?) and split this into bins
Yes, split the target and prediction values into the bins of the feature considered here. This is done to perform independent optimizations in what follows.
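A toy version of steps 6 to 8, with made-up numbers just to show the shapes:

```python
import numpy as np

y       = np.array([1.0, 2.0, 3.0, 4.0])  # target
y_hat   = np.array([1.1, 1.9, 3.2, 3.8])  # unlinked predictions
weights = np.array([1.0, 1.0, 2.0, 1.0])
binned  = np.array([1, 0, 1, 0])          # stand-in for lex_binned_data

y_pred = np.hstack((y[..., np.newaxis], y_hat[..., np.newaxis], weights[..., np.newaxis]))
sorting = binned.argsort()
# split_indices from steps 3 to 5 would be [2] here:
# one (n_samples_in_bin, 3) array of (y, y_hat, w) rows per bin
y_pred_bins = np.split(y_pred[sorting], [2])
```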
# keep potential empty bins in multi-dimensional features
all_bins = range(max(feature.lex_binned_data) + 1)
empty_bins = list(set(bins) ^ set(all_bins))
empty_bins = set(bins) ^ set(all_bins)
# 9. returns the elements which are either in set(bins)
# or set(all_bins).
This is an exclusive or. The idea is to find empty bins: these are in all_bins but not in bins. Empty bins can occur in multi-dimensional features (which are mapped to a one-dimensional structure before this function).
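In set terms:

```python
bins = {0, 1, 3}              # bins that actually received samples
all_bins = set(range(4))      # {0, 1, 2, 3}
empty_bins = bins ^ all_bins  # {2}; since all_bins is a superset, XOR equals the difference
```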
# ! TODO: list can be removed as only iterator is used below
fair enough
# why does this return the empty bins though?
see comment above about multi-dimensional features
# because all_bins is a superset of sorted_bins, this is tantamount to finding the values
# which are not in bins. bins returns a list of all the values
# check, for example, a5 = np.array([[i*j + 1 for i in range(0, 3)] for j in range(0, 3)])
# bins, split_indices = np.unique(a5, return_index=True)
The point is, you do not find empty bins in lex_binned_data. So there can be multi-dimensional bins which we would miss here. But we have to include them so as not to mess up the multi-dimensional binning structure.
for i in empty_bins:
    y_pred_bins.insert(i, np.zeros((0, 3)))
    y_pred_bins.insert(i, np.zeros((0, 3)))  # ! TODO: [Q] Is the (0, 3) format due to (y, y_hat, weights)?
yes
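The (0, 3) shape keeps the column structure of (y, y_hat, w) even for empty bins, so the later column slicing still works:

```python
import numpy as np

empty = np.zeros((0, 3))  # zero rows, three columns: (y, y_hat, w)
empty[:, 0]               # array([], dtype=float64), no IndexError downstream
```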
n_bins = len(y_pred_bins)
parameters = np.zeros(n_bins)
uncertainties = np.zeros(n_bins)

# 10. Try to minimize a loss function given y, y_pred and the weights?
Correct, but this is done independently for each bin of the feature at hand.
for bin in range(n_bins):
    parameters[bin], uncertainties[bin] = self.optimization(
        y_pred_bins[bin][:, 0], y_pred_bins[bin][:, 1], y_pred_bins[bin][:, 2]
    )
# ! TODO: What parameters are being returned?
The parameters returned are the factors (or summands for additive CB modes) of the different bins and their uncertainties (needed for the smoothing).
neutral_factor = self.unlink_func(np.array(self.neutral_factor_link))
# 11. if there is one more bin corresponding to the neutral factor, then add it to the parameters
correct
if n_bins + 1 == feature.n_bins:
    parameters = np.append(parameters, neutral_factor)
    uncertainties = np.append(uncertainties, 0)
@@ -404,7 +424,7 @@ def quantile_global_scale(
    weights: np.ndarray,
    prior_prediction_column: Union[str, int, None],
    link_func,
) -> None:
) -> Tuple:
right
"""
Calculation of the global scale for quantile regression, corresponding
to the (continuous approximation of the) respective quantile of the
The general idea is to do an independent optimization in each bin of the feature at hand; the outcome of each of these optimizations is the factor (or summand for additive CB modes) and its uncertainty.
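A minimal sketch of that per-bin loop. The weighted squared loss, the multiplicative ansatz, and the standard-error-style uncertainty are illustrative assumptions, standing in for the actual self.optimization, not reproducing it:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def optimize_bin(y, y_hat, w):
    """Fit one factor for a single bin by explicit numerical minimization."""
    # assumed loss for illustration: weighted squared error of y against p * y_hat
    loss = lambda p: np.sum(w * (y - p * y_hat) ** 2)
    parameter = minimize_scalar(loss).x
    uncertainty = np.sqrt(1.0 / np.sum(w))  # placeholder uncertainty estimate
    return parameter, uncertainty

# independent optimization in each bin, as in calc_parameters
y_pred_bins = [
    np.array([[1.0, 0.9, 1.0], [2.0, 2.2, 1.0]]),  # bin 0: rows of (y, y_hat, w)
    np.array([[3.0, 3.1, 2.0]]),                   # bin 1
]
parameters = np.zeros(len(y_pred_bins))
uncertainties = np.zeros(len(y_pred_bins))
for b, ypb in enumerate(y_pred_bins):
    parameters[b], uncertainties[b] = optimize_bin(ypb[:, 0], ypb[:, 1], ypb[:, 2])
```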