Review/generic loss questions #60
Conversation
sorted_bins = feature.lex_binned_data[sorting]
bins, split_indices = np.unique(sorted_bins, return_index=True)
split_indices = split_indices[1:]
# ! TODO: [Q] Why these operations?
The general idea is to do an independent optimization in each bin of the feature at hand. The outcome of each of these optimizations is the factor (or summand, for additive CB modes) and its uncertainty.
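A minimal sketch of this idea on toy data, with a simple mean ratio standing in for the actual loss minimization (all names here are illustrative, not the library's API):

```python
import numpy as np

# Toy data: one binned feature, targets y, current predictions pred.
binned = np.array([1, 0, 2, 1, 0, 2, 2])
y = np.array([2.0, 1.0, 6.0, 4.0, 3.0, 9.0, 3.0])
pred = np.array([1.0, 1.0, 3.0, 1.0, 1.0, 3.0, 3.0])

sorting = binned.argsort()
sorted_bins = binned[sorting]
bins, split_indices = np.unique(sorted_bins, return_index=True)
split_indices = split_indices[1:]  # np.split expects interior boundaries only

# Independent "optimization" per bin: here simply the mean ratio y/pred,
# standing in for the explicit loss minimization of the real code.
y_bins = np.split(y[sorting], split_indices)
pred_bins = np.split(pred[sorting], split_indices)
factors = [np.mean(yb / pb) for yb, pb in zip(y_bins, pred_bins)]
# factors -> [2.0, 3.0, 2.0], one multiplicative factor per bin
```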
bins, split_indices = np.unique(sorted_bins, return_index=True)
split_indices = split_indices[1:]
# ! TODO: [Q] Why these operations?
# I probably do not understand the CB algo, high level explainer would be great
The difference from the other (older) CB modes is that here we perform the optimizations numerically, by explicitly minimizing a loss, rather than analytically. But in principle it is the same thing.
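A toy contrast of the two approaches, assuming a squared loss (the loss actually minimized here may differ): for sum((y - f*pred)^2) the factor has the closed form sum(y*pred)/sum(pred^2), and a plain gradient descent recovers the same value numerically:

```python
import numpy as np

y = np.array([2.0, 4.0, 6.0])
pred = np.array([1.0, 2.0, 3.0])

# Analytic: minimizer of sum((y - f * pred)**2) in closed form.
f_analytic = np.sum(y * pred) / np.sum(pred * pred)

# Numeric: minimize the same loss explicitly via gradient descent.
f = 0.0
for _ in range(500):
    grad = -2.0 * np.sum(pred * (y - f * pred))
    f -= 0.01 * grad
# both converge to the same factor, here 2.0
```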
split_indices = split_indices[1:]
# ! TODO: [Q] Why these operations?
# I probably do not understand the CB algo, high level explainer would be great
sorting = feature.lex_binned_data.argsort()  # 1. get element index row-wise ordered from smallest to greatest
correct
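For a 1-D array of bin labels this looks like:

```python
import numpy as np

binned = np.array([2, 0, 1, 0, 2])
sorting = binned.argsort()      # indices that would sort the array
sorted_bins = binned[sorting]
# sorted_bins -> [0, 0, 1, 2, 2]
```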
# ! TODO: [Q] Why these operations?
# I probably do not understand the CB algo, high level explainer would be great
sorting = feature.lex_binned_data.argsort()  # 1. get element index row-wise ordered from smallest to greatest
sorted_bins = feature.lex_binned_data[sorting]  # 2. return the bins sorted from smallest to greatest
correct
# as my example with a3=np.random.rand(3,10) and a3[a3.argsort()] was returning an IndexError
bins, split_indices = np.unique(
    sorted_bins, return_index=True
)  # 3. return only the unique values for each bin ordered
This returns the unique values (only needed for the special case of empty bins in multi-dimensional features) and their indices. The latter are needed to split all the target and prediction values into the different bins.
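For example, with a gap in the (hypothetical) bin labels:

```python
import numpy as np

sorted_bins = np.array([0, 0, 1, 1, 1, 3, 3])
bins, split_indices = np.unique(sorted_bins, return_index=True)
# bins          -> [0, 1, 3]   (bin 2 is empty and simply absent)
# split_indices -> [0, 2, 5]   (first occurrence of each bin)
```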
bins, split_indices = np.unique(
    sorted_bins, return_index=True
)  # 3. return only the unique values for each bin ordered
split_indices = split_indices[1:]  # 5. drop the zero index bin
We are looking for bin ranges.
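This matters because np.split takes the interior boundaries between chunks; keeping the leading 0 returned by np.unique would prepend a spurious empty chunk:

```python
import numpy as np

sorted_vals = np.array([10.0, 11.0, 20.0, 21.0, 22.0, 30.0])
split_indices = np.array([0, 2, 5])        # first index of each bin
chunks = np.split(sorted_vals, split_indices[1:])
# chunks -> [array([10., 11.]), array([20., 21., 22.]), array([30.])]
# without the [1:], np.split would also emit an empty leading array
```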
y_pred = np.hstack((y[..., np.newaxis], self.unlink_func(pred.predict_link())[..., np.newaxis]))
# 6. joining the values of the target variable with those of the predictions
correct
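The np.newaxis trick turns each 1-D vector into a column so np.hstack yields one row per sample; a small illustration with made-up values:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])
pred = np.array([0.9, 2.1, 2.7])
y_pred = np.hstack((y[..., np.newaxis], pred[..., np.newaxis]))
# shape (3, 2): one row per sample, columns = (target, prediction)
```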
y_pred = np.hstack((y_pred, self.weights[..., np.newaxis]))
# 7. joining the previous matrix with the weights (of each input variable?)
correct
y_pred_bins = np.split(y_pred[sorting], split_indices)
# 8. sort the predictions according to the bins (of the input variable?) and split this into bins
Yes, split the target and prediction values into the bins of the feature considered here. This is done to perform independent optimizations in what follows.
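Putting the pieces together on toy data (illustrative names, not the library's API):

```python
import numpy as np

binned = np.array([1, 0, 1, 0])
y = np.array([4.0, 1.0, 5.0, 2.0])
pred = np.array([3.0, 1.0, 3.0, 1.0])
weights = np.ones(4)

# Stack (target, prediction, weight) as columns, one row per sample.
y_pred = np.hstack((y[:, None], pred[:, None], weights[:, None]))  # (4, 3)

sorting = binned.argsort()
sorted_bins = binned[sorting]
_, split_indices = np.unique(sorted_bins, return_index=True)
y_pred_bins = np.split(y_pred[sorting], split_indices[1:])
# y_pred_bins[0] holds all (y, pred, weight) rows of bin 0, etc.
```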
for bin in range(n_bins):
    parameters[bin], uncertainties[bin] = self.optimization(
        y_pred_bins[bin][:, 0], y_pred_bins[bin][:, 1], y_pred_bins[bin][:, 2]
    )
# ! TODO: What parameters are being returned?
The parameters returned are the factors (or summands for additive CB modes) of the different bins, together with their uncertainties (needed for the smoothing).
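A hypothetical stand-in for such a per-bin optimization, returning a factor and its uncertainty (here a weighted mean of y/pred and its standard error; the real loss minimization differs):

```python
import numpy as np

def toy_optimization(y, pred, weights):
    """Illustrative per-bin optimization: weighted mean of y/pred as the
    factor, weighted standard error of that mean as its uncertainty."""
    ratios = y / pred
    factor = np.average(ratios, weights=weights)
    var = np.average((ratios - factor) ** 2, weights=weights)
    uncertainty = np.sqrt(var / len(y))
    return factor, uncertainty

f, u = toy_optimization(np.array([2.0, 4.0]), np.array([1.0, 2.0]), np.ones(2))
# f -> 2.0 (both ratios equal), u -> 0.0
```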
neutral_factor = self.unlink_func(np.array(self.neutral_factor_link))
# 11. if there is one more bin corresponding to the neutral factor, then add it to the parameters
correct
@@ -404,7 +424,7 @@ def quantile_global_scale(
     weights: np.ndarray,
     prior_prediction_column: Union[str, int, None],
     link_func,
-) -> None:
+) -> Tuple:
right
n_bins = len(y_pred_bins)
parameters = np.zeros(n_bins)
uncertainties = np.zeros(n_bins)

# 10. Try to minimize a loss function given y, y_pred and the weights?
correct
But do this independently for each bin of the feature at hand.
sorting = feature.lex_binned_data.argsort()  # 1. get element index row-wise ordered from smallest to greatest
sorted_bins = feature.lex_binned_data[sorting]  # 2. return the bins sorted from smallest to greatest
# do not quite understand how this works
# as my example with a3=np.random.rand(3,10) and a3[a3.argsort()] was returning an IndexError
lex_binned_data is just a vector
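Indeed, indexing with argsort round-trips only for 1-D arrays; for a 2-D array the per-row indices end up misapplied to the rows, which explains the IndexError in the example above:

```python
import numpy as np

v = np.array([3, 1, 2])
ok_1d = (v[v.argsort()] == np.array([1, 2, 3])).all()  # fine for a 1-D vector

a3 = np.random.rand(3, 10)
raised = False
try:
    a3[a3.argsort()]  # 2-D argsort yields per-row indices up to 9,
                      # which then index the only 3 rows of a3
except IndexError:
    raised = True
```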
for i in empty_bins:
    y_pred_bins.insert(i, np.zeros((0, 3)))
    y_pred_bins.insert(i, np.zeros((0, 3)))  # ! TODO: [Q] Is the (0,3) format due to (y, y_hat, weights)?
yes
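So an empty bin is represented by a (0, 3) array: zero rows of (y, y_hat, weight). A quick illustration with made-up bins:

```python
import numpy as np

# Bins 0 and 2 have data; bin 1 never occurred in the feature.
y_pred_bins = [np.ones((2, 3)), np.ones((1, 3))]
empty_bins = [1]
for i in empty_bins:
    # zero rows of (y, y_hat, weight) -> a structurally valid empty bin
    y_pred_bins.insert(i, np.zeros((0, 3)))
# y_pred_bins now has one entry per bin, keeping the bin order intact
```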
empty_bins = list(set(bins) ^ set(all_bins))
empty_bins = set(bins) ^ set(all_bins)
# 9. returns the elements which are either in set(bins)
# or set(all_bins).
This is an exclusive or. The idea is to find empty bins: these are in all_bins but not in bins. Empty bins can occur in multi-dimensional features (which are mapped to a one-dimensional structure before this function).
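Since bins is a subset of all_bins, the symmetric difference reduces to a plain set difference; a small example with made-up labels:

```python
all_bins = {0, 1, 2, 3}
bins = {0, 2, 3}                       # bin 1 never occurred in the data
empty_bins = set(bins) ^ set(all_bins)
# symmetric difference: elements in exactly one of the two sets.
# Because bins <= all_bins, this equals all_bins - bins, i.e. {1}.
```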
empty_bins = set(bins) ^ set(all_bins)
# 9. returns the elements which are either in set(bins)
# or set(all_bins).
# ! TODO: list can be removed as only iterator is used below
fair enough
# 9. returns the elements which are either in set(bins)
# or set(all_bins).
# ! TODO: list can be removed as only iterator is used below
# why does this return the empty bins though?
see comment above about multi-dimensional features
# because all_bins is a superset of sorted_bins, this is tantamount to finding the values
# which are not in bins. Bins return a list of all the values
# check, for example, a5 = np.array([[i*j + 1 for i in range(0,3)] for j in range(0,3)])
# bins, split_indices = np.unique(a5, return_index=True)
The point is that you do not find empty bins in lex_binned_data. So there can be multi-dimensional bins which we would miss here. But we have to include them so as not to mess up the multi-dimensional binning structure.
Hope my answers are kind of understandable :).
Thanks for the clarification @FelixWick! 🙏
A set of questions to better understand the CB algo