Review/generic loss questions #60

Closed


lbventura (Collaborator):

A set of questions to better understand the cyclic boosting (CB) algorithm.

sorted_bins = feature.lex_binned_data[sorting]
bins, split_indices = np.unique(sorted_bins, return_index=True)
split_indices = split_indices[1:]
# ! TODO: [Q] Why these operations?
Collaborator:

The general idea is to perform an independent optimization in each bin of the feature at hand. The outcome of each of these optimizations is the factor (or summand, for additive CB modes) and its uncertainty.
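The mechanics of those three lines can be seen on a small made-up bin vector (the data here is invented for illustration; `lex_binned_data` normally comes from the feature):

```python
import numpy as np

# Hypothetical stand-in for feature.lex_binned_data: one bin index per sample.
lex_binned_data = np.array([2, 0, 1, 0, 2, 1, 1])

# 1. indices that sort the samples by bin
sorting = lex_binned_data.argsort()
# 2. the bin labels in sorted order
sorted_bins = lex_binned_data[sorting]            # [0, 0, 1, 1, 1, 2, 2]
# 3. unique bin labels plus the index where each bin starts
bins, split_indices = np.unique(sorted_bins, return_index=True)
# bins -> [0, 1, 2], split_indices -> [0, 2, 5]
# 4. drop the leading 0: only the interior cut points are needed later
split_indices = split_indices[1:]                 # [2, 5]
```

The start indices of all bins except the first are exactly the cut points at which the sorted data can later be split into per-bin blocks.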

bins, split_indices = np.unique(sorted_bins, return_index=True)
split_indices = split_indices[1:]
# ! TODO: [Q] Why these operations?
# I probably do not understand the CB algo, high level explainer would be great
Collaborator:

The difference from the other (older) CB modes is that here we do the optimizations numerically, by explicitly minimizing a loss, rather than analytically. But in principle, it is the same thing.
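As an illustration of "numerical vs. analytical" (using a plain weighted squared loss, which need not be the loss the library actually minimizes): the analytical minimizer is the weighted mean, and a brute-force numerical minimization lands on the same value.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=100)   # invented per-bin targets
w = rng.uniform(0.5, 1.5, size=100)            # invented sample weights

# analytical: the minimizer of sum(w * (y - c)**2) over c is the weighted mean
analytical = np.average(y, weights=w)

# numerical: evaluate the loss on a fine grid of candidates and take the argmin
candidates = np.linspace(y.min(), y.max(), 10_001)
losses = (w[:, None] * (y[:, None] - candidates[None, :]) ** 2).sum(axis=0)
numerical = candidates[losses.argmin()]
# the two estimates agree up to the grid resolution
```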

split_indices = split_indices[1:]
# ! TODO: [Q] Why these operations?
# I probably do not understand the CB algo, high level explainer would be great
sorting = feature.lex_binned_data.argsort() # 1. get element index row-wise ordered from smallest to greatest
Collaborator:

correct

# ! TODO: [Q] Why these operations?
# I probably do not understand the CB algo, high level explainer would be great
sorting = feature.lex_binned_data.argsort() # 1. get element index row-wise ordered from smallest to greatest
sorted_bins = feature.lex_binned_data[sorting] # 2. return the bins sorted from smallest to greatest
Collaborator:

correct

# as my example with a3=np.random.rand(3,10) and a3[a3.argsort()] was returning an IndexError
bins, split_indices = np.unique(
sorted_bins, return_index=True
) # 3. return only the unique values for each bin ordered
Collaborator:

This returns the unique values (only needed for the special case of empty bins in multi-dimensional features) and their indices. The latter are needed to split all the target and prediction values into the different bins.

bins, split_indices = np.unique(
sorted_bins, return_index=True
) # 3. return only the unique values for each bin ordered
split_indices = split_indices[1:] # 5. drop the zero index bin
Collaborator:

We are looking for bin ranges.
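Concretely, the interior start indices are exactly the cut points that `np.split` expects, so they define the bin ranges (toy values for illustration):

```python
import numpy as np

# bins start at positions 0, 2, and 5; dropping the leading 0 leaves the
# interior cut points, which np.split turns into per-bin ranges
sorted_vals = np.array([0, 0, 1, 1, 1, 2, 2])
split_indices = np.array([2, 5])
chunks = np.split(sorted_vals, split_indices)
# chunks -> [array([0, 0]), array([1, 1, 1]), array([2, 2])]
```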


y_pred = np.hstack((y[..., np.newaxis], self.unlink_func(pred.predict_link())[..., np.newaxis]))
# 6. joining the values of the target variable with those of the predictions
Collaborator:

correct

y_pred = np.hstack((y_pred, self.weights[..., np.newaxis]))
# 7. joining the previous matrix with the weights (of each input variable?)
Collaborator:

correct

y_pred_bins = np.split(y_pred[sorting], split_indices)
# 8. sort the predictions according to the bins (of the input variable?) and split this into bins
Collaborator:

Yes, split the target and prediction values into the bins of the feature considered here. This is done so that independent optimizations can be performed in the following.
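Steps 6-8 can be sketched on a tiny invented data set: stack target, prediction, and weights as columns, then split the re-sorted rows into per-bin blocks.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])          # invented targets
pred = np.array([1.1, 1.9, 3.2, 3.8])       # stand-in for unlinked predictions
weights = np.array([1.0, 1.0, 2.0, 1.0])    # invented sample weights

# 6./7. one row per sample, columns (y, y_hat, w)
y_pred = np.hstack(
    (y[:, np.newaxis], pred[:, np.newaxis], weights[:, np.newaxis])
)

# 8. reorder rows by bin and split into per-bin blocks
sorting = np.array([1, 3, 0, 2])            # some per-bin ordering of samples
split_indices = np.array([2])               # two bins: rows 0-1 and rows 2-3
y_pred_bins = np.split(y_pred[sorting], split_indices)
# each block has shape (n_bin, 3); its columns [:, 0], [:, 1], [:, 2]
# feed the independent per-bin optimization
```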

for bin in range(n_bins):
parameters[bin], uncertainties[bin] = self.optimization(
y_pred_bins[bin][:, 0], y_pred_bins[bin][:, 1], y_pred_bins[bin][:, 2]
)
# ! TODO: What parameters are being returned?
Collaborator:

The parameters returned are the factors (or summands, for additive CB modes) of the different bins and their uncertainties (needed for the smoothing).
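A toy stand-in for what such a per-bin optimization could return (this is NOT the library's `self.optimization`; both the factor formula and the uncertainty estimate here are invented for illustration):

```python
import numpy as np

def toy_optimization(y, y_hat, w):
    """Toy per-bin optimization: a multiplicative correction factor
    and a crude uncertainty estimate. NOT the library's implementation."""
    factor = np.average(y, weights=w) / np.average(y_hat, weights=w)
    # crude uncertainty: relative standard error of y within this bin
    uncertainty = y.std(ddof=1) / (np.sqrt(len(y)) * y.mean())
    return factor, uncertainty

y = np.array([2.0, 3.0, 4.0])       # targets in one bin
y_hat = np.array([2.5, 2.5, 2.5])   # predictions in that bin
w = np.ones(3)
factor, uncertainty = toy_optimization(y, y_hat, w)
# factor -> 1.2: this bin's predictions should be scaled up by 20%
```

Bins with few samples get a large uncertainty, which is what the subsequent smoothing across bins uses to down-weight them.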


neutral_factor = self.unlink_func(np.array(self.neutral_factor_link))
# 11. if there is one more bin corresponding to the neutral factor, then add it to the parameters
Collaborator:

correct

@@ -404,7 +424,7 @@ def quantile_global_scale(
     weights: np.ndarray,
     prior_prediction_column: Union[str, int, None],
     link_func,
-) -> None:
+) -> Tuple:
Collaborator:

right


n_bins = len(y_pred_bins)
parameters = np.zeros(n_bins)
uncertainties = np.zeros(n_bins)

# 10. Try to minimize a loss function given y, y_pred and the weights?
Collaborator:

correct

Collaborator:

But do this independently for each bin of the feature at hand.

sorting = feature.lex_binned_data.argsort() # 1. get element index row-wise ordered from smallest to greatest
sorted_bins = feature.lex_binned_data[sorting] # 2. return the bins sorted from smallest to greatest
# do not quite understand how this works
# as my example with a3=np.random.rand(3,10) and a3[a3.argsort()] was returning an IndexError
Collaborator:

lex_binned_data is just a (one-dimensional) vector.
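That is exactly why the `a3` example fails: fancy-indexing with `argsort` sorts a one-dimensional array, but for a two-dimensional array the per-row sort indices get applied to axis 0 and run out of range.

```python
import numpy as np

# 1-D: indexing with argsort sorts the array
a1 = np.array([3, 1, 2])
sorted_a1 = a1[a1.argsort()]          # array([1, 2, 3])

# 2-D: argsort returns per-row indices of shape (3, 10), but a3[...]
# indexes axis 0 (length 3) with values up to 9 -> IndexError
a3 = np.random.rand(3, 10)
failed = False
try:
    a3[a3.argsort()]
except IndexError:
    failed = True
```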

for i in empty_bins:
y_pred_bins.insert(i, np.zeros((0, 3)))
y_pred_bins.insert(i, np.zeros((0, 3))) # ! TODO: [Q] Is the (0,3) format due to (y, y_hat, weights)?
Collaborator:

yes

empty_bins = list(set(bins) ^ set(all_bins))
empty_bins = set(bins) ^ set(all_bins)
# 9. returns the elements which are either in set(bins)
# or set(all_bins).
Collaborator:

This is an exclusive or. The idea is to find the empty bins: these are in all_bins but not in bins. Empty bins can occur in multi-dimensional features (which are mapped to a one-dimensional structure before this function).
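A minimal sketch of the XOR step, with invented bin labels (since `all_bins` is a superset of `bins`, the symmetric difference reduces to "in `all_bins` but not in `bins`"):

```python
import numpy as np

# all_bins enumerates every possible bin; bins holds only the bins
# that actually occur in the data
all_bins = [0, 1, 2, 3, 4]
bins = np.array([0, 1, 3])            # bins 2 and 4 received no samples

empty_bins = set(bins) ^ set(all_bins)
# empty_bins -> {2, 4}
```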

empty_bins = set(bins) ^ set(all_bins)
# 9. returns the elements which are either in set(bins)
# or set(all_bins).
# ! TODO: list can be removed as only iterator is used below
Collaborator:

fair enough

# 9. returns the elements which are either in set(bins)
# or set(all_bins).
# ! TODO: list can be removed as only iterator is used below
# why does this return the empty bins though?
Collaborator:

see comment above about multi-dimensional features

# because all_bins is a superset of sorted_bins, this is tantamount to finding the values
# which are not in bins. Bins return a list of all the values
# check, for example, a5= np.array([[i*j + 1 for i in range(0,3)] for j in range(0,3)])
# bins , split_indices = np.unique(a5, return_index=True)
Collaborator:

The point is, empty bins do not show up in lex_binned_data at all, so multi-dimensional bins could be missed here. But we have to include them so as not to mess up the multi-dimensional binning structure.
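The insertion of empty placeholders can be sketched like this (toy data; each placeholder has shape (0, 3), i.e. zero samples with columns (y, y_hat, w), so that bin index i still addresses the i-th bin of the full binning structure):

```python
import numpy as np

# bins 0 and 2 received data; bin 1 is empty and missing from the split
y_pred_bins = [np.ones((2, 3)), np.ones((1, 3))]
empty_bins = [1]

for i in sorted(empty_bins):
    y_pred_bins.insert(i, np.zeros((0, 3)))
# len(y_pred_bins) -> 3; y_pred_bins[1].shape -> (0, 3)
```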

@FelixWick (Collaborator):

Hope my answers are kind of understandable :).

@lbventura (Collaborator, Author):

Thanks for the clarification @FelixWick ! 🙏

lbventura closed this on Dec 22, 2023.