Review/generic loss questions #60

Closed


lbventura (Collaborator):

A set of questions to better understand the cyclic boosting (CB) algorithm.

sorted_bins = feature.lex_binned_data[sorting]
bins, split_indices = np.unique(sorted_bins, return_index=True)
split_indices = split_indices[1:]
# ! TODO: [Q] Why these operations?
Collaborator:

The general idea is to perform an independent optimization in each bin of the feature at hand. The outcome of each of these optimizations is the factor (or summand, for additive CB modes) and its uncertainty.
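The mechanics of those three lines can be seen on a small made-up bin vector (the data here is invented for illustration; `lex_binned_data` normally comes from the feature):

```python
import numpy as np

# Hypothetical stand-in for feature.lex_binned_data: one bin index per sample.
lex_binned_data = np.array([2, 0, 1, 0, 2, 1, 1])

# 1. indices that sort the samples by bin
sorting = lex_binned_data.argsort()
# 2. the bin labels in sorted order
sorted_bins = lex_binned_data[sorting]            # [0, 0, 1, 1, 1, 2, 2]
# 3. unique bin labels plus the index where each bin starts
bins, split_indices = np.unique(sorted_bins, return_index=True)
# bins -> [0, 1, 2], split_indices -> [0, 2, 5]
# 4. drop the leading 0: only the interior cut points are needed later
split_indices = split_indices[1:]                 # [2, 5]
```

The start indices of all bins except the first are exactly the cut points at which the sorted data can later be split into per-bin blocks.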

bins, split_indices = np.unique(sorted_bins, return_index=True)
split_indices = split_indices[1:]
# ! TODO: [Q] Why these operations?
# I probably do not understand the CB algo, high level explainer would be great
Collaborator:

The difference from the other (older) CB modes is that here we do the optimizations numerically, by explicitly minimizing a loss, rather than analytically. But in principle, it is the same thing.
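As an illustration of "numerical vs. analytical" (using a plain weighted squared loss, which need not be the loss the library actually minimizes): the analytical minimizer is the weighted mean, and a brute-force numerical minimization lands on the same value.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=100)   # invented per-bin targets
w = rng.uniform(0.5, 1.5, size=100)            # invented sample weights

# analytical: the minimizer of sum(w * (y - c)**2) over c is the weighted mean
analytical = np.average(y, weights=w)

# numerical: evaluate the loss on a fine grid of candidates and take the argmin
candidates = np.linspace(y.min(), y.max(), 10_001)
losses = (w[:, None] * (y[:, None] - candidates[None, :]) ** 2).sum(axis=0)
numerical = candidates[losses.argmin()]
# the two estimates agree up to the grid resolution
```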

split_indices = split_indices[1:]
# ! TODO: [Q] Why these operations?
# I probably do not understand the CB algo, high level explainer would be great
sorting = feature.lex_binned_data.argsort() # 1. get element index row-wise ordered from smallest to greatest
Collaborator:

correct

# ! TODO: [Q] Why these operations?
# I probably do not understand the CB algo, high level explainer would be great
sorting = feature.lex_binned_data.argsort() # 1. get element index row-wise ordered from smallest to greatest
sorted_bins = feature.lex_binned_data[sorting] # 2. return the bins sorted from smallest to greatest
Collaborator:

correct

# as my example with a3=np.random.rand(3,10) and a3[a3.argsort()] was returning an IndexError
bins, split_indices = np.unique(
sorted_bins, return_index=True
) # 3. return only the unique values for each bin ordered
Collaborator:

This returns the unique values (only needed for the special case of empty bins in multi-dimensional features) and their indices. The latter are needed to split all the target and prediction values into the different bins.

bins, split_indices = np.unique(
sorted_bins, return_index=True
) # 3. return only the unique values for each bin ordered
split_indices = split_indices[1:] # 5. drop the zero index bin
Collaborator:

We are looking for bin ranges.
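Concretely, the interior start indices are exactly the cut points that `np.split` expects, so they define the bin ranges (toy values for illustration):

```python
import numpy as np

# bins start at positions 0, 2, and 5; dropping the leading 0 leaves the
# interior cut points, which np.split turns into per-bin ranges
sorted_vals = np.array([0, 0, 1, 1, 1, 2, 2])
split_indices = np.array([2, 5])
chunks = np.split(sorted_vals, split_indices)
# chunks -> [array([0, 0]), array([1, 1, 1]), array([2, 2])]
```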


y_pred = np.hstack((y[..., np.newaxis], self.unlink_func(pred.predict_link())[..., np.newaxis]))
# 6. joining the values of the target variable with those of the predictions
Collaborator:

correct

y_pred = np.hstack((y_pred, self.weights[..., np.newaxis]))
# 7. joining the previous matrix with the weights (of each input variable?)
Collaborator:

correct

y_pred_bins = np.split(y_pred[sorting], split_indices)
# 8. sort the predictions according to the bins (of the input variable?) and split this into bins
Collaborator:

Yes, split the target and prediction values into the bins of the feature considered here. This is done so that independent optimizations can be performed in the following.
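Steps 6-8 can be sketched on a tiny invented data set: stack target, prediction, and weights as columns, then split the re-sorted rows into per-bin blocks.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])          # invented targets
pred = np.array([1.1, 1.9, 3.2, 3.8])       # stand-in for unlinked predictions
weights = np.array([1.0, 1.0, 2.0, 1.0])    # invented sample weights

# 6./7. one row per sample, columns (y, y_hat, w)
y_pred = np.hstack(
    (y[:, np.newaxis], pred[:, np.newaxis], weights[:, np.newaxis])
)

# 8. reorder rows by bin and split into per-bin blocks
sorting = np.array([1, 3, 0, 2])            # some per-bin ordering of samples
split_indices = np.array([2])               # two bins: rows 0-1 and rows 2-3
y_pred_bins = np.split(y_pred[sorting], split_indices)
# each block has shape (n_bin, 3); its columns [:, 0], [:, 1], [:, 2]
# feed the independent per-bin optimization
```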

for bin in range(n_bins):
parameters[bin], uncertainties[bin] = self.optimization(
y_pred_bins[bin][:, 0], y_pred_bins[bin][:, 1], y_pred_bins[bin][:, 2]
)
# ! TODO: What parameters are being returned?
Collaborator:

The parameters returned are the factors (or summands, for additive CB modes) of the different bins and their uncertainties (needed for the smoothing).
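A toy stand-in for what such a per-bin optimization could return (this is NOT the library's `self.optimization`; both the factor formula and the uncertainty estimate here are invented for illustration):

```python
import numpy as np

def toy_optimization(y, y_hat, w):
    """Toy per-bin optimization: a multiplicative correction factor
    and a crude uncertainty estimate. NOT the library's implementation."""
    factor = np.average(y, weights=w) / np.average(y_hat, weights=w)
    # crude uncertainty: relative standard error of y within this bin
    uncertainty = y.std(ddof=1) / (np.sqrt(len(y)) * y.mean())
    return factor, uncertainty

y = np.array([2.0, 3.0, 4.0])       # targets in one bin
y_hat = np.array([2.5, 2.5, 2.5])   # predictions in that bin
w = np.ones(3)
factor, uncertainty = toy_optimization(y, y_hat, w)
# factor -> 1.2: this bin's predictions should be scaled up by 20%
```

Bins with few samples get a large uncertainty, which is what the subsequent smoothing across bins uses to down-weight them.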


neutral_factor = self.unlink_func(np.array(self.neutral_factor_link))
# 11. if there is one more bin corresponding to the neutral factor, then add it to the parameters
Collaborator:

correct

@@ -404,7 +424,7 @@ def quantile_global_scale(
     weights: np.ndarray,
     prior_prediction_column: Union[str, int, None],
     link_func,
-) -> None:
+) -> Tuple:
Collaborator:

right


n_bins = len(y_pred_bins)
parameters = np.zeros(n_bins)
uncertainties = np.zeros(n_bins)

# 10. Try to minimize a loss function given y, y_pred and the weights?
Collaborator:

correct

Collaborator:

But do this independently for each bin of the feature at hand.

sorting = feature.lex_binned_data.argsort() # 1. get element index row-wise ordered from smallest to greatest
sorted_bins = feature.lex_binned_data[sorting] # 2. return the bins sorted from smallest to greatest
# do not quite understand how this works
# as my example with a3=np.random.rand(3,10) and a3[a3.argsort()] was returning an IndexError
Collaborator:

lex_binned_data is just a (one-dimensional) vector.
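That is exactly why the `a3` example fails: fancy-indexing with `argsort` sorts a one-dimensional array, but for a two-dimensional array the per-row sort indices get applied to axis 0 and run out of range.

```python
import numpy as np

# 1-D: indexing with argsort sorts the array
a1 = np.array([3, 1, 2])
sorted_a1 = a1[a1.argsort()]          # array([1, 2, 3])

# 2-D: argsort returns per-row indices of shape (3, 10), but a3[...]
# indexes axis 0 (length 3) with values up to 9 -> IndexError
a3 = np.random.rand(3, 10)
failed = False
try:
    a3[a3.argsort()]
except IndexError:
    failed = True
```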

for i in empty_bins:
y_pred_bins.insert(i, np.zeros((0, 3)))
y_pred_bins.insert(i, np.zeros((0, 3))) # ! TODO: [Q] Is the (0,3) format due to (y, y_hat, weights)?
Collaborator:

yes

empty_bins = list(set(bins) ^ set(all_bins))
empty_bins = set(bins) ^ set(all_bins)
# 9. returns the elements which are either in set(bins)
# or set(all_bins).
Collaborator:

This is an exclusive or. The idea is to find the empty bins: these are in all_bins but not in bins. Empty bins can occur in multi-dimensional features (which are mapped to a one-dimensional structure before this function).
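A minimal sketch of the XOR step, with invented bin labels (since `all_bins` is a superset of `bins`, the symmetric difference reduces to "in `all_bins` but not in `bins`"):

```python
import numpy as np

# all_bins enumerates every possible bin; bins holds only the bins
# that actually occur in the data
all_bins = [0, 1, 2, 3, 4]
bins = np.array([0, 1, 3])            # bins 2 and 4 received no samples

empty_bins = set(bins) ^ set(all_bins)
# empty_bins -> {2, 4}
```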

empty_bins = set(bins) ^ set(all_bins)
# 9. returns the elements which are either in set(bins)
# or set(all_bins).
# ! TODO: list can be removed as only iterator is used below
Collaborator:

fair enough

# 9. returns the elements which are either in set(bins)
# or set(all_bins).
# ! TODO: list can be removed as only iterator is used below
# why does this return the empty bins though?
Collaborator:

see comment above about multi-dimensional features

# because all_bins is a superset of sorted_bins, this is tantamount to finding the values
# which are not in bins. Bins return a list of all the values
# check, for example, a5= np.array([[i*j + 1 for i in range(0,3)] for j in range(0,3)])
# bins , split_indices = np.unique(a5, return_index=True)
Collaborator:

The point is, empty bins do not show up in lex_binned_data at all, so multi-dimensional bins could be missed here. But we have to include them so as not to mess up the multi-dimensional binning structure.
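The insertion of empty placeholders can be sketched like this (toy data; each placeholder has shape (0, 3), i.e. zero samples with columns (y, y_hat, w), so that bin index i still addresses the i-th bin of the full binning structure):

```python
import numpy as np

# bins 0 and 2 received data; bin 1 is empty and missing from the split
y_pred_bins = [np.ones((2, 3)), np.ones((1, 3))]
empty_bins = [1]

for i in sorted(empty_bins):
    y_pred_bins.insert(i, np.zeros((0, 3)))
# len(y_pred_bins) -> 3; y_pred_bins[1].shape -> (0, 3)
```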

@FelixWick (Collaborator):

Hope my answers are kind of understandable :).

@lbventura (Collaborator, Author):

Thanks for the clarification @FelixWick ! 🙏

lbventura closed this on Dec 22, 2023.