
Estimator #57

Open · wants to merge 9 commits into main
Conversation

lionelkusch (Collaborator):

The estimators Dnn_learner and RandomForestModified were no longer tested after the BBI file was removed. I implemented some tests to improve coverage.

I moved these files into a sub-module of hidimstat because they are not part of the core methods proposed by the library.
Moreover, Dnn_learner depends on torch and torchmetrics, which are not essential for the other methods of the library. Consequently, moving these files into a sub-module removes this requirement for users of the other methods.
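
For context, a minimal sketch of the guarded-import pattern such a sub-module can use so that torch stays optional (the error message and layout are illustrative, not the actual hidimstat code):

    try:
        import torch  # noqa: F401
        import torchmetrics  # noqa: F401
    except ImportError as exc:
        # Illustrative: fail only when the sub-module itself is imported,
        # so `import hidimstat` never requires torch.
        raise ImportError(
            "Dnn_learner requires the optional dependencies torch and "
            "torchmetrics; install them to use this sub-module."
        ) from exc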

@@ -241,6 +241,7 @@ def fit(self, X, y=None):
loss = np.array(res_ens[4])

if self.n_ensemble == 1:
raise Warning("The model can't be fit with n_ensemble = 1")
lionelkusch (Collaborator, Author):

@jpaillard Can you check whether it's correct that there is no fitting without multiple ensembles?

jpaillard (Collaborator):

I think there is a fit, as suggested by the call to joblib_ensemble_dnnet (by the way, should this be a private function? Should we integrate it into the same module for clarity?).

The following lines select the n_ensemble best models (which is useless when n_ensemble==1).

I would suggest a function _keep_n_ensemble, called at line 243 and gathering the code up to line 261, to manage this distinction.
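
A possible shape for that helper, assuming the per-model losses are kept in an array (the name follows jpaillard's suggestion; the signature is hypothetical, not existing code):

    import numpy as np

    def _keep_n_ensemble(models, loss, n_ensemble):
        # Keep the n_ensemble models with the lowest validation loss;
        # with n_ensemble == 1 the single fitted model is returned
        # as-is, so no selection step is needed.
        if n_ensemble == 1:
            return models
        best = np.argsort(loss)[:n_ensemble]
        return [models[i] for i in best]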

@@ -283,6 +284,9 @@ def encode_outcome(self, y, train=True):
y = y.reshape(-1, 1)
if self.problem_type == "regression":
list_y.append(y)
# Encoding the target with the ordinal case
lionelkusch (Collaborator, Author):

@jpaillard @bthirion
Can you tell me what the "ordinal" method is, if you know?
If so, do you think it is worth keeping? (The function was only half implemented.)
If it is worth keeping, can you check whether my modification is correct?

bthirion (Contributor):

It stands for regression problems where the ordering of values matters, but not the values themselves. Usually the values are discretized. I propose to keep it for the moment.
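
For illustration, a common way to encode a discretized ordinal target as cumulative binary indicators (a generic sketch, not necessarily what encode_outcome implements):

    import numpy as np

    def encode_ordinal(y):
        # For K ordered levels, each sample becomes K-1 indicators
        # answering "is y strictly above level j?": the ordering of
        # the levels matters, but their numeric values do not.
        levels = np.sort(np.unique(y))
        return (y[:, None] > levels[None, :-1]).astype(float)

    # encode_ordinal(np.array([0, 1, 2])) yields
    # [[0., 0.], [1., 0.], [1., 1.]]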

if samples_in_leaf.size > 0:
leaf_samples.append(
y_minus_i[rng.choice(samples_in_leaf, size=random_samples)]
)
lionelkusch (Collaborator, Author):

@jpaillard @bthirion
I modified the function to handle the case where the leaf samples are empty.
However, I don't know what the function was doing.
Can you confirm whether this is the correct way to do it?

bthirion (Contributor):

This should never be empty by construction (random forests represent the samples in tree structures). By default, there is a minimum number of samples in each leaf.
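
For reference, a small scikit-learn illustration of that guarantee (the data and variable names are made up):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 5)), rng.normal(size=100)

    # min_samples_leaf guarantees that every leaf contains at least
    # that many training samples, so a leaf index returned by apply()
    # always maps to a non-empty set of training samples.
    forest = RandomForestRegressor(n_estimators=10, min_samples_leaf=2).fit(X, y)
    leaf_indices = forest.apply(X)  # shape (n_samples, n_estimators)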

assert not np.all(predict_prob[:, 0] == 0)
assert not np.all(predict_prob[:, 1] == 0)
# Check if the predicted probabilities are not all ones for each class
assert not np.all(predict_prob[:, 0] == 1)
lionelkusch (Collaborator, Author):

@jpaillard
Can you check whether there are too many assertions, or whether I am missing any?

bthirion (Contributor):

There are probably enough ;-)
We should just make sure they're not redundant.

learner = DnnLearnerSingle(do_hypertuning=True, problem_type="ordinal", n_jobs=10, verbose=0)
learner.fit(X, y)
predict_prob = learner.predict_proba(X)[:,0]
# Check if the predicted class labels match the true labels for at least one instance
lionelkusch (Collaborator, Author):

@jpaillard @bthirion
Can you help me define some tests for this method?

# Check if the feature importances are not all close to zero
assert not np.allclose(learner.feature_importances_, 0)
# Check if the feature importances are not all close to one
assert not np.allclose(learner.feature_importances_, 1)
lionelkusch (Collaborator, Author):

@jpaillard
Can you check whether there are too many assertions, or whether I am missing any?

codecov bot commented Dec 13, 2024

Codecov Report

Attention: Patch coverage is 90.83558% with 34 lines in your changes missing coverage. Please review.

Project coverage is 94.39%. Comparing base (7571832) to head (3f7a5ea).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
hidimstat/estimator/_utils/u_Dnn_learner.py 83.82% 33 Missing ⚠️
hidimstat/estimator/Dnn_learner_single.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #57       +/-   ##
===========================================
+ Coverage   77.11%   94.39%   +17.27%     
===========================================
  Files          46       52        +6     
  Lines        2465     2603      +138     
===========================================
+ Hits         1901     2457      +556     
+ Misses        564      146      -418     

☔ View full report in Codecov by Sentry.

bthirion (Contributor) left a comment:

Thx for opening this! I have a bunch of comments.


leaf_samples.append(
y_minus_i[rng.choice(samples_in_leaf, size=random_samples)]
)
if samples_in_leaf.size > 0:
bthirion (Contributor):

You can remove the condition here too.

bthirion (Contributor):

Is this needed?

Data matrix
y : np.array
Target vector
grps : np.array
bthirion (Contributor):

Suggested change:
-    grps : np.array
+    groups : np.array

self.loss = 0

def forward(self, x):
if self.group_stacking:
bthirion (Contributor):

docstring?

x = torch.cat(list_stacking, dim=1)
return self.layers(x)

def training_step(self, batch, device, problem_type):
bthirion (Contributor):

docstring?

loss = F.binary_cross_entropy_with_logits(y_pred, y)
return loss

def validation_step(self, batch, device, problem_type):
bthirion (Contributor):

docstring?

"batch_size": len(X),
}

def validation_epoch_end(self, outputs, problem_type):
bthirion (Contributor):

docstring?

print("Epoch [{}], val_mse: {:.4f}".format(epoch + 1, result["val_mse"]))


def _evaluate(model, loader, device, problem_type):
bthirion (Contributor):

docstring (1 line)
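
For example, a one-line docstring of the kind requested (the wording is guessed from the function's name and arguments, not from its actual behaviour):

    def _evaluate(model, loader, device, problem_type):
        """Run model on the batches of loader and return the aggregated validation metrics."""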


jpaillard (Collaborator):

As discussed with you, as a user, I don't like integrating the ensembling (n_ensemble) and hyper-parameter tuning (do_hypertuning, dict_hypertuning) in a single class, which becomes huge.

Also, I think other libraries (sklearn for ensembling, optuna for hyper-parameters) offer more and better options for these advanced training strategies.

I suggest separating these aspects from the DNN_learner class and leaving it up to the user to optimize the training separately from hidimstat.
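
To make the proposed separation concrete, a hedged sketch that tunes with optuna and ensembles with scikit-learn; the import path is assumed from this PR's file layout, "lr" is a placeholder for whatever hyper-parameters DnnLearnerSingle actually exposes, and X, y are synthetic:

    import numpy as np
    import optuna
    from sklearn.ensemble import BaggingRegressor
    from sklearn.model_selection import cross_val_score
    # Import path assumed from the file layout in this PR.
    from hidimstat.estimator.Dnn_learner_single import DnnLearnerSingle

    X = np.random.randn(200, 10)
    y = X[:, 0] + 0.1 * np.random.randn(200)

    def objective(trial):
        # "lr" stands in for the real hyper-parameters of DnnLearnerSingle.
        learner = DnnLearnerSingle(lr=trial.suggest_float("lr", 1e-4, 1e-1, log=True))
        return cross_val_score(learner, X, y, cv=3).mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)

    # Ensembling handled by sklearn instead of n_ensemble inside the class.
    ensemble = BaggingRegressor(
        estimator=DnnLearnerSingle(**study.best_params), n_estimators=10
    ).fit(X, y)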

lionelkusch (Collaborator, Author):

The primary aim of this pull request was to increase test coverage and to separate the estimation functions from the other methods in order to reduce the dependency on torch.

Most of your comments concern code that was already there. I hadn't intended to deal with it for now because, for me, it wasn't the priority. However, if you think it's very important, I can do it.
