Make hurdle regression work #51

Open · wants to merge 20 commits into base: fatalities003
Conversation

@ekaterinakuzmina ekaterinakuzmina commented Nov 20, 2023

This PR adds the following changes:

  1. Fixes the HurdleRegression class in ViewsEstimators.py so that it works;
  2. Adds tests for this class with different models;
  3. Adds a test_hurdle_regression notebook that can be run to test the changes;
  4. Adds a FixedFirstSplitRegression class in ViewsEstimators.py that can
    • split the data into two groups based on the value of a split_by feature
    • fit models to each subset separately
    • WHAT IS UNCLEAR: since the split_by feature is not the target Y, the target Y stays the same for both subsets, meaning that without any additional transformation only regressors or only classifiers can be applied to both subsets. If we want to change the target Y based on the split_by feature, we can apply a rule, transform the data and then feed it into the HurdleRegression class. The use cases here require clarification. A rough sketch of the splitting idea follows after this list.
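
As a rough sketch of the splitting idea only (the class and argument names below are assumptions for illustration, not the exact code in this PR):

import numpy as np
from sklearn.base import BaseEstimator, clone

class FixedFirstSplitSketch(BaseEstimator):
    """Illustrative only: fit one sub-model per subset of rows, split on one feature."""

    def __init__(self, split_by, model_a, model_b):
        self.split_by = split_by   # column index of the feature used to split the rows
        self.model_a = model_a     # estimator for rows where the split feature is > 0
        self.model_b = model_b     # estimator for the remaining rows

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        mask = X[:, self.split_by] > 0          # assumed split rule
        self.model_a_ = clone(self.model_a).fit(X[mask], y[mask])
        self.model_b_ = clone(self.model_b).fit(X[~mask], y[~mask])
        return self

    def predict(self, X):
        X = np.asarray(X)
        mask = X[:, self.split_by] > 0
        preds = np.empty(X.shape[0])
        preds[mask] = self.model_a_.predict(X[mask])
        preds[~mask] = self.model_b_.predict(X[~mask])
        return preds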

@ekaterinakuzmina ekaterinakuzmina marked this pull request as ready for review November 21, 2023 14:46
@ekaterinakuzmina ekaterinakuzmina changed the title Refactor notebook to understand what is happening Make hurdle regression work Nov 21, 2023
@ekaterinakuzmina ekaterinakuzmina self-assigned this Nov 21, 2023
@ekaterinakuzmina ekaterinakuzmina added and then removed the bug label Nov 21, 2023
Collaborator
@hhegre hhegre left a comment

I was not able to run test_hurdle_regression.ipynb - it stopped on a ModuleNotFoundError when importing test_hurdle_regression. I may have done something stupid - please tell me how I should run this so I can explore how it works. Then I will look more closely at it.

Thanks for tidying up. Most of it looks good to me, but some of the changes seem to alter how the class works - this should be discussed before being implemented. Some cases I saw:

Why did you remove n_jobs=-2, and what is the effect on execution time?

Please motivate the code in lines 105ff. - If y has only one unique value it might be better that the model returns an error message?

.predict() should by default use predict_proba. The previous naming was not good, but behavior should not change from the previous version
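
For reference, one way the previous behaviour could be kept - a sketch only, assuming fitted sub-estimators as in the diff (a classifier with predict_proba and a plain regressor); hurdle_predict is a hypothetical helper name:

import numpy as np

def hurdle_predict(clf, reg, X):
    """Probability-weighted hurdle prediction: P(y > 0) from the classification
    stage times the regression-stage prediction."""
    p_nonzero = clf.predict_proba(X)[:, 1]     # probability of a non-zero outcome
    return np.asarray(reg.predict(X)) * p_nonzero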

@Polichinel Polichinel left a comment

I still need to do the last bit, but this should give you something to work with


I think we are moving away from using this big comprehensive ModelDefinition.py file, at least in its current form. In any case, you should make a model-specific config like the ones Noorain is working on. He posted an example in Slack, but the simplest thing is probably just to ask him about the specifics.

Author

Agree 100%. And this is why it is important to work in branches and merge, so we can build on each other's work. So, instead of asking Noorain and redoing what he has already done, I could pull his changes and build on top of them.


Good - out with these except for experimentation and sanity checks

'HGBRegressor': HistGradientBoostingRegressor(max_iter=200),
'HGBClassifier': HistGradientBoostingClassifier(max_iter=200),
estimators = {
'linear': LinearRegression(),


Why is n_jobs removed?

Also, it is unfortunate that so many hyperparameters (HPs) are hardcoded here. Ideally they would be in the config file, but I do realize that since there are multiple different models to choose from, this could get messy. Still, we need to find a way. This is important so that when we create a sweep file for W&B we can iterate over both different HPs and models.

Also, random_state should be defined in the config.
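
Just to illustrate the direction (names and values below are placeholders, not an existing file in this repo):

# Hypothetical config sketch: keep HPs per model name so a W&B sweep can
# iterate over both models and hyperparameters; random_state lives here too.
model_config = {
    'random_state': 42,
    'n_jobs': -2,
    'estimators': {
        'HGBRegressor': {'max_iter': 200},
        'HGBClassifier': {'max_iter': 200},
    },
}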

Author

Agree 100%. This is something where we should agree on how to implement it and then follow that agreement. I do not want to create more technical debt by just doing it my way if others do it differently :)

Author

On n_jobs - good catch. No particular reason. My main focus was to make the thing work and refactor it. As you said, n_jobs should not be defined inside these functions anyway; it should be in a config file, so one can easily change it and experiment. Will add it back for now as it was before.


self.reg_ = self._resolve_estimator(self.reg_name)
# Instantiate the regressor
self.reg_ = self._resolve_estimator(self.reg_name, random_state=42)


random_state should be defined in config

Author

Agree 100%.

force_all_finite=True) #'allow-nan'

# Set the number of features seen during fit
self.n_features_in_ = X.shape[1]

if X.shape[1] < 2:


Surely then it should be:

if X.shape[1] == 1:

Not important but still....

Author

Agree :) Fixed.


if X.shape[1] < 2:
raise ValueError('Cannot fit model when n_features = 1')

self.clf_ = self._resolve_estimator(self.clf_name)
# Instantiate the classifier
self.clf_ = self._resolve_estimator(self.clf_name, random_state=42)


random_state in config

self.clf_.fit(X, y > 0)
self.clf_fi = self.clf_.feature_importances_

# Check if there are more than one unique values in y


I don't get this. If there is only one unique value of y then something must have gone wrong way back in the data part. I think these kinds of checks are prudent, but this is a bit late to do it, no? And why try to fit anything if the data is wrong, i.e. with the DummyClassifier?

Author
@ekaterinakuzmina ekaterinakuzmina Nov 23, 2023

Good question! I made some changes here already, but I will explain why I was fumbling with checking whether a training set has only 1 sample:

  • I called the check_estimator(hr) function inside the test_hurdle_regression() function to do automatic sanity checks on the estimator (see the sketch right after this list). check_estimator() is a native sklearn check and, among other tests, it runs a test called check_fit2d_1sample to make sure the estimator can handle the edge case where only one sample is passed to the fitting stage. So, to pass this automatic sklearn test within my test function, I needed to do something hacky in the fit method :)
  • I decided to simply delete check_estimator(hr), so I do not need a hacky workaround to pass it. We can add explicit tests for the estimator to the testing function if needed.
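
The pattern looked roughly like this (a sketch, not the exact test code; the estimator names passed in are taken from the estimators lookup but may differ from the real defaults):

# Sketch of the pattern described above: running sklearn's generic estimator
# checks (including check_fit2d_1sample) on a HurdleRegression instance,
# assuming HurdleRegression is importable from ViewsEstimators.
from sklearn.utils.estimator_checks import check_estimator

def test_hurdle_regression():
    hr = HurdleRegression(clf_name='HGBClassifier', reg_name='HGBRegressor')
    check_estimator(hr)   # runs the whole battery of sklearn compatibility checks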

@Polichinel Polichinel left a comment

I think there is a larger issue in regards to how we structure things in VIEWS. My preference is clearly that model classes get their own scripts in a model_class dir, and that utility functions are collected in larger scripts in a utils dir. Config files then naturally go in a config dir and, lastly, the training is done in one script, placed in a training script dir, which imports what it needs from the other scripts and then executes.

I think - but correct me if I am wrong - that this is a rather uncontroversial structure that makes everything more parseable.
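
Something along these lines, just to make the proposal concrete (directory and file names are illustrative only, not an existing layout):

model_class/    # one script per model class, e.g. hurdle_regression.py
utils/          # shared helper functions collected in larger scripts
config/         # model-specific config files
training/       # training scripts that import from the above and execute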

# Check if there are more than one unique values in y
# and if yes, fit the classifier to X, so y that is > 0, becomes 1
# and if not, it is 0
if len(np.unique(y)) > 1:


Or am I misunderstanding this completely? Is y here the indicator Håvard wants to bring in? Which I thought was for the split-first thing (but what do I know?). In any case, y should only ever be the target feature - unless I'm completely misunderstanding this (which happens for sure).

Author

Apologies for the confusion - you are right, this does not have anything to do with the split-thing. The split-thing is in a different class. This part is related to the previous hacky thing with the check_fit2d_1sample test. When I had check_estimator() testing in the test_hurdle_regression() function, it ran the test for the edge case with 1 sample in the training data, and in that case there is only 1 target value :) So, again I needed a workaround to pass it.

Since I decided to simply not run the automatic check_estimator() in test_hurdle_regression(), the part that caused confusion is deleted.

Author

On the structure - your suggestion makes perfect sense to me, will implement it and we will see how it goes.

""" Lookup table for supported estimators.
This is necessary because sklearn estimator default arguments
must pass equality test, and instantiated sub-estimators are not equal. """

funcs = {'linear': LinearRegression(),


Is it not a bit imprudent that the same call can return both classifiers and regressors? Not a biggy but still. More on this later.

Author

Interesting point. I tend to agree that it could be better to separate regressors from classifiers.


if X.shape[1] < 2:
raise ValueError('Cannot fit model when n_features = 1')

self.clf_ = self._resolve_estimator(self.clf_name)


Here we use _resolve_estimator to get a classifier, but we do not check if it actually returns a classifier. Since the method is also capable of returning a regressor, it would be nice to have some sort of check here - or to have different calls for classification and regression respectively.

Author

Good point! I will see what I can add.


self.reg_ = self._resolve_estimator(self.reg_name)


Here we call _resolve_estimator to get a regression model, but we do not check if it is a regression model. I.e. if a user inputted the model names wrongly we could get a classifier here, no?

Author

Yep, would be nice to add checks, agree. Will see what I can add.
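
A minimal version of such a check could look like this (a sketch; resolve_checked is a hypothetical helper name, and the real guard may end up directly inside fit() instead):

# Sketch: resolve an estimator by name and verify it is of the expected kind
# before fitting; the helper name and error wording are assumptions.
from sklearn.base import is_classifier, is_regressor

def resolve_checked(resolver, name, want):
    est = resolver(name)
    ok = is_classifier(est) if want == 'classifier' else is_regressor(est)
    if not ok:
        raise ValueError(f"'{name}' does not resolve to a {want}")
    return est

# e.g. inside fit():
#   self.clf_ = resolve_checked(self._resolve_estimator, self.clf_name, 'classifier')
#   self.reg_ = resolve_checked(self._resolve_estimator, self.reg_name, 'regressor')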



####################
# HurdleRegression #
####################
class HurdleRegression(BaseEstimator):


Given how much of the script is taken up by this class, I think it should get its own class script. We might need to collect these kinds of scripts for all models used either in a utils folder or in a dedicated model_class folder

Author

Agree, had the same thought. This is a refactoring decision I would like to agree on collaboratively, so we follow it consistently.


def manual_test():


Wait, do we then have this function here and then a new class below? That seems a bit messy, but perhaps I am misunderstanding something.

Author

Agree, it should not be like this. Probably it would be best to have either a tests.py file and a viewser_models.py with the different classes, or a separate file for each model class together with its test. Again - something I would be happy to discuss and agree on.


assert y_pred_prob.shape == y_test.shape, "Probability predictions and y do not have the same shape"



New class, I think it deserves a new script - but that might just be me :)

Author

Connected to your previous comments - I would like to discuss the refactoring decisions first and then implement them.


#if __name__ == '__main__':


Why remove?

Author

I do not see why we need it. Do we plan to execute this file directly? My understanding is that we will import the class from its file in some other file, for example from Tools.models.hurdle_regression_model import HurdleRegression, in which case we do not need this line. Afaik, this line is needed only if the code in the file needs to be executed directly. Let me know if you think it is needed.
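
For completeness, the line in question is the standard Python entry-point guard; it only matters if the file is meant to be executed directly, e.g.:

# Runs manual_test() only when this file is executed directly,
# not when HurdleRegression is imported from another module.
if __name__ == '__main__':
    manual_test()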


ekaterinakuzmina commented Nov 23, 2023

Thank you for the review, @hhegre !

Why did you remove n_jobs=-2, and what is the effect on execution time?

As I also replied to Simon - I was focused more on the structure and on making it run than on what goes inside; besides, what goes inside should ideally be defined outside of the class. Good catch though - I added it back as it was before. We should discuss later how to move these into configs in a consistent way.

Please motivate the code in lines 105ff. - If y has only one unique value it might be better that the model returns an error message?

Very good idea with the error. I will work more on this part today.
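
Concretely, something along these lines called at the top of fit() (a sketch of the suggested error; the helper name and message wording are assumptions):

# Sketch: fail loudly instead of silently fitting a fallback model
# when the target has no variation.
import numpy as np

def check_target_variation(y):
    """Raise if y has fewer than two unique values."""
    if len(np.unique(y)) < 2:
        raise ValueError('y contains a single unique value; the hurdle classifier stage cannot be fit')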

.predict() should by default use predict_proba. The previous naming was not good, but behavior should not change from the previous version

Good catch. I will change them.
