Make hurdle regression work #51

Open · wants to merge 20 commits into base: fatalities003
Conversation

@ekaterinakuzmina ekaterinakuzmina commented Nov 20, 2023

This PR adds the following changes:

  1. Fixes the HurdleRegression class in ViewsEstimators.py so that it works;
  2. Adds tests for this class with different models;
  3. Adds a test_hurdle_regression notebook that can be run to test the changes;
  4. Adds a FixedFirstSplitRegression class in ViewsEstimators.py that can
    • split the data into two groups based on the value of a split_by feature
    • fit models to each subset separately
    • WHAT IS UNCLEAR: since the split_by feature is not the target Y, the target Y stays the same for both subsets, meaning that without any additional transformation only regressors or only classifiers can be applied to both subsets. If we want to change the target Y based on the split_by feature, we can apply a rule, transform the data and then feed it into the HurdleRegression class. The use cases here require clarification. A rough sketch of the splitting idea follows after this list.
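
As a rough sketch of the splitting idea only (the class and argument names below are assumptions for illustration, not the exact code in this PR):

import numpy as np
from sklearn.base import BaseEstimator, clone

class FixedFirstSplitSketch(BaseEstimator):
    """Illustrative only: fit one sub-model per subset of rows, split on one feature."""

    def __init__(self, split_by, model_a, model_b):
        self.split_by = split_by   # column index of the feature used to split the rows
        self.model_a = model_a     # estimator for rows where the split feature is > 0
        self.model_b = model_b     # estimator for the remaining rows

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        mask = X[:, self.split_by] > 0          # assumed split rule
        self.model_a_ = clone(self.model_a).fit(X[mask], y[mask])
        self.model_b_ = clone(self.model_b).fit(X[~mask], y[~mask])
        return self

    def predict(self, X):
        X = np.asarray(X)
        mask = X[:, self.split_by] > 0
        preds = np.empty(X.shape[0])
        preds[mask] = self.model_a_.predict(X[mask])
        preds[~mask] = self.model_b_.predict(X[~mask])
        return preds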

@ekaterinakuzmina ekaterinakuzmina marked this pull request as ready for review November 21, 2023 14:46
@ekaterinakuzmina ekaterinakuzmina changed the title Refactor notebook to understand what is happening Make hurdle regression work Nov 21, 2023
@ekaterinakuzmina ekaterinakuzmina self-assigned this Nov 21, 2023
@ekaterinakuzmina ekaterinakuzmina added and then removed the bug label Nov 21, 2023
Collaborator
@hhegre hhegre left a comment

I was not able to run test_hurdle_regression.ipynb - it stopped on a ModuleNotFoundError when importing test_hurdle_regression. I may have done something stupid - please tell me how I should run this so I can explore how it works. Then I will look more closely at it.

Thanks for tidying up. Most of it looks good to me, but some of the changes seem to alter how the class works - this should be discussed before being implemented. Some cases I saw:

Why did you remove n_jobs=-2, and what is the effect on execution time?

Please motivate the code in lines 105ff. - If y has only one unique value it might be better that the model returns an error message?

.predict() should by default use predict_proba. The previous naming was not good, but behavior should not change from the previous version
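
For reference, one way the previous behaviour could be kept - a sketch only, assuming fitted sub-estimators as in the diff (a classifier with predict_proba and a plain regressor); hurdle_predict is a hypothetical helper name:

import numpy as np

def hurdle_predict(clf, reg, X):
    """Probability-weighted hurdle prediction: P(y > 0) from the classification
    stage times the regression-stage prediction."""
    p_nonzero = clf.predict_proba(X)[:, 1]     # probability of a non-zero outcome
    return np.asarray(reg.predict(X)) * p_nonzero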

@Polichinel Polichinel left a comment

I still need to do the last bit, but this should give you something to work with


I think we are moving away from using this big comprehensive ModelDefinition.py file, at least in its current form. In any case, you should make a model-specific config like the ones Noorain is working on. He posted an example in Slack, but the simplest thing is probably just to ask him about the specifics.

Author

Agree 100%. And this is why it is important to work in branches and merge, so we can build on each other's work. So, instead of asking Noorain and redoing what he has already done, I could pull his changes and build on top of them.


Good - out with these except for experimentation and sanity checks

'HGBRegressor': HistGradientBoostingRegressor(max_iter=200),
'HGBClassifier': HistGradientBoostingClassifier(max_iter=200),
estimators = {
'linear': LinearRegression(),


Why is n_jobs removed?

Also, it is unfortunate that so many hyperparameters (HPs) are hardcoded here. Ideally they would be in the config file, but I do realize that since there are multiple different models to choose from, this could get messy. Still, we need to find a way. This is important so that when we create a sweep file for W&B we can iterate over both different HPs and models.

Also, random_state should be defined in the config.
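
Just to illustrate the direction (names and values below are placeholders, not an existing file in this repo):

# Hypothetical config sketch: keep HPs per model name so a W&B sweep can
# iterate over both models and hyperparameters; random_state lives here too.
model_config = {
    'random_state': 42,
    'n_jobs': -2,
    'estimators': {
        'HGBRegressor': {'max_iter': 200},
        'HGBClassifier': {'max_iter': 200},
    },
}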

Author

Agree 100%. This is something where we should agree on how to implement it and then follow that agreement. I do not want to create more technical debt by just doing it my way if others do it differently :)

Author

On n_jobs - good catch. No particular reason. My main focus was to make the thing work and refactor it. As you said, n_jobs should not be defined inside these functions anyway; it should be in a config file, so one can easily change it and experiment. Will add it back for now as it was before.


self.reg_ = self._resolve_estimator(self.reg_name)
# Instantiate the regressor
self.reg_ = self._resolve_estimator(self.reg_name, random_state=42)


random_state should be defined in config

Author

Agree 100%.

force_all_finite=True) #'allow-nan'

# Set the number of features seen during fit
self.n_features_in_ = X.shape[1]

if X.shape[1] < 2:


Surely then it should be:

if X.shape[1] == 1:

Not important but still....

Author

Agree :) Fixed.


if X.shape[1] < 2:
raise ValueError('Cannot fit model when n_features = 1')

self.clf_ = self._resolve_estimator(self.clf_name)
# Instantiate the classifier
self.clf_ = self._resolve_estimator(self.clf_name, random_state=42)


random_state in config

self.clf_.fit(X, y > 0)
self.clf_fi = self.clf_.feature_importances_

# Check if there are more than one unique values in y


I don't get this. If there is only one unique value of y then something must have gone wrong way back in the data part. I think these kinds of checks are prudent, but this is a bit late to do it, no? And why try to fit anything if the data is wrong, i.e. with the DummyClassifier?

Author
@ekaterinakuzmina ekaterinakuzmina Nov 23, 2023

Good question! I made some changes here already, but I will explain why I was fumbling with checking whether a training set has only 1 sample:

  • I called the check_estimator(hr) function inside the test_hurdle_regression() function to do automatic sanity checks on the estimator (see the sketch right after this list). check_estimator() is a native sklearn check and, among other tests, it runs a test called check_fit2d_1sample to make sure the estimator can handle the edge case where only one sample is passed to the fitting stage. So, to pass this automatic sklearn test within my test function, I needed to do something hacky in the fit method :)
  • I decided to simply delete check_estimator(hr), so I do not need a hacky workaround to pass it. We can add explicit tests for the estimator to the testing function if needed.
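
The pattern looked roughly like this (a sketch, not the exact test code; the estimator names passed in are taken from the estimators lookup but may differ from the real defaults):

# Sketch of the pattern described above: running sklearn's generic estimator
# checks (including check_fit2d_1sample) on a HurdleRegression instance,
# assuming HurdleRegression is importable from ViewsEstimators.
from sklearn.utils.estimator_checks import check_estimator

def test_hurdle_regression():
    hr = HurdleRegression(clf_name='HGBClassifier', reg_name='HGBRegressor')
    check_estimator(hr)   # runs the whole battery of sklearn compatibility checks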

@Polichinel Polichinel left a comment

I think there is a larger issue in regards to how we structure things in VIEWS. My preference is clearly that model classes get their own scripts in a model_class dir, and that utility functions are collected in larger scripts in a utils dir. Config files then naturally go in a config dir and, lastly, the training is done in one script, placed in a training script dir, which imports what it needs from the other scripts and then executes.

I think - but correct me if I am wrong - that this is a rather uncontroversial structure that makes everything more parseable.
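
Something along these lines, just to make the proposal concrete (directory and file names are illustrative only, not an existing layout):

model_class/    # one script per model class, e.g. hurdle_regression.py
utils/          # shared helper functions collected in larger scripts
config/         # model-specific config files
training/       # training scripts that import from the above and execute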

# Check if there are more than one unique values in y
# and if yes, fit the classifier to X, so y that is > 0, becomes 1
# and if not, it is 0
if len(np.unique(y)) > 1:


Or am I misunderstanding this completely? Is y here the indicator Håvard wants to bring in? Which I thought was for the split-first thing (but what do I know?). In any case, y should only ever be the target feature - unless I'm completely misunderstanding this (which happens for sure).

Author

Apologies for the confusion - you are right, this does not have anything to do with the split-thing. The split-thing is in a different class. This part is related to the previous hacky thing with the check_fit2d_1sample test. When I had check_estimator() testing in the test_hurdle_regression() function, it ran the test for the edge case with 1 sample in the training data, and in that case there is only 1 target value :) So, again I needed a workaround to pass it.

Since I decided to simply not run the automatic check_estimator() in test_hurdle_regression(), the part that caused confusion is deleted.

Author

On the structure - your suggestion makes perfect sense to me, will implement it and we will see how it goes.

""" Lookup table for supported estimators.
This is necessary because sklearn estimator default arguments
must pass equality test, and instantiated sub-estimators are not equal. """

funcs = {'linear': LinearRegression(),


Is it not a bit imprudent that the same call can return both classifiers and regressors? Not a biggy but still. More on this later.

Author

Interesting point. I tend to agree that it could be better to separate regressors from classifiers.


if X.shape[1] < 2:
raise ValueError('Cannot fit model when n_features = 1')

self.clf_ = self._resolve_estimator(self.clf_name)


Here we use _resolve_estimator to get a classifier, but we do not check if it actually returns a classifier. Since the method is also capable of returning a regressor, it would be nice to have some sort of check here - or to have different calls for classification and regression respectively.

Author

Good point! I will see what I can add.


self.reg_ = self._resolve_estimator(self.reg_name)


Here we call _resolve_estimator to get a regression model, but we do not check if it is a regression model. I.e. if a user inputted the model names wrongly we could get a classifier here, no?

Author

Yep, would be nice to add checks, agree. Will see what I can add.
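
A minimal version of such a check could look like this (a sketch; resolve_checked is a hypothetical helper name, and the real guard may end up directly inside fit() instead):

# Sketch: resolve an estimator by name and verify it is of the expected kind
# before fitting; the helper name and error wording are assumptions.
from sklearn.base import is_classifier, is_regressor

def resolve_checked(resolver, name, want):
    est = resolver(name)
    ok = is_classifier(est) if want == 'classifier' else is_regressor(est)
    if not ok:
        raise ValueError(f"'{name}' does not resolve to a {want}")
    return est

# e.g. inside fit():
#   self.clf_ = resolve_checked(self._resolve_estimator, self.clf_name, 'classifier')
#   self.reg_ = resolve_checked(self._resolve_estimator, self.reg_name, 'regressor')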



####################
# HurdleRegression #
####################
class HurdleRegression(BaseEstimator):


Given how much of the script is taken up by this class, I think it should get its own class script. We might need to collect these kinds of scripts for all models used either in a utils folder or in a dedicated model_class folder

Author

Agree, had the same thought. This is a refactoring decision I would like to agree on collaboratively, so we follow it consistently.


def manual_test():


Wait, do we then have this function here and then a new class below? That seems a bit messy, but perhaps I am misunderstanding something.

Author

Agree, it should not be like this. Probably it would be best to have either a tests.py file and a viewser_models.py with the different classes, or a separate file for each model class together with its test. Again - something I would be happy to discuss and agree on.


assert y_pred_prob.shape == y_test.shape, "Probability predictions and y do not have the same shape"



New class, I think it deserves a new script - but that might just be me :)

Author

Connected to your previous comments - I would like to discuss the refactoring decisions first and then implement them.


#if __name__ == '__main__':


Why remove?

Author

I do not see why we need it. Do we plan to execute this file directly? My understanding is that we will import the class from its file in some other file, for example from Tools.models.hurdle_regression_model import HurdleRegression, in which case we do not need this line. Afaik, this line is needed only if the code in the file needs to be executed directly. Let me know if you think it is needed.
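
For completeness, the line in question is the standard Python entry-point guard; it only matters if the file is meant to be executed directly, e.g.:

# Runs manual_test() only when this file is executed directly,
# not when HurdleRegression is imported from another module.
if __name__ == '__main__':
    manual_test()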


ekaterinakuzmina commented Nov 23, 2023

Thank you for the review, @hhegre !

Why did you remove n_jobs=-2, and what is the effect on execution time?

As I also replied to Simon - I was focused more on the structure and on making it run than on what goes inside; besides, what goes inside should ideally be defined outside of the class. Good catch though - I added it back as it was before. We should discuss later how to move these into configs in a consistent way.

Please motivate the code in lines 105ff. - If y has only one unique value it might be better that the model returns an error message?

Very good idea with the error. I will work more on this part today.
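
Concretely, something along these lines called at the top of fit() (a sketch of the suggested error; the helper name and message wording are assumptions):

# Sketch: fail loudly instead of silently fitting a fallback model
# when the target has no variation.
import numpy as np

def check_target_variation(y):
    """Raise if y has fewer than two unique values."""
    if len(np.unique(y)) < 2:
        raise ValueError('y contains a single unique value; the hurdle classifier stage cannot be fit')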

.predict() should by default use predict_proba. The previous naming was not good, but behavior should not change from the previous version

Good catch. I will change them.
