Make hurdle regression work #51
base: fatalities003
Conversation
I was not able to run test_hurdle_regression.ipynb - it stopped on a ModuleNotFoundError when importing test_hurdle_regression. I may have done something stupid - please tell me how I should run this to explore how it works. Then I will look more closely at this.
Thanks for tidying up. Most of it looks good to me, but some of the changes seem to alter how the class works - this should be discussed before being implemented. Some cases I saw:
- Why did you remove n_jobs=-2, and what is the effect on execution time?
- Please motivate the code in lines 105ff. If y has only one unique value, might it be better for the model to return an error message?
- .predict() should by default use predict_proba. The previous naming was not good, but behavior should not change from the previous version (see the sketch just below this list).
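To make that last point concrete, here is a minimal sketch of the common hurdle-model convention being asked for - predict() returning the classifier's predict_proba multiplied by the regressor's conditional prediction. Method and attribute names follow the snippets quoted later in this thread; this is not the project's actual code:

def predict(self, X):
    # Expected value of a hurdle model: P(y > 0) from the fitted
    # classifier (clf_) times the fitted regressor's (reg_) estimate.
    return self.clf_.predict_proba(X)[:, 1] * self.reg_.predict(X)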
I still need to do the last bit, but this should give you something to work with
SystemUpdates/ModelDefinitions.py
Outdated
I think we are moving away from using this big comprehensive ModelDefinitions.py file, at least in its current form. In any case, you should make a model-specific config like the ones Noorain is working on. He posted an example in Slack, but the simplest thing is probably just to ask him about the specifics.
Agree 100%. And this is why it is important to work in branches and merge, so we can build on each other's work. So, instead of asking Noorain and doing what he has already done, I could pull his changes and build on top of them.
Good - out with these except for experimentation and sanity checks
Tools/ViewsEstimators.py
Outdated
'HGBRegressor': HistGradientBoostingRegressor(max_iter=200),
'HGBClassifier': HistGradientBoostingClassifier(max_iter=200),
estimators = {
    'linear': LinearRegression(),
Why is n_jobs removed?
Also, it is unfortunate that so many hyperparameters (HPs) are hardcoded here. They should ideally be in the config file, but I do realize that since there are multiple different models to choose from, this could get messy. Still, we need to find a way. This is important so that when we create a sweep file for W&B we can iterate over both different HPs and models.
Also, random_state should be defined in the config.
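For illustration, a W&B sweep over both models and HPs could look roughly like this once those values live in a config. All parameter names and values below are placeholders, not the project's actual configuration:

import wandb

# Hypothetical sweep definition iterating over estimator names and HPs
sweep_config = {
    'method': 'grid',
    'parameters': {
        'reg_name': {'values': ['linear', 'HGBRegressor']},
        'clf_name': {'values': ['HGBClassifier']},
        'max_iter': {'values': [100, 200, 400]},
        'random_state': {'value': 42},
    },
}

sweep_id = wandb.sweep(sweep_config, project='views-hurdle')  # project name is illustrative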
Agree 100%. This is something we should agree on how to implement, and then follow that agreement. I do not want to create more technical debt by just doing it the way I want if others do it in a different way :)
On n_jobs - good catch. No particular reason; my main focus was to make the thing work and refactor it. n_jobs should not be defined inside these functions anyway, as you said - it should be in a config file, so one can easily change and experiment. Will add it back for now as it was before.
Tools/ViewsEstimators.py
Outdated
self.reg_ = self._resolve_estimator(self.reg_name)
# Instantiate the regressor
self.reg_ = self._resolve_estimator(self.reg_name, random_state=42)
random_state should be defined in config
Agree 100%.
Tools/ViewsEstimators.py
Outdated
force_all_finite=True) #'allow-nan'

# Set the number of features seen during fit
self.n_features_in_ = X.shape[1]

if X.shape[1] < 2:
Surely then it should be:
if X.shape[1] == 1:
Not important but still....
Agree :) Fixed.
Tools/ViewsEstimators.py
Outdated
if X.shape[1] < 2:
    raise ValueError('Cannot fit model when n_features = 1')

self.clf_ = self._resolve_estimator(self.clf_name)
# Instantiate the classifier
self.clf_ = self._resolve_estimator(self.clf_name, random_state=42)
random_state in config
Tools/ViewsEstimators.py
Outdated
self.clf_.fit(X, y > 0)
self.clf_fi = self.clf_.feature_importances_

# Check if there are more than one unique values in y
I don't get this. If there is only one unique value in y, then something must have gone wrong way back in the data part. I think it is prudent to have these kinds of checks, but this is a bit late to do it, no? And why try to fit it anyway if the data is wrong, i.e. with the DummyClassifier?
Good question! I made some changes here already, but I will explain why I was fumbling with checking whether a training set has 1 sample only:
- I called check_estimator(hr) inside the test_hurdle_regression() function to do automatic sanity checks on the estimator. check_estimator() is an sklearn-native check and among other tests it runs a test called check_fit2d_1sample to make sure that the estimator can handle an edge case where only one sample is passed to the fitting stage. So, to pass this automatic sklearn test within my test function, I needed to do something hacky in the fit method :)
- I decided to simply delete check_estimator(hr), so I do not need a hacky work-around to pass it. We can add explicit tests for the estimator into the testing function if needed.
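For context, a minimal sketch of what that automatic check amounted to (the constructor arguments below are assumptions for illustration only):

from sklearn.utils.estimator_checks import check_estimator

from Tools.ViewsEstimators import HurdleRegression

def test_hurdle_regression():
    # Estimator names are placeholders, not the project's defaults
    hr = HurdleRegression(clf_name='logistic', reg_name='linear')
    # Runs sklearn's full battery of estimator checks,
    # including check_fit2d_1sample (fitting on a single sample)
    check_estimator(hr)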
I think there is a larger thing in regards to how we structure things in VIEWS. My preference is clearly that model classes have their own script in a model_class dir, and that util functions are collected in larger scripts in a utils dir. Config files naturally then go in a larger config dir and, lastly, the training is done in one script, placed in a training script dir, which imports what it needs from other scripts and then executes.
I think - but correct me if I am wrong - that this is a rather uncontroversial structure that makes everything more parseable.
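As a purely illustrative sketch of that layout (directory and file names are not agreed on, just an example):

model_class/hurdle_regression.py      # one script per model class
utils/views_estimator_utils.py        # shared helper functions
config/hurdle_regression_config.py    # model-specific config
training/train_model.py               # imports the above and executes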
Tools/ViewsEstimators.py
Outdated
# Check if there are more than one unique values in y
# and if yes, fit the classifier to X, so y that is > 0, becomes 1
# and if not, it is 0
if len(np.unique(y)) > 1:
Or am I misunderstanding this completely? Is y here the indicator Håvard wants to bring in? Which I thought was for the split-first thing (but what do I know?). In any case, y should only ever be the target feature - unless I'm completely misunderstanding this (which happens for sure).
Apologies for the confusion - you are right, this does not have anything to do with the split-thing. The split-thing is in a different class. This part is related to the previous hacky thing with the check_fit2d_1sample test. When I had check_estimator() testing in the test_hurdle_regression() function, it ran the test for the edge case with 1 sample in the training data, and in this case there is only 1 target value :) So, again, I needed a work-around to pass it.
Since I decided to simply not run the automatic check_estimator() in test_hurdle_regression(), the part that caused confusion is deleted.
On the structure - your suggestion makes perfect sense to me, will implement it and we will see how it goes.
""" Lookup table for supported estimators. | ||
This is necessary because sklearn estimator default arguments | ||
must pass equality test, and instantiated sub-estimators are not equal. """ | ||
|
||
funcs = {'linear': LinearRegression(), |
Is it not a bit imprudent that the same call can return both classifiers and regressors? Not a biggy but still. More on this later.
Interesting point. I tend to agree that it could be better to separate regressors from classifiers.
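To illustrate one way that separation could look, here is a minimal sketch with dedicated lookups and resolvers; the estimator names and hyperparameters are placeholders, not the project's actual tables:

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import HistGradientBoostingRegressor, HistGradientBoostingClassifier

# Hypothetical split of the single funcs lookup into two dedicated tables
REGRESSORS = {
    'linear': LinearRegression(),
    'HGBRegressor': HistGradientBoostingRegressor(max_iter=200),
}
CLASSIFIERS = {
    'logistic': LogisticRegression(),
    'HGBClassifier': HistGradientBoostingClassifier(max_iter=200),
}

def resolve_regressor(name: str):
    # Fails loudly if a classifier (or unknown) name is passed
    if name not in REGRESSORS:
        raise KeyError(f"'{name}' is not a supported regressor")
    return REGRESSORS[name]

def resolve_classifier(name: str):
    if name not in CLASSIFIERS:
        raise KeyError(f"'{name}' is not a supported classifier")
    return CLASSIFIERS[name]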
if X.shape[1] < 2:
    raise ValueError('Cannot fit model when n_features = 1')

self.clf_ = self._resolve_estimator(self.clf_name)
Here we use _resolve_estimator to get a classifier, but we do not check whether it actually returns a classifier. Since the method is also capable of returning a regressor, it would be nice with some sort of check here - or to have different calls for classification and regression respectively.
Good point! I will see what I can add.
self.reg_ = self._resolve_estimator(self.reg_name)
Here we call _resolve_estimator to get a regression model, but we do not check if it is a regression model. I.e. if a user inputted the model names wrongly, we could get a classifier here, no?
Yep, would be nice to add checks, agree. Will see what I can add.
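One lightweight way to add such a check, sketched with sklearn's built-in is_classifier/is_regressor helpers (the helper name and error messages are illustrative, not existing project code):

from sklearn.base import is_classifier, is_regressor

def check_estimator_kind(estimator, name: str, want_classifier: bool):
    # Raise early if the resolved estimator is not of the expected kind
    ok = is_classifier(estimator) if want_classifier else is_regressor(estimator)
    if not ok:
        kind = 'classifier' if want_classifier else 'regressor'
        raise ValueError(f"'{name}' does not resolve to a {kind}")
    return estimator

# Hypothetical use inside fit():
#   self.clf_ = check_estimator_kind(self._resolve_estimator(self.clf_name), self.clf_name, True)
#   self.reg_ = check_estimator_kind(self._resolve_estimator(self.reg_name), self.reg_name, False)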
Tools/ViewsEstimators.py
Outdated
####################
# HurdleRegression #
####################
class HurdleRegression(BaseEstimator):
Given how much of the script is taken up by this class, I think it should get its own class script. We might need to collect these kinds of scripts for all models used, either in a utils folder or in a dedicated model_class folder.
Agree, had the same thought. This is a refactoring decision I would like to agree on collaboratively, so we follow it consistently.
def manual_test():
Wait, do we then have a test function here and then a new class below? That seems a bit messy, but perhaps I am misunderstanding something.
Agree, it should not be like this. Probably it would be best to have either a tests.py file and a viewser_models.py with the different classes, or a separate file for each model class with its test. Again - something I would be happy to discuss and agree on.
Tools/ViewsEstimators.py
Outdated
assert y_pred_prob.shape == y_test.shape, "Probability predictions and y do not have the same shape"
New class, I think it deserves a new script - but that might just be me :)
Connected to your previous comments - I would like to discuss the refactoring decisions first and then implement.
#if __name__ == '__main__':
Why remove?
I do not see why we need it. Do we plan to execute this file directly? My understanding is that we will import a class from its file in some other file, for example from Tools.models.hurdle_regression_model import HurdleRegression - then we do not need this line. Afaik, this line is needed only if the code in the file needs to be executed directly. Let me know if you think it is needed.
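For reference, the removed guard would typically just wrap the manual test so the module can still be run directly as a script (a sketch, assuming the manual_test() function shown above):

if __name__ == '__main__':
    manual_test()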
Thank you for the review, @hhegre!
As I also replied to Simon - I was focused more on the structure and making it run rather than what goes inside, plus what goes inside should ideally be defined outside of the class. Good catch though - added them back as they were before. We should discuss later how to move them into configs in a consistent way.
Very good idea with the error. I will work more on this part today.
Good catch. I will change them.
This PR adds the following changes:
- Fixed the HurdleRegression class in ViewsEstimators.py, so it works;
- Added a test_hurdle_regression notebook that can be run to test the changes;
- Added a FixedFirstSplitRegression class in ViewsEstimators.py that can split the data into two subsets by the spllit_by feature. Since the spllit_by feature is not the target Y, the target Y stays the same for both subsets, meaning that without any additional transformation, only regressors or only classifiers can be applied to both subsets. If we want to change the target Y based on the spllit_by feature, we can apply a rule, transform the data and then feed it into the HurdleRegression class. The use cases here require clarification.
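As a rough illustration of the intended usage of the fixed class (data and estimator names below are placeholders; the clf_name/reg_name arguments follow the attributes visible in the snippets above):

import numpy as np
from Tools.ViewsEstimators import HurdleRegression

# Toy zero-inflated target, the kind of data a hurdle model is meant for
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(rng.random(200) < 0.7, 0.0, rng.gamma(2.0, 2.0, size=200))

hr = HurdleRegression(clf_name='logistic', reg_name='linear')  # names are assumptions
hr.fit(X, y)
y_pred = hr.predict(X)  # combined hurdle prediction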