Replace category with regression formula in regress_samples
#30
Comments
Using such a formula is not supported by the regression methods in scikit-learn; am I missing something, @mortonjt? Instead, all feature data are used to build the model, and only a single metadata category can be predicted at a time. Multilabel prediction is not supported (in a useful way) by scikit-learn.
That's pretty easy to do -- we can enable it by using patsy. Formula support has not traditionally been offered in scikit-learn, but it is available in statsmodels and is standard when analyzing datasets in R. Multilabel prediction is actually supported in scikit-learn (I know! I was surprised too). This could be a huge improvement in usability over what scikit-learn currently offers, and it would also open the door to building more complex models.
Sorry, I meant multioutput prediction. scikit-learn does support basic multioutput regression, but this merely trains multiple independent regressors and does not model the relationships among targets (i.e., metadata categories). There is a little more discussion of this in #15. I like the suggestion to use patsy for building regression formulae, but I don't think it is feasible here. The intended features (independent variables) are a feature table, which would most likely consist of many, many features, NOT metadata. Metadata categories are the targets (dependent variables). Building a formula with hundreds of features would be arduous. I am familiar with the use of such formulae in R and patsy, but only when metadata are used as the independent variables and features/observed data as the dependent variables, which is the reverse of the situation here.
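For reference, here is a minimal sketch (synthetic data, hypothetical random-forest base estimator) of what that basic multioutput support looks like: `MultiOutputRegressor` just clones and fits one independent regressor per target column, so it cannot capture relationships among the targets.

```python
# Minimal sketch of scikit-learn's basic multioutput support. Data are
# synthetic placeholders; the base estimator choice is arbitrary.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.poisson(5, size=(20, 50))   # feature table (e.g. OTU counts)
Y = rng.normal(size=(20, 3))        # three metadata categories as targets

# One regressor is cloned and fit per column of Y, independently.
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=10, random_state=0))
model.fit(X, Y)
print(len(model.estimators_))       # 3 -- one independent estimator per target
```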
Sorry -- let me try to clarify. I totally agree: you don't want to use regression formulas for the features (i.e., OTUs); that will get disgusting very quickly. But you can use regression formulas to model the interactions between the outputs. Here is an example: say that we wanted to use lasso regression.
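A minimal sketch of the idea, assuming hypothetical metadata columns `pH` and `temperature` and synthetic data: patsy's `dmatrix` builds the target matrix from a formula, including an interaction term, and scikit-learn's `Lasso` accepts a 2-D `y`, so every target column is fit against the full feature table in one call.

```python
# Minimal sketch, not a definitive implementation: the formula and the
# metadata column names (pH, temperature) are hypothetical examples, and the
# data are synthetic placeholders.
import numpy as np
import pandas as pd
from patsy import dmatrix
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Feature table: samples x OTUs (the independent variables).
X = pd.DataFrame(rng.poisson(5, size=(20, 50)),
                 columns=[f"OTU{i}" for i in range(50)])

# Metadata columns to predict (hypothetical names).
metadata = pd.DataFrame({"pH": rng.normal(7, 1, 20),
                         "temperature": rng.normal(25, 3, 20)})

# The formula defines the targets, including the pH:temperature interaction;
# "- 1" drops patsy's intercept column since Lasso fits its own intercept.
Y = dmatrix("pH + temperature + pH:temperature - 1", metadata,
            return_type="dataframe")

# Lasso accepts a 2-D y, so all target columns are fit at once.
model = Lasso(alpha=0.1)
model.fit(X, Y)
print(model.coef_.shape)   # (n_targets, n_features) = (3, 50)
```

where `model.coef_` then holds one row of feature weights per target column, including the interaction term, which is exactly the kind of flexibility a formula would expose to the user.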
Improvement Description
It is very useful to simultaneously analyze multiple metadata categories and their interactions.
Proposed Behavior
Fortunately, this is just a two-line change, but it would give the user much more flexibility when building comprehensive models.
References
(see here)