Replace category with regression formula in regress_samples #30

Open
mortonjt opened this issue Jul 14, 2017 · 4 comments

Comments


mortonjt commented Jul 14, 2017

Improvement Description
It is very useful to simultaneously analyze multiple metadata categories and their interactions.

Proposed Behavior
Fortunately, this is just a two-line change, but it would give the user much more flexibility when building comprehensive models.

References
(see here)

@nbokulich
Member

using such a formula is not supported by the regression methods in scikit-learn — am I missing something @mortonjt ?

Instead, all feature data are used to build the model. Only a single metadata category can be predicted at once. Multilabel prediction is not supported (in a useful way) by scikit-learn.

@mortonjt
Author

That's pretty easy to do -- we can enable it by using patsy. This has not traditionally been supported in scikit-learn, but it is supported in statsmodels, and it is standard practice when analyzing datasets in R.
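To make the patsy suggestion concrete, here is a minimal sketch of how a formula expands against a metadata table (assuming patsy and pandas are installed; the column names are invented for illustration):

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical per-sample metadata; column names are made up for illustration
metadata = pd.DataFrame({
    "age": [25.0, 40.0, 33.0, 60.0],
    "sex": ["F", "M", "F", "M"],
})

# Expand the R-style formula into a numeric design matrix containing
# an intercept, a sex dummy column, age, and an age-by-sex interaction column
design = dmatrix("age * sex", metadata, return_type="dataframe")
print(design.columns.tolist())
```

The `*` operator is patsy/R shorthand for "both main effects plus their interaction," so no columns have to be built by hand.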

The multilabel prediction is actually supported in scikit-learn (I know! I was surprised too).
See the input types for lasso regression and random forests regression
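For reference, both of those scikit-learn estimators accept a 2-D target array directly. A minimal sketch with random data (shapes are arbitrary, chosen only for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(20, 5)   # stand-in feature table: 20 samples x 5 features
Y = rng.rand(20, 3)   # three numeric targets predicted at once

# Both regressors accept a 2-D y; predictions come back with one column per target
lasso = Lasso(alpha=0.1).fit(X, Y)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, Y)
print(lasso.predict(X).shape, forest.predict(X).shape)
```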

This could be a huge improvement in usability over what is currently offered in scikit-learn, and it would also seriously open the door to building complex models.

@nbokulich
Member

sorry, I meant multioutput prediction. scikit-learn does support basic multioutput but this is merely training multiple independent regressors, and does not predict the relationship among targets (i.e., metadata categories). There is a little more discussion of this in #15 .

I like the suggestion to use patsy for building regression formulae, but I don't think this is feasible here. The intended features (independent variables) are a feature table, which would most likely consist of many, many features, NOT metadata. Metadata categories are the targets (dependent variables). Building a formula with hundreds of features would be arduous. I am familiar with the use of such formulae in R and patsy, but there metadata are used as the independent variables and features/observed data as the dependent variables, which is the reverse of what q2-sample-classifier is meant to perform. @mortonjt could you please provide a little more clarification on how you would imagine these formulae being used?

@mortonjt
Author

Sorry -- let me try to clarify.

I totally agree -- you don't want to use regression formulas for the features (i.e. OTUs); that would get unwieldy very quickly. But you can use the regression formulas to model the interactions between the outputs.

Here is an example. Say that we wanted to use lasso regression. We can use something as follows

```python
# proposed API sketch -- `lasso` is a stand-in for a regressor wrapper
res = lasso()
res.fit('age * sex + disease', Y=metadata, X=otu_table)
```

where age, sex, and disease are metadata variables. The formula allows you to create a new, expanded output matrix that explicitly tests for the interaction effect between age and sex. It also allows you to test multiple categories simultaneously. In this particular case, it tests how well each of the following can be predicted:

  1. age
  2. sex
  3. age * sex interaction
  4. disease
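Putting the pieces together, a hedged sketch of what this could look like internally (assuming patsy, pandas, and scikit-learn; the table shapes and column names are invented, and scikit-learn's real signature is `fit(X, y)` rather than the formula-first pseudocode above):

```python
import numpy as np
import pandas as pd
from patsy import dmatrix
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)

# Hypothetical inputs: a tiny feature (OTU) table and matching sample metadata
otu_table = pd.DataFrame(rng.rand(4, 10))
metadata = pd.DataFrame({
    "age": [25.0, 40.0, 33.0, 60.0],
    "sex": ["F", "M", "F", "M"],
    "disease": [0.0, 1.0, 0.0, 1.0],
})

# patsy expands the formula into the multi-column target matrix described above:
# columns for age, sex, the age:sex interaction, and disease
Y = dmatrix("age * sex + disease", metadata, return_type="dataframe")

# A 2-D Y makes this a multioutput fit: one prediction column per formula term
model = Lasso(alpha=0.1).fit(otu_table, Y)
predictions = model.predict(otu_table)
print(predictions.shape)
```

The per-column prediction accuracy would then indicate how well each term (age, sex, their interaction, disease) is predicted from the feature table.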
