Replace category with regression formula in regress_samples #30

Open
mortonjt opened this issue Jul 14, 2017 · 4 comments

Comments


mortonjt commented Jul 14, 2017

Improvement Description
It is very useful to simultaneously analyze multiple metadata categories and their interactions.

Proposed Behavior
Fortunately, this is just a two-line change, but it would give the user much more flexibility when building comprehensive models.

References
(see here)

@nbokulich
Member

using such a formula is not supported by the regression methods in scikit-learn — am I missing something @mortonjt ?

Instead, all feature data are used to build the model. Only a single metadata category can be predicted at once. Multilabel prediction is not supported (in a useful way) by scikit-learn.

@mortonjt
Author

That's pretty easy to do -- we can enable it by using patsy. This has not traditionally been supported in scikit-learn, but it is supported in statsmodels, and it is standard practice when analyzing datasets in R.
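To make the patsy suggestion concrete, here is a minimal sketch of how a formula expands against a metadata table (assuming patsy and pandas are installed; the column names are invented for illustration):

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical per-sample metadata; column names are made up for illustration
metadata = pd.DataFrame({
    "age": [25.0, 40.0, 33.0, 60.0],
    "sex": ["F", "M", "F", "M"],
})

# Expand the R-style formula into a numeric design matrix containing
# an intercept, a sex dummy column, age, and an age-by-sex interaction column
design = dmatrix("age * sex", metadata, return_type="dataframe")
print(design.columns.tolist())
```

The `*` operator is patsy/R shorthand for "both main effects plus their interaction," so no columns have to be built by hand.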

The multilabel prediction is actually supported in scikit-learn (I know! I was surprised too).
See the input types for lasso regression and random forests regression
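For reference, both of those scikit-learn estimators accept a 2-D target array directly. A minimal sketch with random data (shapes are arbitrary, chosen only for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(20, 5)   # stand-in feature table: 20 samples x 5 features
Y = rng.rand(20, 3)   # three numeric targets predicted at once

# Both regressors accept a 2-D y; predictions come back with one column per target
lasso = Lasso(alpha=0.1).fit(X, Y)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, Y)
print(lasso.predict(X).shape, forest.predict(X).shape)
```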

This could be a huge improvement in usability over what is currently offered in scikit-learn, and it would also seriously open the door to building complex models.

@nbokulich
Member

sorry, I meant multioutput prediction. scikit-learn does support basic multioutput but this is merely training multiple independent regressors, and does not predict the relationship among targets (i.e., metadata categories). There is a little more discussion of this in #15 .

I like the suggestion to use patsy for building regression formulae, but I don't think this is feasible here. The intended features (independent variables) are a feature table, which would most likely consist of many, many features, NOT metadata. Metadata categories are the targets (dependent variables). Building a formula with hundreds of features would be arduous. I am familiar with the use of such formulae in R and patsy, but there metadata are used as the independent variables and features/observed data as the dependent variables, which is the reverse of what q2-sample-classifier is meant to perform. @mortonjt could you please provide a little more clarification on how you would imagine these formulae being used?

@mortonjt
Author

Sorry -- let me try to clarify.

I totally agree -- you don't want to use regression formulas for the features (i.e. OTUs); that would get unwieldy very quickly. But you can use the regression formulas to model the interactions between the outputs.

Here is an example. Say that we wanted to use lasso regression. We can use something as follows

```python
# proposed API sketch -- `lasso` is a stand-in for a regressor wrapper
res = lasso()
res.fit('age * sex + disease', Y=metadata, X=otu_table)
```

where age, sex, and disease are metadata variables. The formula allows you to create a new, expanded output matrix that explicitly tests for the interaction effect between age and sex. It also allows you to test multiple categories simultaneously. In this particular case, it tests how well each of the following can be predicted:

  1. age
  2. sex
  3. age * sex interaction
  4. disease
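Putting the pieces together, a hedged sketch of what this could look like internally (assuming patsy, pandas, and scikit-learn; the table shapes and column names are invented, and scikit-learn's real signature is `fit(X, y)` rather than the formula-first pseudocode above):

```python
import numpy as np
import pandas as pd
from patsy import dmatrix
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)

# Hypothetical inputs: a tiny feature (OTU) table and matching sample metadata
otu_table = pd.DataFrame(rng.rand(4, 10))
metadata = pd.DataFrame({
    "age": [25.0, 40.0, 33.0, 60.0],
    "sex": ["F", "M", "F", "M"],
    "disease": [0.0, 1.0, 0.0, 1.0],
})

# patsy expands the formula into the multi-column target matrix described above:
# columns for age, sex, the age:sex interaction, and disease
Y = dmatrix("age * sex + disease", metadata, return_type="dataframe")

# A 2-D Y makes this a multioutput fit: one prediction column per formula term
model = Lasso(alpha=0.1).fit(otu_table, Y)
predictions = model.predict(otu_table)
print(predictions.shape)
```

The per-column prediction accuracy would then indicate how well each term (age, sex, their interaction, disease) is predicted from the feature table.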
