IMP: Add *_samples_ncv pipelines #177
@Oddant1 thanks for putting this together. This is a good start: you have the basic workflow details in place, but a lot more work needs to be done. See the in-line comments.
If you plan to proceed, please also:
- register these actions in plugin_setup.py
- write a basic test for each pipeline (just to make sure they work). Create some toy arrays for this with numpy; do not use the real datasets we currently have in there, since we need to cut down on runtime.
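As a rough illustration of the toy-array suggestion above, the sketch below builds a small random count matrix plus matching labels with numpy. The helper name `make_toy_data`, the sizes, and the label values are all hypothetical; wrapping the result in a `biom.Table` and QIIME 2 metadata columns is left out and would still need to be done in the real test.

```python
import numpy as np


def make_toy_data(n_samples=20, n_features=5, seed=0):
    """Build a tiny synthetic dataset (hypothetical helper, not part of the plugin)."""
    rng = np.random.default_rng(seed)
    # Features in rows, samples in columns (assumed orientation for a feature table).
    counts = rng.integers(0, 50, size=(n_features, n_samples))
    sample_ids = [f"s{i}" for i in range(n_samples)]
    feature_ids = [f"f{j}" for j in range(n_features)]
    # Alternating two-class labels for the classifier pipeline.
    labels = np.array(["a", "b"])[np.arange(n_samples) % 2]
    # Continuous values for the regressor pipeline.
    values = rng.normal(size=n_samples)
    return counts, sample_ids, feature_ids, labels, values


counts, sample_ids, feature_ids, labels, values = make_toy_data()
```

Keeping the arrays this small should keep each pipeline test fast; the exact sizes needed for the cross-validation folds to work are an assumption to check against the plugin.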
@@ -323,6 +323,68 @@ def regress_samples_ncv(
    return y_pred, importances


def regress_samples_ncv_piepline(
misspelled pipeline
@@ -323,6 +323,68 @@ def regress_samples_ncv(
    return y_pred, importances


def regress_samples_ncv_piepline(
        ctx, table: biom.Table, metadata: qiime2.NumericMetadataColumn,
pipelines should not use type annotations like this
see classify_samples for an example
        estimator: str = defaults['estimator_r'], stratify: str = False,
        parameter_tuning: bool = False,
        missing_samples: str = defaults['missing_samples']
) -> (pd.Series, pd.DataFrame):
these outputs do not match the returns.
But more importantly, pipelines should not include return annotations. See classify_samples for an example
        missing_samples: str = defaults['missing_samples']
) -> (pd.Series, pd.DataFrame):

    y_pred, importances, probabilities = nested_cross_validation(
get the classify_samples_ncv action; do not call nested_cross_validation directly.
def classify_samples_ncv_pipeline(
        ctx, table: biom.Table, metadata: qiime2.CategoricalMetadataColumn,
remove type annotations
        missing_samples: str = defaults['missing_samples']
) -> (pd.Series, pd.DataFrame, pd.DataFrame):

    y_pred, importances, probabilities = nested_cross_validation(
use get action, do not call nested_cross_validation directly
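A minimal sketch of what this comment seems to be asking for: a pipeline function with no type annotations that obtains `classify_samples_ncv` through `ctx.get_action` and chains its outputs into the existing visualizers. `FakeContext` is a hypothetical stand-in so the sketch runs without QIIME 2 installed; the parameter defaults, the number of outputs per action, and the returned tuple are assumptions based on the diff, not the plugin's confirmed interface.

```python
class FakeContext:
    """Hypothetical stand-in for the pipeline context, for illustration only."""

    def __init__(self):
        self.requested = []  # (plugin, action) pairs requested via get_action

    def get_action(self, plugin, action):
        self.requested.append((plugin, action))
        # Output counts per action are assumptions inferred from the diff.
        n_outputs = {'classify_samples_ncv': 3,
                     'confusion_matrix': 1,
                     'heatmap': 2}[action]

        def run(*args, **kwargs):
            # Return placeholder strings instead of real artifacts.
            return tuple(f'{action}:output{i}' for i in range(n_outputs))
        return run


# No type or return annotations on the pipeline itself, per the review.
def classify_samples_ncv_pipeline(ctx, table, metadata,
                                  parameter_tuning=False,
                                  missing_samples='error'):
    ncv = ctx.get_action('sample_classifier', 'classify_samples_ncv')
    confusion = ctx.get_action('sample_classifier', 'confusion_matrix')
    heat = ctx.get_action('sample_classifier', 'heatmap')

    # Nested cross-validation is delegated to the registered action;
    # there is no separate split or fit step.
    y_pred, importances, probabilities = ncv(
        table, metadata, parameter_tuning=parameter_tuning,
        missing_samples=missing_samples)

    accuracy_results, = confusion(y_pred, metadata, probabilities,
                                  missing_samples='ignore')
    _heatmap, _ = heat(table, importances, sample_metadata=metadata,
                       group_samples=True, missing_samples='ignore')

    return y_pred, importances, probabilities, accuracy_results, _heatmap


ctx = FakeContext()
results = classify_samples_ncv_pipeline(ctx, 'toy-table', 'toy-metadata')
```

In the real plugin the type information would presumably live in the registration in plugin_setup.py rather than in the function signature, which is why the function itself carries no annotations.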
        stratify=True, parameter_tuning=parameter_tuning, classification=False,
        scoring=accuracy_score, missing_samples=missing_samples)

    split = ctx.get_action('sample_classifier', 'split_table')
we do NOT want split or fit here; that's why classify_samples_ncv should be called. Remove these.
    X_train, X_test = split(table, metadata, test_size, random_state,
                            stratify=True, missing_samples=missing_samples)

    sample_estimator, importance = fit(
remove
    confusion = ctx.get_action('sample_classifier', 'confusion_matrix')
    heat = ctx.get_action('sample_classifier', 'heatmap')

    X_train, X_test = split(table, metadata, test_size, random_state,
remove
    accuracy_results, = confusion(y_pred, metadata, probabilities,
                                  missing_samples='ignore')
    _heatmap, _ = heat(table, importance, sample_metadata=metadata,
                       group_samples=True, missing_samples=missing_samples)
I think you can set missing_samples='ignore' here, since the importances should always be ≤ the table features. Same in the classify_samples pipeline, if I did not catch that before.
This PR is being closed and issue #160 is being deferred to @nbokulich
Closes #160. Initial stab at implementing the *_samples_ncv pipelines