Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Peppy #8

Open
wants to merge 25 commits into
base: main
Choose a base branch
from
Open

Peppy #8

wants to merge 25 commits into from

Conversation

vreuter
Copy link

@vreuter vreuter commented Apr 29, 2019

@johanneskoester @nsheff

Note that units_peppy.tsv is included not for necessity, but to show how a peppy user may encode a subsample/units table, and have it still function with the workflow.

@vreuter vreuter mentioned this pull request Apr 29, 2019
@vreuter
Copy link
Author

vreuter commented Apr 29, 2019

pepkit/peppy#286

@nsheff
Copy link

nsheff commented Apr 29, 2019

that is a thing of beauty. 🍦

validate(samples, schema="../schemas/samples.schema.yaml")

units = pd.read_table(config["units"], dtype=str).set_index(["sample", "unit"], drop=False)
units = p.subsample_table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice! What about always converting the subsample_table index into string inside peppy (in order to get rid of the line below)?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi johannes, it looks like this has been done (the line is now gone).

@nsheff
Copy link

nsheff commented May 3, 2019

hey @johanneskoester what do you think? What's the next step?

@johanneskoester
Copy link
Contributor

So, as I have seen from your repo, there is now a SnakeProject subclass? Can you tell me what is the difference to the main Project class? If possible I would love to not have a special case, but maybe there is some reason I do not see.

@vreuter
Copy link
Author

vreuter commented May 3, 2019

So, as I have seen from your repo, there is now a SnakeProject subclass? Can you tell me what is the difference to the main Project class? If possible I would love to not have a special case, but maybe there is some reason I do not see.

It handles naming differences you requested removed from the workflow.

@nsheff
Copy link

nsheff commented May 8, 2019

Hey @johanneskoester -- it would be possible to not use the SnakeProject class if you instead subscribed to the original PEP names (like subsample instead of unit). All the rest of the functionality is in the main Project class.

What do you think? Since we used different terminology to refer to these things, we have to either have an adapter, or we have to change one of the standards -- in this case we implemented an adapter.

@johanneskoester
Copy link
Contributor

Changing from unit to subsample and the other colname changes would be fine for me. Regarding the rest:

  • this is not needed for Snakemake. In best practice workflows we ensure presence of certain columns via JSON schema validation.
  • this I would love to see in the original peppy project. Are there arguments against it? Could be also configurable, in the sense of Project(..., subsample_index_cols=[...]).

@vreuter
Copy link
Author

vreuter commented May 9, 2019

Changing from unit to subsample and the other colname changes would be fine for me. Regarding the rest:

* [this](https://github.com/pepkit/peppy/blob/d2c23497077809b25cd37d70cbf2d4f25e4aca25/peppy/snake_project.py#L64) is not needed for Snakemake. In best practice workflows we ensure presence of certain columns via JSON schema validation.

OK

* [this](https://github.com/pepkit/peppy/blob/d2c23497077809b25cd37d70cbf2d4f25e4aca25/peppy/snake_project.py#L69) I would love to see in the original peppy project. Are there arguments against it? Could be also configurable, in the sense of `Project(..., subsample_index_cols=[...])`.

I'm fine with setting the index. Identifier column name(s) shouldn't be configurable, though. The solution in place bubbles knowledge of the schema with which standard data (like sample name and subsample name/index) may be referenced up to the level of the types themselves (Project, SnakeProject). Furthermore, we'd like to couple the column name used to denote something like sample identifier to an actual attribute on Sample objects that are created and used from the metadata sheets, so parameterizing the sheet column would break that coupling.

@nsheff
Copy link

nsheff commented May 9, 2019

For your first point: we are also planning to adopt the snakemake (or similar) validation for PEPs as well. So, sounds good. we should address that soon.

For your second point: I agree we should set the index in peppy and also I think that @vreuter is right that making the column name parameterizable could lead to some downstream trouble... one solution, maybe: could we take whatever column name the user provides and internally just standardize it to 'sample_name'? would solve your concern I think

@johanneskoester
Copy link
Contributor

I think we have a misunderstanding here. With the parameter to Project I did not mean to make the column name parameterizable, but rather wanted to give the user the possibility to select over which columns the index shall be created (e.g., sample_name by default for the sample table, sample_name + subsample_name by default for the subsample_table).

@vreuter
Copy link
Author

vreuter commented May 9, 2019

I think we have a misunderstanding here. With the parameter to Project I did not mean to make the column name parameterizable, but rather wanted to give the user the possibility to select over which columns the index shall be created (e.g., sample_name by default for the sample table, sample_name + subsample_name by default for the subsample_table).

Indeed! Can't speak for @nsheff , but I'd misinterpreted your suggestion. This sounds fine, but to minimally burden Snakemake workflow authors, would you want the default for the second component (besides name) of indexing for subsample_table indexing to be unit rather than subsample_name? (Is column naming in this workflow's units.tsv typical?)

@nsheff
Copy link

nsheff commented May 28, 2019

@johanneskoester, @vreuter -- where do we stand on this? I agree we can easily make this parameterizable. I think we should use subsample_name and not unit as the default. but feel free to speak out if you disagree

@vreuter
Copy link
Author

vreuter commented May 28, 2019

The selection of columns to index, or certain names? Indexing flexibility is fine, but naming flexibility of particularly meaningful columns isn’t, at least in terms of how they’re referenced once the object is built. So yeah we could accept, for example, unit and subsample_name but have them referenced the same

@nsheff
Copy link

nsheff commented May 28, 2019

indexes are parameterize, not names.

With the parameter to Project I did not mean to make the column name parameterizable, but rather wanted to give the user the possibility to select over which columns the index shall be created (e.g., sample_name by default for the sample table, sample_name + subsample_name by default for the subsample_table).

I guess you can make it unit in the snakeproject object if you want. @johanneskoester can update if he wants otherwise I guess

@johanneskoester
Copy link
Contributor

I am fine with renaming unit to subsample_name in our pipelines here. I would like to not have a snakeproject at all. If I am not wrong, once the indexing can be configured upon instantiating Project, there should be no need for a snakeproject.

@nsheff nsheff mentioned this pull request Jun 4, 2019
@johanneskoester
Copy link
Contributor

With the new release, can you give me a quick pointer how to adapt this PR to the new capabilities? Or do you want to do it yourself?

@vreuter
Copy link
Author

vreuter commented Jun 21, 2019

@johanneskoester it looks like the latest release of Snakemake on PyPI targets Python 3.5 or at least supports it, but that this workflow uses string formatting introduced in PEP 498 that only targets Python 3.6?

@johanneskoester
Copy link
Contributor

@johanneskoester it looks like the latest release of Snakemake on PyPI targets Python 3.5 or at least supports it, but that this workflow uses string formatting introduced in PEP 498 that only targets Python 3.6?

Good catch, that was not intended. Fixed in master.

prjcfg.yaml Outdated
sample_annotation: samples.tsv
sample_subannotation: units.tsv
sample_table: samples.tsv
sample_subtable: units.tsv
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be subsample_table, not sample_subtable, right?

@stolarczyk stolarczyk mentioned this pull request Mar 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants