Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using peppy #6

Closed
wants to merge 15 commits into from
Closed

Conversation

vreuter
Copy link

@vreuter vreuter commented Mar 14, 2019

Copy link
Contributor

@johanneskoester johanneskoester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome start! Thanks a lot. I have some comments below.

return None
if "sample" in df.columns and PEP_SAMPLE_COL in df.columns:
raise Exception("Multiple sample identifier columns present: {}".format(", ".join(["sample", PEP_SAMPLE_COL])))
return df.rename({PEP_SAMPLE_COL: "sample"}, axis=1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the name that peppy uses here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it sample_name or may it be something else? If it is sample_name, I'd rather change the access to this df in the rules to use sample_name.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, sample_name

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is exactly the same issue as the units vs subsample_name and can be solved in the same way.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johanneskoester so is this something you will change?

return go(t, n + 1, curr, acc) if h == curr else go(t, 1, h, acc + [n])
return go(names[1:], 1, names[0], []) if names else []
df.insert(1, "unit", [i for n in count_names(list(df[PEP_SAMPLE_COL])) for i in range(1, n + 1)])
return df
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the peppy dataframe does contain a column subsample_name, right? So, we can change from unit to subsample_name in the workflow.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does; OK, sounds good

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we were trying to make it so that existing workflows could work without any updates, but yeah, it would be easier to change the name. my suggestion below of a snakePEP class could also hide this away.

validate(samples, schema="../schemas/samples.schema.yaml")

units = pd.read_table(config["units"], dtype=str).set_index(["sample", "unit"], drop=False)
units = peppy_units(peppy_rename(p.sample_subannotation)).set_index(["sample", "unit"], drop=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if peppy would set the dataframe index like that itself. Is there a reason to not do it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should already be fine, or if not we can probably accommodate it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does peppy know about the unit column at all? I thought it just requires the sample_name column in the subannotation.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we added it above to accommodate snakemake... but in our typical use case, the unit column is called subsample_name -- but it is optional in generic peppy Projects...


###### Config file and sample sheets #####
configfile: "config.yaml"
p = Project("prjcfg.yaml")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I get a dictionary with the config contents from the Project object? Then, I would introduce a new Snakemake directive

peppyconfig: "project.yaml"

that loads the project and makes it available as a global object peppy. From this, I would like to access samples and subannotation directly (getting rid of the boilerplate here).
However, I have doubts regarding the naming of the attributes. Why is the sample dataframe called sheet and the subannotation called sample_subannotation. Seems to be asymmetric. Why not calling the first samples and the second sample_subannotation?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you do; in fact that is basically what we're already doing... p is an attmap to be precise... so p.attribute (or p["attribute"] should give you whatever configuration options are in your project.yaml file, already... you just say p = Project('project.yaml') as we've done here.

as far as getting rid of the boilerplate... the only thing the boilerplate is doing is converting the name "unit" to "sample_subannotation" and "sample" to "sample_name". one idea I had is to create a new class, say, snakePEP, that extends peppy.Project and includes this boilerplate. You would just import this object and use it as a PEP but it would handle those name conversions.

I've raised the naming issue here: pepkit/peppy#280 -- we can probably accommodate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or one could add keyword args to Project for renaming the two? Like Project("project.yaml", sample_col="sample", sample_subannotation_col="unit")?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we just adapt the naming in this workflow. Although it feels weird to change the wildcards to sample_subannotation instead of unit. It somehow does not sound like the right name in this context.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is just because of what you're used to... or can you elaborate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing. Shouldn't the project.yaml read like this for consistency:

metadata:
    sample_table: path/to/samples.tsv
    subsample_table: path/to/subsamples.tsv

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example here, I cannot find subsample_name in the subannotation table at all.

True -- it's optional. if you don't provide it they will be indexed numerically, which is I believe what you were doing... but you can include it and then pull out subsamples by name (with get_subsample), or index them in the table with the subsample_name column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But subsample_name alone is not necessarily unique, right? In many experimental setups, it will only be unique in combination with sample_name (e.g., when encoding the lane or replicate in subsample_name).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_subsample only exists on a Sample object rather than on an entire Project, and the subsample naming should be unique within local scope/context of a single sample (i.e., units are uniquely identified by name alone when considering just one sample.) In the table, though, the unique identification would definitely be problematic unless combined with sample like you say. Here's an example

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. So, I would suggest to always do set_index(("sample_name", "subsample_name"), drop=False) on the subsample_table.

@vreuter
Copy link
Author

vreuter commented Apr 29, 2019

Closing in favor of #8

@vreuter vreuter closed this Apr 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants