Do we need a more general mechanism for selecting / identifying frequency sets #45

mpresteg · 2017-11-13T22:09:57Z

This started with contemplating how to assign population and cohort information to a frequency set. For example, a population could be associated with a race group (HIS, AFA, CAU, etc.), which could transcend cohort boundaries. A cohort could be associated with a set of donors recruited to a registry (BTM / NMDP, etc.), which could transcend population boundaries (race group in this case). This implies the need for a many-to-many relationship between cohort and population. Then, if we consider other means of defining a population (geography, etc.), it may be useful to have more general criteria by which a frequency set can be tagged / annotated (like selection criteria). Perhaps the label could serve this purpose, but then it may be useful to categorize a set of 'label types'. So, the question is: how can we sufficiently define a means of annotating any frequency set, such that the frequency sets are selectable given any available/reasonable criteria (e.g. shoe size, eye color, etc.)? This may be better discussed in person (DaSH 7 - 2017!!). @fscheel @sauter @mmaiers-nmdp @HofmannJ @mhalagan-nmdp @hpeberhard
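
To make the 'label types' idea concrete, here is a rough sketch of what tag-based selection might look like. Everything in it is hypothetical: the `FrequencySet` class, the `select` helper, and the label type names (`race_group`, `registry`) are illustrations only, not the PHYCUS schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class FrequencySet:
    """Hypothetical frequency set carrying free-form annotations."""
    set_id: str
    # Each annotation is a (label type, value) pair, e.g. ("race_group", "CAU").
    annotations: Set[Tuple[str, str]] = field(default_factory=set)


def select(sets: List[FrequencySet], **criteria: str) -> List[FrequencySet]:
    """Return the sets that carry every requested (label type, value) pair."""
    wanted = set(criteria.items())
    return [s for s in sets if wanted <= s.annotations]


# Made-up example data: race group and registry are tagged independently,
# so either one can cut across the other (the many-to-many case).
sets = [
    FrequencySet("fs1", {("race_group", "CAU"), ("registry", "NMDP")}),
    FrequencySet("fs2", {("race_group", "HIS"), ("registry", "NMDP")}),
    FrequencySet("fs3", {("race_group", "CAU"), ("registry", "BTM")}),
]

nmdp_sets = select(sets, registry="NMDP")
cau_btm_sets = select(sets, race_group="CAU", registry="BTM")
```

Under this sketch, adding a new selection criterion (geography, say) only means introducing a new label type; the structure itself does not change.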

mmaiers-nmdp · 2017-11-14T10:33:34Z

The intent of having such a simple model (one-to-many between population and cohort) was to avoid falling into some of the deep problems of defining what a population is. I would prefer to get PHYCUS running with a minimal structure for "population" and start loading it with real data and making applications (predictive match, imputation, haplostats) work against it. If we have an actual use case where a more complex model is required, then we can add additional annotation capabilities or expand the basic model.

sauter · 2017-11-14T12:26:54Z

I agree with Martin here.

The definition of a population is clearly out of scope for PHYCUS. Taken generally in this context, a population is a group of people that satisfies Hardy-Weinberg (diploid organisms, only sexual reproduction, random mating, infinitely large population, no generation overlap, allele frequencies equal between sexes, no migration, no mutation, no selection). While the first two are true for humans, the rest are not so clear or even wrong. Hence, any group of people is only an approximation to a population (which it should be in order to use EM algorithms). To what extent a specific group satisfies the population criteria has to be evaluated for each group individually. A "race" is a way to roughly define a group of people that is a good candidate, if you will, for a population. For some countries, nationality will do as well. But there is no universally established set of populations of the world, nor any kind of hierarchy among them, although populations can have sub-populations.
Hence, for a population, the ID is somewhat less important than the accompanying description of it. To date, this is a free text field, because there is no universal set of definitions. Today, there are only two options: come up with a proper (!) description of what you understand/think your population is, or search the already existing descriptions and work out whether they apply to your population. Some day in the future, maybe, someone will be able to deduce from the descriptions whether existing populations in the service are in some kind of relationship to each other. But that's a whole different task.

The cohort concept in PHYCUS is a technical one: a cohort describes how you have sampled your population. The cohort is the description of that sampled group. In this understanding, in an ideal world the cohort would be identical to the population (and be infinitely large...), but as we are limited by resources we cannot type entire populations (except for e.g. Iceland) and hence can only use a subset of the population. This subset is the cohort. If people in a cohort belong to different populations (like 500 people from Iceland and 500 people from Congo), you should split the cohort; otherwise the EM will produce invalid estimates.
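
A small worked example of why a mixed cohort is a problem, as a sketch under made-up numbers: a single locus with two alleles, and two populations that are each in Hardy-Weinberg equilibrium but have very different allele frequencies (0.1 and 0.9 are invented for illustration, not real data). Pooling them gives far fewer heterozygotes than Hardy-Weinberg would predict from the pooled allele frequency, so the pooled cohort violates the assumption the EM relies on.

```python
def hw_genotype_freqs(p):
    """Genotype frequencies under Hardy-Weinberg for allele frequency p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def pooled_cohort(p1, p2, w1=0.5):
    """Compare observed vs. Hardy-Weinberg-expected genotypes in a pooled cohort.

    p1, p2: allele frequencies in the two source populations (illustrative).
    w1: fraction of the cohort drawn from the first population.
    """
    w2 = 1.0 - w1
    g1, g2 = hw_genotype_freqs(p1), hw_genotype_freqs(p2)
    # What is actually in the pooled cohort: a weighted mix of both groups.
    observed = {k: w1 * g1[k] + w2 * g2[k] for k in g1}
    # What Hardy-Weinberg would predict from the pooled allele frequency.
    expected = hw_genotype_freqs(w1 * p1 + w2 * p2)
    return observed, expected


# 500 + 500 people, i.e. equal weights, with made-up allele frequencies.
observed, expected = pooled_cohort(0.1, 0.9)
print("heterozygotes observed:", round(observed["Aa"], 3))
print("heterozygotes expected under HW:", round(expected["Aa"], 3))
```

With these numbers the pooled cohort contains 18% heterozygotes, while Hardy-Weinberg applied to the pooled allele frequency of 0.5 would predict 50%.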

hpeberhard · 2017-11-14T13:54:27Z

Thanks guys. I do believe this is an important question. Just a few unordered remarks:

1. I have to admit that I do not even fully understand the simple model (one to many). Which probably is due to my laziness.
2. Therefore, I am probably a follower of the "grow as we go" proposal.
3. Talking about applications sounds to me as if we are about to enter userland. Would it be worthwhile to provide/discuss our ideas/models in a high level document?

mpresteg · 2017-11-14T16:15:33Z

Grow as we go - fair enough. Re: point 3 - I assume you mean ideas/models for how we annotate / describe frequencies within the frequency service (from a pragmatic implementation point of view)...I agree.

sauter · 2017-11-14T16:29:32Z

Then we have general agreement. At least among us here....

mpresteg · 2017-12-11T15:20:30Z

This thread may move us in the direction of an (eventual) user guide. Immediate steps are to consider some real world examples of frequency sets with overlapping populations / cohorts. Immediate steps (not entire issue) assigned to @hpeberhard and @sauter .

mpresteg changed the title ~~Do we need a more general mechanism for selecting / identify frequency sets~~ Do we need a more general mechanism for selecting / identifying frequency sets Nov 13, 2017

mpresteg added the question label Nov 14, 2017

mpresteg assigned sauter and hpeberhard Dec 11, 2017
