Do we need a more general mechanism for selecting / identifying frequency sets #45
Comments
The intent of having such a simple model (one-to-many between population and cohort) was to avoid falling into some of the deep problems of defining what a population is. I would prefer to get PHYCUS running with a minimal structure for "population", start loading it with real data, and make applications (predictive match, imputation, haplostats) work against it. If an actual use case turns out to require a more complex model, we can then add annotation capabilities or expand the basic model.
I agree with Martin here. Defining a population is clearly out of scope for PHYCUS. Taken generally in this context, a population is a group of people that satisfies the Hardy-Weinberg assumptions (diploid organisms, only sexual reproduction, random mating, infinitely large population, no overlapping generations, equal allele frequencies between the sexes, no migration, no mutation, no selection). While the first two hold for humans, the rest are doubtful or even wrong. Hence, any group of people is only an approximation of a population (which it needs to be in order to use EM algorithms). To what extent a specific group satisfies the population criteria has to be evaluated for each group individually.

A "race" is a way to roughly define a group of people that is a good candidate, if you will, for a population. For some countries, nationality will do as well. But there is no universally established set of populations of the world, nor any kind of hierarchy among them, although populations can have sub-populations.

The cohort concept in PHYCUS is a technical one: a cohort describes how you have sampled your population, i.e. it is the description of the sampled group. In this understanding, in an ideal world the cohort would be identical to the population (and infinitely large...), but as we are limited by resources we cannot type entire populations (except for e.g. Iceland) and hence can only use a subset of the population. This subset is the cohort. If people in a cohort belong to different populations (like 500 people from Iceland and 500 people from the Congo), you should split the cohort; otherwise the EM will produce invalid estimates.
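To make the point about mixed cohorts concrete, here is a minimal sketch of what happens when two groups are pooled (the Wahlund effect). It assumes a single biallelic locus, and the allele frequencies 0.1 and 0.7 are made up for illustration; they are not real data for any population. Each group is in Hardy-Weinberg equilibrium on its own, yet the pooled cohort shows fewer heterozygotes than Hardy-Weinberg would predict from the pooled allele frequency, which is exactly the assumption an EM estimate relies on.

```python
def hwe_genotype_freqs(p):
    """Expected genotype frequencies (AA, Aa, aa) under Hardy-Weinberg
    for a biallelic locus with allele frequency p."""
    q = 1.0 - p
    return (p * p, 2.0 * p * q, q * q)


def pooled_cohort(p1, n1, p2, n2):
    """Pool two groups that are each in Hardy-Weinberg equilibrium.

    Returns (pooled allele frequency,
             heterozygote fraction actually present in the pooled cohort,
             heterozygote fraction expected if the pool were one population).
    """
    het1 = hwe_genotype_freqs(p1)[1]
    het2 = hwe_genotype_freqs(p2)[1]
    total = n1 + n2
    p_pooled = (n1 * p1 + n2 * p2) / total
    het_observed = (n1 * het1 + n2 * het2) / total
    het_expected = hwe_genotype_freqs(p_pooled)[1]
    return p_pooled, het_observed, het_expected


# Hypothetical allele frequencies for two groups of 500 people each.
p_pooled, het_obs, het_exp = pooled_cohort(0.1, 500, 0.7, 500)
print(f"pooled p = {p_pooled:.2f}, "
      f"observed het = {het_obs:.2f}, HWE-expected het = {het_exp:.2f}")
```

With these illustrative numbers the pooled cohort has 30% heterozygotes where Hardy-Weinberg would predict 48%, so treating the pool as one population misstates the genotype distribution.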
Thanks guys. I do believe this is an important question. Just a few unordered remarks:
Grow as we go - fair enough. Re: point 3 - I assume you mean ideas/models for how we annotate / describe frequencies within the frequency service (from a pragmatic implementation point of view)... I agree.
Then we have general agreement. At least among us here....
This thread may move us in the direction of an (eventual) user guide. Immediate steps are to consider some real-world examples of frequency sets with overlapping populations / cohorts. Immediate steps (not the entire issue) assigned to @hpeberhard and @sauter.
This started with contemplating how to assign population and cohort information to a frequency set. For example, a population could be associated with a race group (HIS, AFA, CAU, etc.), which could transcend cohort boundaries. A cohort could be associated with a set of donors recruited to a registry (BTM / NMDP, etc.), which could transcend population boundaries (race group in this case). This implies the need for a many-to-many relationship between cohort and population. Then, if we consider other means of defining a population (geography, etc.), it may be useful to have more general criteria by which a frequency set can be tagged / annotated (like selection criteria). Perhaps the label could serve this purpose, but then it may be useful to categorize a set of 'label types'. So the question is: how can we define a means of annotating any frequency set such that frequency sets are selectable by any available/reasonable criterion (e.g. shoe size, eye color, etc.)? This may be better discussed in person (DaSH 7 - 2017!!). @fscheel @sauter @mmaiers-nmdp @HofmannJ @mhalagan-nmdp @hpeberhard
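One way to picture the 'label types' idea is sketched below. This is not the PHYCUS model: the class names (`Label`, `FrequencySet`), the `select` helper, the label types ("race", "registry", "geography") and the set names are all hypothetical, chosen only to show how typed labels would let one frequency set carry several annotations and be selected by any combination of them, which covers the many-to-many case without fixing a population hierarchy.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Label:
    """A typed annotation; 'kind' plays the role of a label type (hypothetical)."""
    kind: str
    value: str


@dataclass
class FrequencySet:
    """A frequency set carrying any number of typed labels (hypothetical model)."""
    name: str
    labels: set = field(default_factory=set)

    def matches(self, **criteria):
        # A set matches when it carries every requested (kind, value) label.
        return all(Label(kind, value) in self.labels
                   for kind, value in criteria.items())


def select(frequency_sets, **criteria):
    """Return the frequency sets that satisfy all given criteria."""
    return [fs for fs in frequency_sets if fs.matches(**criteria)]


# Illustrative sets only: the same race group appears under two registries,
# and one registry spans two race groups.
sets = [
    FrequencySet("set-a", {Label("race", "CAU"), Label("registry", "NMDP")}),
    FrequencySet("set-b", {Label("race", "AFA"), Label("registry", "NMDP")}),
    FrequencySet("set-c", {Label("race", "CAU"), Label("registry", "BTM"),
                           Label("geography", "Europe")}),
]

by_race = [fs.name for fs in select(sets, race="CAU")]
by_registry = [fs.name for fs in select(sets, registry="NMDP")]
by_both = [fs.name for fs in select(sets, race="CAU", registry="NMDP")]
```

Here `by_race` picks up sets from both registries and `by_registry` picks up both race groups, while a new criterion (such as the "geography" label on `set-c`) can be added without changing the model.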