Do we need a more general mechanism for selecting / identifying frequency sets #45

Open
mpresteg opened this issue Nov 13, 2017 · 6 comments

@mpresteg
Contributor

mpresteg commented Nov 13, 2017

This started with contemplating how to assign population and cohort information to a frequency set. For example, a population could be associated with a race group (HIS, AFA, CAU, etc.), which could transcend cohort boundaries. A cohort could be associated with a set of donors recruited to a registry (BTM / NMDP, etc.), which could transcend population boundaries (race group in this case). This implies the need for a many-to-many relationship between cohort and population. Then, if we consider other means of defining a population (geography, etc.), it may be useful to have a more general criterion by which a frequency set can be tagged / annotated (like a selection criterion). Perhaps the label could serve this purpose, but then it may be useful to categorize a set of 'label types'. So the question is: how can we define a means of annotating any frequency set such that frequency sets are selectable by any available/reasonable criterion (e.g. shoe size, eye color, etc.)? This may be better discussed in person (DaSH 7 - 2017!!). @fscheel @sauter @mmaiers-nmdp @HofmannJ @mhalagan-nmdp @hpeberhard
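
To make the 'label types' idea concrete, here is a minimal sketch in Python. It is not the PHYCUS data model: the names `Label`, `FrequencySet` and `select`, and the example frequency sets, are hypothetical and only illustrate how typed labels would allow a many-to-many relationship between populations and cohorts and selection by arbitrary criteria.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Label:
    """A typed annotation (hypothetical), e.g. type='population', value='CAU'."""
    type: str
    value: str


@dataclass
class FrequencySet:
    """A frequency set reduced to a name plus its set of labels (hypothetical)."""
    name: str
    labels: set = field(default_factory=set)


def select(sets, **criteria):
    """Return the sets that carry a label for every label_type=value pair given."""
    wanted = {Label(t, v) for t, v in criteria.items()}
    return [s for s in sets if wanted <= s.labels]


# Made-up example data: one population spans two cohorts and vice versa.
sets = [
    FrequencySet("fs1", {Label("population", "CAU"), Label("cohort", "NMDP")}),
    FrequencySet("fs2", {Label("population", "HIS"), Label("cohort", "NMDP")}),
    FrequencySet("fs3", {Label("population", "CAU"), Label("cohort", "BTM")}),
]

cau_sets = select(sets, population="CAU")
nmdp_cau_sets = select(sets, population="CAU", cohort="NMDP")
```

With this shape, a new selection criterion (geography, say) is just another label type, and neither population nor cohort has to own the other.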

@mpresteg mpresteg changed the title Do we need a more general mechanism for selecting / identify frequency sets Do we need a more general mechanism for selecting / identifying frequency sets Nov 13, 2017
@mmaiers-nmdp
Contributor

The intent of having such a simple model (one-to-many between population and cohort) was to avoid falling into some of the deep problems of defining what a population is. I would prefer to get PHYCUS running with a minimal structure for "population" and start loading it with real data and making applications (predictive match, imputation, haplostats) work against it. If we have an actual use case where a more complex model is required, then we can add additional annotation capabilities or expand the basic model.

@sauter
Collaborator

sauter commented Nov 14, 2017

I agree with Martin here.

The definition of a population is clearly out of scope for PHYCUS. Taken generally in this context, a population is a group of people that satisfies the Hardy-Weinberg assumptions (diploid organisms, only sexual reproduction, random mating, infinitely large population, non-overlapping generations, equal allele frequencies between the sexes, no migration, no mutation, no selection). While the first two are true for humans, the rest are not so clear or even wrong. Hence, any group of people is only an approximation of a population (which it should be in order to use EM algorithms). To what extent a specific group satisfies the population criteria has to be evaluated for each group individually. A "race" is a way to roughly define a group of people that is a good candidate, if you will, for a population. For some countries, nationality will do as well. But there is no universally established set of populations of the world, nor any kind of hierarchy among them, although populations can have sub-populations.
Hence, for a population, the ID is somewhat less important than the accompanying description. To this date, that description is a free-text field, because there is no universal set of definitions. Today, there are only two options: come up with a proper (!) description of what you understand/think your population is, or search the already existing descriptions and work out whether they apply to your population. Some day in the future, maybe, someone will be able to deduce from the descriptions whether existing populations in the service are in some kind of relationship to each other. But that's a whole different task.

The cohort concept in PHYCUS is a technical one: a cohort is how you have sampled your population, and the cohort entry is the description of that sampled group. In this understanding, in an ideal world the cohort would be identical to the population (and be infinitely large...), but as we are limited by resources we cannot type entire populations (except for, e.g., Iceland) and hence can only use a subset of the population. This subset is the cohort. If people in a cohort belong to different populations (like 500 people from Iceland and 500 people from Congo), you should split the cohort; otherwise the EM algorithm will produce invalid estimates.
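
A small numeric sketch of why a mixed cohort is a problem. It assumes a toy case that is not taken from any real data: a single biallelic locus and two equally sized groups, each in Hardy-Weinberg equilibrium on its own, with made-up allele frequencies of 0.9 and 0.1. The function names are illustrative only.

```python
def hwe_genotype_freqs(p):
    """Expected genotype frequencies (AA, Aa, aa) under Hardy-Weinberg."""
    q = 1.0 - p
    return (p * p, 2.0 * p * q, q * q)


def pool(freqs_a, freqs_b, weight_a=0.5):
    """Mix two groups' genotype frequencies; weight_a is group A's share."""
    w = weight_a
    return tuple(w * a + (1.0 - w) * b for a, b in zip(freqs_a, freqs_b))


def allele_freq(genotype_freqs):
    """Frequency of allele A: all of AA plus half of the heterozygotes."""
    hom_a, het, _ = genotype_freqs
    return hom_a + 0.5 * het


# Made-up allele frequencies for two groups that differ strongly.
group_a = hwe_genotype_freqs(0.9)
group_b = hwe_genotype_freqs(0.1)

pooled = pool(group_a, group_b)
p_pooled = allele_freq(pooled)

observed_het = pooled[1]                         # heterozygotes actually present
expected_het = hwe_genotype_freqs(p_pooled)[1]   # what HWE would predict

print(f"pooled allele frequency: {p_pooled:.2f}")
print(f"observed heterozygotes:  {observed_het:.2f}")
print(f"expected under HWE:      {expected_het:.2f}")
```

Each group has about 18% heterozygotes, so the pooled cohort does too, while Hardy-Weinberg at the pooled allele frequency of 0.5 would predict 50%. An estimator that assumes Hardy-Weinberg, as EM-based haplotype frequency estimation does, is then fitting a model that the pooled cohort clearly violates.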

@hpeberhard
Collaborator

Thanks guys. I do believe this is an important question. Just a few unordered remarks:

  1. I have to admit that I do not even fully understand the simple model (one to many). Which probably is due to my laziness.
  2. Therefore, I am probably a follower of the "grow as we go" proposal.
  3. Talking about applications sounds to me as if we are about to enter userland. Would it be worthwhile to provide/discuss our ideas/models in a high level document?

@mpresteg
Contributor Author

Grow as we go - fair enough. Re: point 3 - I assume you mean ideas/models for how we annotate / describe frequencies within the frequency service (from a pragmatic implementation point of view)...I agree.

@sauter
Collaborator

sauter commented Nov 14, 2017

Then we have general agreement. At least among us here....

@mpresteg
Contributor Author

mpresteg commented Dec 11, 2017

This thread may move us in the direction of an (eventual) user guide. Immediate steps are to consider some real world examples of frequency sets with overlapping populations / cohorts. Immediate steps (not entire issue) assigned to @hpeberhard and @sauter .
