Seeking clarification on parent/TOPMed study differences #2

Open
wants to merge 2 commits into
base: master

aofarrel
Contributor

@aofarrel aofarrel commented Sep 8, 2020

The older version was a bit unclear about what a user should expect when looking at a Parent study in Gen3, especially since the line about Parent studies lacking genomic data came after the statement that, in Gen3, they do have genomic data if there is also a TOPMed study.

I still have some questions whose answers need to be reflected in the documentation, so please let me know so I can add accurate information to this PR before it's merged:

  • If a user is going to use both phenotypic and genotypic data for their analysis, and this is a TOPMed-plus-Parent situation like COPDGene, should we direct the user to import ONLY the Parent study, to prevent them from importing the genomic data twice?
  • Likewise, if a user is only using genomic data, is there any reason for them to import the Parent study? Should we direct them to import just the TOPMed study to reduce workspace clutter?
  • Comparing the number of CRAM files in Parent COPDGene and TOPMed COPDGene shows that they are not equivalent. The Parent version of COPDGene has ~300 fewer CRAMs than the TOPMed version. It follows that most, but not all, of the genomic data is in the Parent study, and that fact isn't reflected in the documentation as it currently stands. What data is missing? Why?

merge to update my own fork
Clarifications based on my understanding.
@aofarrel aofarrel requested review from ac3eb and bethsheets September 8, 2020 20:47
@ac3eb
Contributor

ac3eb commented Sep 8, 2020

Hi Ash, good questions.

  • If the user is only going to be using genomic data, they should go with TOPMed.

  • We create a link between genomic and phenotypic data at the subject level. We've seen many instances where not all of the subject_ids are shared between parent and TOPMed, so it's not surprising that the numbers don't match 100%. This means that there can be a few subjects (or a lot) that have genomic data available but no phenotypic data.

  • The last piece of the puzzle is version updates. If the parent study or the TOPMed study is, for example, one version out of date, it's possible that the subjects within a group have been moved around (this is why versioning can be really important). In that case there could be a small discrepancy between the subject lists.

Therefore the answer is a bit tricky, but for the most part: if you want to run a GWAS or just want pheno + geno, go with parent. If you only want geno, go with TOPMed. Ultimately, you can import both if you want to double-check or do the subject linking yourself in the workspace.
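
For anyone who does want to double-check the subject linking themselves, a minimal sketch of the comparison might look like the following. It assumes the subject IDs from each study have already been exported as plain lists. The function name `subject_overlap` and the example IDs are made up for illustration, and none of this is a Gen3 or Terra API call.

```python
def subject_overlap(parent_ids, topmed_ids):
    """Compare two collections of subject IDs (hypothetical helper, not a Gen3 API)."""
    parent, topmed = set(parent_ids), set(topmed_ids)
    shared = parent & topmed
    return {
        # Subjects with both phenotypic (parent) and genomic (TOPMed) data
        "shared": sorted(shared),
        # Subjects that appear only in the parent study
        "parent_only": sorted(parent - topmed),
        # Subjects that appear only in the TOPMed study
        "topmed_only": sorted(topmed - parent),
        # Fraction of TOPMed subjects that also appear in the parent study
        "fraction_of_topmed_shared": len(shared) / len(topmed) if topmed else 0.0,
    }


# Made-up IDs, purely for illustration
parent_ids = ["S001", "S002", "S003", "S004"]
topmed_ids = ["S002", "S003", "S005"]
report = subject_overlap(parent_ids, topmed_ids)
```

Here `report["topmed_only"]` would list the subjects that have genomic data but no phenotypic record in the parent study, which is the kind of gap described above.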

@aofarrel aofarrel closed this Sep 8, 2020
@aofarrel aofarrel reopened this Sep 8, 2020
@aofarrel
Contributor Author

aofarrel commented Sep 8, 2020

Let's just pretend I didn't misclick and close the PR for a second there...

@ac3eb Should we notify users of the discrepancy, or do you think that the mismatch (i.e., the ~300 missing CRAMs in the case of COPDGene) is not important in the grand scheme of things?

@ac3eb
Contributor

ac3eb commented Sep 9, 2020

Hmm, I don't think we should call it a discrepancy or a lack of CRAMs, since the link is correct. It's simply the nature of the studies: not all of the subjects are present in both parent and TOPMed. The situation would be the same if users tried to establish that link themselves by accessing parent and then TOPMed separately. I do think it's a great idea to mention that they shouldn't expect both pools of subjects to match 100% every time. In fact, we've seen a couple of examples where there is a 0% match between parent and TOPMed, even though TOPMed studies are technically considered child studies of the parent. I can add an explanation to that section of what Gen3 does to create that link and how it works when exporting a study.

Note that we created that link between parent and TOPMed subjects after receiving several requests.

@aofarrel
Contributor Author

I do think that explanation would be helpful, especially since that link was so widely requested. For instance, I still don't quite understand why a parent and TOPMed study would have 0% overlap. That being said -- if our researchers are coming from a TOPMed background, this may be a lot less mysterious to them than it is to me. So whatever you think is appropriate in terms of explanation, I'll go along with that.

Re: adding an explanation, should I merge this PR so you can add your contributions easily?
