Determine data types to be processed #28

Open
3 tasks
ByroneCole-SageBionetworks opened this issue Jan 11, 2023 · 15 comments
Assignees
Labels
Sage Sage Bionetworks task

Comments

@ByroneCole-SageBionetworks

ByroneCole-SageBionetworks commented Jan 11, 2023

  • Look at the list of data types cohorts are collecting and identify existing Nextflow/CWL workflows for these data types.
  • Categorize the data types as with or without existing workflows, and prioritize the data types we potentially want to process and add to Cavatica.
  • Bonus: Determine the process for running nf-core workflows within Cavatica.

Edit: Since we aren't so sure about the metadata available, this might be more helpful:

  1. enumerate anticipated data types;
  2. determine which data types already have existing workflows to support them;
  3. determine which data types already have models/schemas for collecting metadata;
  4. map out a plan to fill in any gaps in both data models and workflows (based on downstream use cases).

Internal JIRA tickets:

@thomasyu888
Contributor

@ByroneCole-SageBionetworks, @kjflynn: Thanks for setting this up. Do you know who I should reach out to in order to find out which new genomic data types are being supported in INCLUDE?

thomasyu888 moved this to In Progress in DMC V3 Tasks on Jan 19, 2023
@kjflynn

kjflynn commented Jan 19, 2023

Hi @thomasyu888, do you mean for V3 or generally? Probably for both, actually. Start with @lopierra.

@thomasyu888
Contributor

@kjflynn For V3 and just generally. Thanks!

@lopierra
Member

Did you mean new data types, or just new data? I don't think we have new genomic data types, just WGS and RNAseq as last time. We will have WGS from de Smith, Hakonarson, and HTP, and RNAseq from Hakonarson and HTP.

@thomasyu888
Contributor

Thanks @lopierra. I meant new data types.

@thomasyu888
Contributor

Had a discussion internally, and this is the summary:

  • See here for a list of data types that cohorts are collecting. These are cohorts we've had intro calls with; we don't actually have data from most of them.

@thomasyu888
Contributor

I have some questions.

  • Is there a difference between R01 metabolomics and metabolomics? If so, what is it?
  • What is the difference between Flow Cytometry, CyTOF, and Cytokine profiles?
  • Do we have a list of the "other sequencing"?

@kjflynn

kjflynn commented Feb 6, 2023 via email

@thomasyu888
Contributor

Thanks! I took what was in this spreadsheet, counted cohorts per unique value in the data_type column (excluding rows whose data_type_short is cognitive or clinical), and got these counts.

| Data Type | # cohorts |
| --- | --- |
| Other sequencing (targeted, GWAS, DNA methylation, etc.) | 6 |
| Neuroimaging | 4 |
| Metabolomics/R01 metabolomics | 4 |
| Cytokine profiles | 3 |
| Proteomics | 3 |
| CyTOF | 2 |
| EEG | 1 |
| Head/neck MRI | 1 |
| Flow Cytometry | 1 |
| Pulse wave velocity | 1 |
| Sleep - summary, saturation, PSG, etc. | 1 |
| Home & lab sleep apnea test (Nox A1) | 1 |
| Sleep - actigraphy, PSG | 1 |
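
For anyone who wants to redo this tally after the spreadsheet is updated, here is a minimal sketch of the count described above, using only the Python standard library. It is an illustration rather than the script actually used: the `cohort` column name, the sample rows, and the exact `data_type_short` values (`cognitive`, `clinical`) are assumptions, since only `data_type` and `data_type_short` are named in this thread.

```python
from collections import defaultdict

# Hypothetical rows standing in for the spreadsheet export. The cohort names,
# the "cohort" column, and the data_type_short values are all assumptions.
ROWS = [
    {"cohort": "cohort_a", "data_type": "Proteomics", "data_type_short": "omics"},
    {"cohort": "cohort_b", "data_type": "Proteomics", "data_type_short": "omics"},
    {"cohort": "cohort_b", "data_type": "Proteomics", "data_type_short": "omics"},
    {"cohort": "cohort_a", "data_type": "EEG", "data_type_short": "physiology"},
    {"cohort": "cohort_c", "data_type": "Cognitive testing", "data_type_short": "cognitive"},
    {"cohort": "cohort_c", "data_type": "CyTOF", "data_type_short": "omics"},
]

# Categories left out of the tally (assumed spelling of the spreadsheet values).
EXCLUDED_SHORT_TYPES = {"cognitive", "clinical"}


def count_cohorts_per_data_type(rows, excluded=EXCLUDED_SHORT_TYPES):
    """Return (data_type, number of distinct cohorts) pairs, largest count first."""
    cohorts_by_type = defaultdict(set)
    for row in rows:
        if row["data_type_short"].strip().lower() in excluded:
            continue  # skip cognitive and clinical rows
        # A set ensures a cohort listed twice for one data type is counted once.
        cohorts_by_type[row["data_type"].strip()].add(row["cohort"].strip())
    counts = {data_type: len(cohorts) for data_type, cohorts in cohorts_by_type.items()}
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def with_multiple_cohorts(counts):
    """Keep only the data types contributed by more than one cohort."""
    return [(data_type, n) for data_type, n in counts if n > 1]


if __name__ == "__main__":
    for data_type, n in count_cohorts_per_data_type(ROWS):
        print(f"{data_type}\t{n}")
```

To run it on the real sheet, one option would be to export the tab as CSV and pass `list(csv.DictReader(f))` in place of `ROWS`, after adjusting the column names to match. `with_multiple_cohorts` narrows the result to data types shared by more than one cohort.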

I'm thinking we could try to find Nextflow or CWL workflows for the data types that don't yet have workflows defined as Cavatica applications, so they can be executed on the data. Some questions:

  • Am I looking at the right spreadsheet for potential data types?
  • Does it make sense to you all to focus on those data types that will have more than one cohort contributing data?
  • We will start by identifying the data types that have pre-existing CWL/Nextflow workflows, but will also identify those that don't. For the ones that don't, is the thought that we would need to develop them?

@lopierra
Member

lopierra commented Feb 6, 2023

The Assays tab in that same spreadsheet is a little more granular on different types of sequencing, etc.

"R01 metabolomics" just means metabolomics data from a previous R01 grant.

I'm not sure we have enough metadata currently to start setting up workflows. Like there are numerous types of proteomics, and I'm not sure what each cohort has, and probably wouldn't ask for the details until they're actually ready to send data.

I'm also not sure if the plan is to import and harmonize all the actual data, or just make the files available. I guess it depends on the data type and whether multiple cohorts are doing comparable assays that could be analyzed together.

@thomasyu888
Contributor

Thanks @lopierra - this is very helpful!

This ticket is specifically to determine data types we would want processed along with whether or not there are existing bioinformatics workflows. I'll take a look at the assays tab and regenerate some numbers.

@lopierra
Member

lopierra commented Feb 6, 2023

Thanks for doing that! I just added a couple more assays for the Aldinger cohort (we talked last week and I haven't had a chance to update her info in the other tabs yet). She will have single-cell RNAseq and genotyping of fetal tissue.

@thomasyu888
Contributor

thomasyu888 commented Mar 27, 2023

Sorry for the long delay. I took a look at the assays sheet, and it would be helpful if we had a dictionary of assays that cohorts could choose from. That said, here are all the assays reported by more than one cohort (aside from RNAseq and WGS, which already have workflows):

| Assay | Number of Cohorts |
| --- | --- |
| Neuroimaging (volumetric MRI, fMRI, fNIRS, DTI, DSI) | 6 |
| Metabolomics / NMR metabolomics / P4C mass spec metabolomics / R01 metabolomics | 5 |
| Cytokine / MSD cytokine | 3 |
| SOMAscan proteomics / proteomics | 3 |
| CyTOF | 2 |
| amyloid-PET | 2 |
| tau-PET | 2 |

Are these still true:

> I'm not sure we have enough metadata currently to start setting up workflows. Like there are numerous types of proteomics, and I'm not sure what each cohort has, and probably wouldn't ask for the details until they're actually ready to send data.
>
> I'm also not sure if the plan is to import and harmonize all the actual data, or just make the files available. I guess it depends on the data type and whether multiple cohorts are doing comparable assays that could be analyzed together.

@lopierra
Member

We have not gotten any more assay data since the Oct 2022 release. However, Korenberg is getting ready to send us data - they have RNAseq, methylation, MRI imaging, cognitive tests, and lab data.
I still don't know about harmonization vs. making files available - we should bring this up at Data Implementers at some point. An additional complication is that ABC-DS will not allow their assay data to be displayed in the portal, so I'm not sure we should even count those in the number of cohorts.

@thomasyu888
Contributor

thomasyu888 commented Mar 28, 2023

Ah, I see. Thanks for the update. I will bring this up at the Data Implementers meeting soon!

Status: In Progress

4 participants