CPTAC3 data catalog

Tables listing CPTAC3 data at GDC and Ding Lab genomic processing results at DCC, as well as details about downloaded data

Overview

Description of files in this project

CPTAC3.cases.dat: All known cases associated with CPTAC3 project. This is the master list
- This also defines the cohort (discovery or confirmatory) and batch of each case. Note that each case may be in multiple batches
CPTAC3.Catalog.dat: Details about all sequence data (WGS, WXS, RNA-Seq, miRNA-Seq, Methylation Array, Targeted Sequencing) at GDC associated with all known cases
- Note that this was previously called CPTAC3.AR.dat
CPTAC3.Demographics.dat: Demographic information associated with all known cases
CPTAC3.Catalog.Summary.txt: Summary of files available for each case on GDC.
./BamMap - has details about GDC data downloaded to Ding Lab
- *.BamMap.dat: "BamMap" files for various systems indicating locations of downloaded hg19, hg38, and FASTQ sequence data
- *.BamMap-summary.txt - summary of files available on a given system as well as GDC.
  - For a given system (e.g., katmai), the format is similar to the Catalog summary file (formerly CPTAC3.file-summary.txt), except that an upper-case symbol indicates presence on that system and a lower-case symbol indicates that the sample is in GDC but not on the system
./DCC_Analysis_Summary - has details about analyses uploaded to DCC

Additional details about catalog creation are found in CPTAC3.case.discover.

Updates

CPTAC3 Catalog Version 2.3

sample_metadata column updated to key=value format with multiple entries separated by space.
Support for sample tags for non-CPTAC-style aliquot names, with unique code based on crc32 hash of aliquot name.
- Example: aliquot 7316UP-1206-1105262 with GDC annotation Duplicate Item: CHOP GBM Duplicate Recurrent Tumor DNA Aliquot will have sample metadata sample_tag=ADNA_27ea3fbd
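A minimal Python sketch of reading the key=value format described above. The function names are hypothetical and not part of the catalog tools. The `crc32_code` helper only illustrates what an 8-digit hex CRC32 looks like; how the catalog actually feeds the aliquot name to crc32 (encoding, trailing newline) is not specified here, so it may not reproduce the code in the example.

```python
import zlib


def parse_sample_metadata(field):
    """Split a space-separated list of key=value entries into a dict."""
    entries = {}
    for item in field.split():
        key, sep, value = item.partition("=")
        if sep:  # ignore tokens that are not key=value
            entries[key] = value
    return entries


def crc32_code(aliquot_name):
    """Hypothetical: 8-digit lowercase hex CRC32 of the UTF-8 aliquot name."""
    return format(zlib.crc32(aliquot_name.encode("utf-8")) & 0xFFFFFFFF, "08x")


meta = parse_sample_metadata("sample_tag=ADNA_27ea3fbd")
print(meta["sample_tag"])                       # ADNA_27ea3fbd
print(len(crc32_code("7316UP-1206-1105262")))   # 8
```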

CPTAC3 Catalog Version 2.2

Flags datasets associated with heterogeneity studies based on GDC aliquot annotation note.

Fields added

Adding the following columns to catalog file:

sample_id - GDC sample name
sample_metadata - Ad hoc metadata associated with this sample. May be comma-separated list
aliquot_annotation - Annotation note associated with aliquot, from GDC

Also, sample_name has additional element based on aliquot_annotation. Details in Heterogeneity Studies section below.

CPTAC3 Catalog Version 2.1

Adding support for scRNA-Seq

CPTAC3 Catalog Version 2.0

Added experimental_strategies "MethArray" and "Targeted Sequencing"
Aliquot information replaces sample information
Added column 10, result_type, and shifted remaining columns to right. This column codes for two distinct things:
- For Methylation Array data, it is the channel (Green or Red)
- For RNA-Seq harmonized BAMs, it is the result type, with values of genomic, chimeric, transcriptome
Added sample type column with long sample type names
AR file renamed Catalog file
file_summary file renamed Catalog.Summary

File Details

CPTAC3.Catalog.dat

List of all WGS, WXS, RNA-Seq, miRNA-Seq, Targeted Sequencing, and Methylation Array data available at GDC. Generated by CPTAC3 Case Discover.

Catalog file columns:

sample_name - ad hoc name for this file, generated for convenience and consistency.
- See CPTAC3 Case Discover for details
case
disease
experimental_strategy - WGS, WXS, RNA-Seq, miRNA-Seq, Methylation Array, Targeted Sequencing
short_sample_type - short name for sample_type: blood_normal, tissue_normal, tumor, buccal_normal, tumor_bone_marrow, tumor_peripheral_blood
aliquot - name of aliquot used
filename
filesize
data_format - BAM, FASTQ, IDAT
result_type - ad hoc value specific to sample type
- "chimeric", "genomic", "transcriptome" for RNA-Seq BAMs,
- "Red" or "Green" for Methylation Array
- "NA" otherwise
UUID
MD5
reference - best guess at the reference of aligned sequence data. Note that these assumptions may not hold in the future
- hg19 for submitted aligned reads
  - hg38 for miRNA submitted reads
- NA for submitted unaligned reads
- hg38 for harmonized reads
sample_type - sample type as reported from GDC, e.g., Blood Derived Normal, Solid Tissue Normal, Primary Tumor, and others
sample_id - GDC sample name
sample_metadata - Ad hoc metadata associated with this sample. Identifies heterogeneity studies and any other tags obtained for custom sample names as described below
aliquot_annotation - Annotation note associated with aliquot, from GDC
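The column list above can be turned into a small reader. This is a sketch only: it assumes the catalog is tab-separated, has one record per line, and that any header or comment lines start with `#`. The names `CATALOG_COLUMNS`, `parse_catalog_line`, and `read_catalog` are hypothetical.

```python
from collections import namedtuple

# Column order as listed in this README (result_type is column 10).
CATALOG_COLUMNS = [
    "sample_name", "case", "disease", "experimental_strategy",
    "short_sample_type", "aliquot", "filename", "filesize",
    "data_format", "result_type", "UUID", "MD5", "reference",
    "sample_type", "sample_id", "sample_metadata", "aliquot_annotation",
]
CatalogRecord = namedtuple("CatalogRecord", CATALOG_COLUMNS)


def parse_catalog_line(line):
    """Parse one tab-separated catalog line (assumed format) into a record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CATALOG_COLUMNS):
        raise ValueError(
            f"expected {len(CATALOG_COLUMNS)} columns, got {len(fields)}"
        )
    return CatalogRecord(*fields)


def read_catalog(path):
    """Yield records from a catalog file, skipping blank and '#' lines."""
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            yield parse_catalog_line(line)


# Placeholder values only, to show the column positions.
placeholder = "\t".join(f"col{i}" for i in range(1, 18))
record = parse_catalog_line(placeholder)
print(record.result_type)  # col10
```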

CPTAC3.Catalog.Summary.dat

Catalog summary files provide a one-line representation of data available for a given case on GDC. Following case and disease, each column represents a particular data type, and one-letter codes T, N, A indicate availability of tumor, blood normal, and tissue adjacent normal samples, respectively. Repeated codes indicate repeated data files.

The summary lists counts of tumor (T), blood normal (N), and adjacent / tissue normal (A) samples for each of the following data types:

WGS.hg19 - WGS data as submitted to GDC, assumed hg19
WXS.hg19 - WXS (aka WES, exome) data as submitted to GDC, assumed hg19
RNA.fq - RNA-Seq data as submitted to GDC, FASTQ format
- R1 and R2 FASTQs are listed individually, so will typically have two of each sample type
miRNA.fq - miRNA-Seq data as submitted to GDC, BAM format
WGS.hg38 - Harmonized WGS data
WXS.hg38 - Harmonized WXS data
RNA.hg38 - Harmonized RNA-Seq data
- Harmonization generates chimeric, genomic, and transcriptome BAM files, so each entry will have 3 of each sample type
miRNA.hg38 - Harmonized miRNA-Seq data
MethArray - Methylation Array data

Example:

C3L-00001   LUAD        WGS.hg19 T N A      WXS.hg19 T N A      RNA.fq TT  AA       miRNA.fq T  A       WGS.hg38 T N A      WXS.hg38 T N A      RNA.hg38 TTT  AAA       miRNA.hg38 T  A     MethArray TT  AA

This line indicates that LUAD case C3L-00001 has tumor, blood normal, and adjacent normal samples for WGS and WXS data as submitted (hg19); tumor and adjacent normal RNA-Seq data (TT, AA because FASTQ data comes in pairs); and tumor and adjacent miRNA data in FASTQ format. All of these are available as harmonized hg38 WGS and WXS, and harmonized hg38 RNA-Seq chimeric, genomic, and transcriptome BAMs are available for tumor and adjacent normal. Methylation array data for tumor and tissue adjacent normal are also available (Green and Red channel for each).
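The one-line format can be counted programmatically. The sketch below assumes fields are whitespace-separated and that the data type labels are exactly those listed above; `parse_summary_line` is a hypothetical helper, not part of CPTAC3 Case Discover.

```python
from collections import Counter

DATA_TYPES = {
    "WGS.hg19", "WXS.hg19", "RNA.fq", "miRNA.fq",
    "WGS.hg38", "WXS.hg38", "RNA.hg38", "miRNA.hg38", "MethArray",
}


def parse_summary_line(line):
    """Return (case, disease, {data_type: Counter of T/N/A codes})."""
    tokens = line.split()
    case, disease = tokens[0], tokens[1]
    counts = {}
    current = None
    for token in tokens[2:]:
        if token in DATA_TYPES:
            current = token
            counts[current] = Counter()
        elif current is not None:
            counts[current].update(token)  # "TT" adds 2 to the T count
    return case, disease, counts


line = (
    "C3L-00001 LUAD WGS.hg19 T N A WXS.hg19 T N A RNA.fq TT AA "
    "miRNA.fq T A WGS.hg38 T N A WXS.hg38 T N A RNA.hg38 TTT AAA "
    "miRNA.hg38 T A MethArray TT AA"
)
case, disease, counts = parse_summary_line(line)
print(counts["RNA.hg38"]["T"])  # 3
print(counts["RNA.fq"]["N"])    # 0
```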

CPTAC3.cases.dat

Comprehensive list of cases along with their disease, cohort, and batch information.

Current cases list consists of 3696 cases and their disease.
- Obtained from file Batches1through9_samples_attribute_tumorcode_added.xlsx, personal communication with Mathangi Thiagarajan
Cohort is an ad hoc column which tries to categorize cases according to Discovery or Confirmatory cohort, per year of contract.
Batch column indicates the year and batch(es) in which each case was processed. Y1 and Y2 correspond to Year 1 and 2, respectively. Note that a given case may belong to several different batches, since not all data for a given case was available at a given time. Such batches are listed as comma-separated names. In the future batch information should be indicated in a different file.
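Because a case can list several batches, grouping cases by batch requires splitting the batch field on commas. A small sketch follows; the case names and batch names are made-up placeholders (actual batch naming is not described here), and `cases_by_batch` is a hypothetical helper.

```python
from collections import defaultdict


def cases_by_batch(rows):
    """Group cases by batch; rows is an iterable of (case, batch_field)."""
    groups = defaultdict(list)
    for case, batch_field in rows:
        for batch in batch_field.split(","):
            batch = batch.strip()
            if batch:
                groups[batch].append(case)
    return dict(groups)


# Placeholder rows; not taken from CPTAC3.cases.dat.
rows = [("CASE-A", "Y1.b1,Y2.b1"), ("CASE-B", "Y2.b1")]
print(cases_by_batch(rows))
# {'Y1.b1': ['CASE-A'], 'Y2.b1': ['CASE-A', 'CASE-B']}
```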

DCC_Analysis_Summary directory

Files here track analyses uploaded to DCC, with one file per analysis pipeline. See DCC_Analysis_Summary/README.md for additional details.

BamMap directory

Contents of ./BamMap directory track in-house data downloaded to Ding Lab servers from GDC. These change frequently and are specific to Ding Lab systems.

Example line from BamMap with header names:

     1  sample_name   C3L-00017.RNA-Seq.R1.A
     2  case    C3L-00017
     3  disease PDA
     4  experimental_strategy   RNA-Seq
     5  sample_type tissue_normal
     6  data_path   /gscmnt/gc2521/dinglab/mwyczalk/somatic-wrapper-data/GDC_import/data/7829f978-5fd7-436a-9ec2-2e58a7bcb1f7/180508_UNC31-K00269_0127_AHTV7YBBXX_ACTGAT_S59_L007_R1_001.fastq.gz
     7  filesize    4470007340
     8  data_format FASTQ
     9  reference   NA
    10  UUID    7829f978-5fd7-436a-9ec2-2e58a7bcb1f7
    11  system  MGI
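As an illustration, a BamMap can be searched for the local path of a given dataset. This sketch assumes the file is tab-separated with the eleven columns shown above and that any header line starts with `#`; `read_bammap` and `find_paths` are hypothetical names.

```python
import csv

BAMMAP_COLUMNS = [
    "sample_name", "case", "disease", "experimental_strategy",
    "sample_type", "data_path", "filesize", "data_format",
    "reference", "UUID", "system",
]


def read_bammap(lines):
    """Parse BamMap lines (assumed tab-separated) into a list of dicts."""
    rows = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(rows, fieldnames=BAMMAP_COLUMNS, delimiter="\t")
    return list(reader)


def find_paths(records, case, experimental_strategy, sample_type=None):
    """Return records matching case, strategy, and optionally sample type."""
    return [
        r for r in records
        if r["case"] == case
        and r["experimental_strategy"] == experimental_strategy
        and (sample_type is None or r["sample_type"] == sample_type)
    ]


# The example line from above, rebuilt as a tab-separated string.
example = "\t".join([
    "C3L-00017.RNA-Seq.R1.A", "C3L-00017", "PDA", "RNA-Seq", "tissue_normal",
    "/gscmnt/gc2521/dinglab/mwyczalk/somatic-wrapper-data/GDC_import/data/"
    "7829f978-5fd7-436a-9ec2-2e58a7bcb1f7/"
    "180508_UNC31-K00269_0127_AHTV7YBBXX_ACTGAT_S59_L007_R1_001.fastq.gz",
    "4470007340", "FASTQ", "NA", "7829f978-5fd7-436a-9ec2-2e58a7bcb1f7", "MGI",
])
records = read_bammap([example])
hits = find_paths(records, "C3L-00017", "RNA-Seq", "tissue_normal")
print(hits[0]["UUID"])  # 7829f978-5fd7-436a-9ec2-2e58a7bcb1f7
```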

BamMap summary

As an example from MGI.BamMap-summary.txt:

CCRCC	    WGS.hg19 t n a	    WXS.hg19 t n a	    RNA.fq TT  AA	    miRNA.fq t  a	    WGS.hg38 T N a	    WXS.hg38 T N A	    RNA.hg38 Ttt  Aaa

This indicates that all RNA-Seq FASTQ, harmonized WGS tumor and blood normal, all harmonized WXS, and genomic hg38 RNA-Seq data are available at MGI. Lower case letters indicate which data are available at GDC but not at MGI.
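The upper-case / lower-case convention can be separated into two sets of counts. This is a sketch assuming whitespace-separated fields and the data type labels used in the catalog summary; `parse_bammap_summary` is a hypothetical helper.

```python
from collections import Counter

DATA_TYPES = {
    "WGS.hg19", "WXS.hg19", "RNA.fq", "miRNA.fq",
    "WGS.hg38", "WXS.hg38", "RNA.hg38", "miRNA.hg38", "MethArray",
}


def parse_bammap_summary(line):
    """Return (on_system, gdc_only): dicts of data type -> Counter."""
    on_system, gdc_only = {}, {}
    current = None
    for token in line.split():
        if token in DATA_TYPES:
            current = token
            on_system[current] = Counter()
            gdc_only[current] = Counter()
        elif current is not None:
            for code in token:
                # Upper case: present on this system; lower case: GDC only.
                target = on_system if code.isupper() else gdc_only
                target[current][code.upper()] += 1
    return on_system, gdc_only


line = (
    "CCRCC WGS.hg19 t n a WXS.hg19 t n a RNA.fq TT AA miRNA.fq t a "
    "WGS.hg38 T N a WXS.hg38 T N A RNA.hg38 Ttt Aaa"
)
on_system, gdc_only = parse_bammap_summary(line)
print(dict(on_system["WGS.hg38"]))  # {'T': 1, 'N': 1}
print(dict(gdc_only["RNA.hg38"]))   # {'T': 2, 'A': 2}
```

Tokens before the first data type label (here, the disease) are skipped.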

NOTE: BamMap summary files are not updated regularly and are considered deprecated.

Custom sample names

SampleRename.dat is a TSV file used to add suffixes based on matches to UUID, aliquot, and experimental strategy. Input TSV file format is one of the following: a) uuid, suffix b) aliquot, experimental_strategy, suffix

The wildcard * in the experimental_strategy field matches all experimental strategies. Multiple matches yield multiple suffixes, appended sequentially. The file is parsed by src/make_catalog.sh in CPTAC3 Case Discover.

Currently, it is used to add suffixes .core and .high_cov to select PDA samples.

Aliquots associated with core biopsies are obtained from
- DeepCoverage_Broad_PDA.xlsx for WXS
- PDA_Bulk_WGS_2_13.xlsx for WGS
Aliquots for high coverage WXS samples are obtained from
- DeepCoverage_Broad_PDA.xlsx
- CPTAC_SupplementalData_WGS&WES_renamingneeded_Aug2019_93samples_check.xlsx

All files were provided by Mathangi Thiagarajan and Ana Robles.
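The SampleRename.dat matching rules described above might look roughly like the following in Python. This is not the logic of src/make_catalog.sh, only a sketch of the stated rules. It assumes UUID matches are applied before aliquot matches, and that a suffix is joined to the sample name with a single dot. All identifiers and example values are hypothetical.

```python
def parse_rename_rules(lines):
    """Split TSV lines into (uuid, suffix) and (aliquot, strategy, suffix) rules."""
    uuid_rules, aliquot_rules = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) == 2:
            uuid_rules.append(tuple(fields))
        elif len(fields) == 3:
            aliquot_rules.append(tuple(fields))
        else:
            raise ValueError(f"unexpected number of fields: {line!r}")
    return uuid_rules, aliquot_rules


def apply_suffixes(sample_name, uuid, aliquot, strategy, uuid_rules, aliquot_rules):
    """Append every matching suffix, in rule order, to sample_name."""
    suffixes = [s for u, s in uuid_rules if u == uuid]
    suffixes += [
        s for a, es, s in aliquot_rules
        if a == aliquot and es in ("*", strategy)  # "*" matches any strategy
    ]
    parts = [s if s.startswith(".") else "." + s for s in suffixes]
    return sample_name + "".join(parts)


# Placeholder rules and identifiers, for illustration only.
rules = ["uuid-1\tcore", "ALQ1\t*\thigh_cov", "ALQ1\tWGS\tcore"]
uuid_rules, aliquot_rules = parse_rename_rules(rules)
print(apply_suffixes("S1", "uuid-1", "ALQ1", "WXS", uuid_rules, aliquot_rules))
# S1.core.high_cov
```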

`deprecated` sample names

A number of samples are marked with the suffix .deprecated. These labels are based on an analysis of duplicate aliquots, and correspond largely to instances where one aliquot has been superseded by another, with the original not removed from GDC. Details of this analysis can be found on shiso:/Users/mwyczalk/Projects/CPTAC3/CPTAC3.Cases/20200501.find_duplicates/README.md

Heterogeneity Studies and duplicates

GDC provides annotations associated with aliquots which contain additional context regarding cases with multiple tumor samples. This information is stored in the field aliquot_annotation and is used to generate a convenient label used in the sample metadata and sample name fields.

If aliquot_annotation is defined for a given data file, we generate a sample label consisting of a label prefix followed by an ID code. An example sample label may be HET_qZq3G, where the prefix HET indicates heterogeneity and the ID code is qZq3G. This code is a hash ID generated with bashids, where the input numerical string is obtained from the aliquot name (CPT0000650008) with "CPT" and any leading 0's removed. The sample label is used in the sample_name and sample_metadata fields.

Table below lists all known GDC aliquot annotations and the prefix used to generate the sample tag.

Aliquot annotation	Label prefix
Additional DNA Distribution - Additional aliquot	`ADD`
BioTEXT_RNA	`BIOTEXT`
Duplicate item: Additional DNA for PDA Deep Sequencing	`DEEP`
Duplicate item: Additional DNA requested	`ADNA`
Duplicate item: Additional RNA requested	`ARNA`
Duplicate item: CCRCC Tumor heterogeneity study	`HET`
Duplicate Item: CHOP GBM Duplicate Primary Tumor DNA Aliquot	`ADNA`
Duplicate Item: CHOP GBM Duplicate Primary Tumor RNA Aliquot	`ADNA`
Duplicate Item: CHOP GBM Duplicate Recurrent Tumor DNA Aliquot	`ADNA`
Duplicate Item: CHOP GBM Duplicate Recurrent Tumor RNA Aliquot	`ADNA`
Duplicate item: No new shipment/material. DNA aliquot resubmission for Broad post-harmonization sequencing and sample type mismatch correction.	`RDNA`
Duplicate item: PDA BIOTEXT DNA	`BIOTEXT`
Duplicate item: PDA Pilot - bulk-derived DNA	`BULK`
Duplicate item: PDA Pilot - core-derived DNA	`CORE`
Duplicate item: Replacement DNA Distribution - original aliquot failed	`RDNA`
Duplicate item: Replacement RNA Aliquot	`RRNA`
Duplicate item: Replacement RNA Distribution - original aliquot failed	`RRNA`
Duplicate item: UCEC BioTEXT Pilot	`BIOTEXT`
Duplicate item: UCEC LMD Heterogeneity Pilot	`LMD`
Original DNA Aliquot	`ODNA`
Replacement DNA Aliquot	`RDNA`
This entity was not yet authorized to be released by the submitters	`UNAV`
unknown	`UNK`
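The label construction can be sketched as follows. Only a few rows of the table are copied into `LABEL_PREFIXES`, and the hash step is left as a caller-supplied `encode` function because the bashids settings (salt, alphabet, minimum length) are not given in this document. All function names are hypothetical.

```python
# Subset of the annotation table above, for illustration.
LABEL_PREFIXES = {
    "Duplicate item: CCRCC Tumor heterogeneity study": "HET",
    "Duplicate item: PDA Pilot - core-derived DNA": "CORE",
    "Duplicate item: PDA Pilot - bulk-derived DNA": "BULK",
    "Original DNA Aliquot": "ODNA",
    "Replacement DNA Aliquot": "RDNA",
    "unknown": "UNK",
}


def aliquot_number(aliquot):
    """Strip 'CPT' and leading zeros from a CPTAC-style aliquot name."""
    if not aliquot.startswith("CPT"):
        raise ValueError(f"not a CPTAC-style aliquot name: {aliquot}")
    digits = aliquot[3:].lstrip("0")
    return int(digits or "0")


def make_sample_label(aliquot, annotation, encode):
    """Build PREFIX_CODE; encode stands in for the bashids hash step."""
    prefix = LABEL_PREFIXES[annotation]
    return f"{prefix}_{encode(aliquot_number(aliquot))}"


print(aliquot_number("CPT0000650008"))  # 650008
# With str as a stand-in encoder the label is HET_650008; the catalog's
# actual hash step would give a short code such as qZq3G instead.
print(make_sample_label(
    "CPT0000650008",
    "Duplicate item: CCRCC Tumor heterogeneity study",
    encode=str,
))
```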

Contact

Matthew Wyczalkowski m.wyczalkowski@wustl.edu, Ding Lab, Washington University School of Medicine
