The CTMM TraIT project recently added the Cell Line Use Case (CLUC) to tranSMART. The CLUC is a collection of data on colorectal and prostate cell lines from an exceptionally broad set of platforms, as shown in the table below. This diverse set is used to:
- Standardize data formats and data processing pipelines from the four domains
- Test the integration pipelines within the TraIT translational toolset
By incorporating the same platforms as used for ongoing research projects, this cell line set gives a representative test set comparable to real patient data, without the legal burden of handling personal data. The TraIT Cell Line Use Case is available under the CC0 license for download here.
Please use the following citation when making use of this dataset: Mariska Bierkens & Jochem Bijlard "The TraIT cell line use case." Manuscript in preparation. More information can also be found on the Bio-IT World Poster "Multi-omics data analysis in tranSMART using the Cell Line Use Case dataset".
During the setup you should configure whether you want to load the data to a local transmart or over ssh to a remote one. Follow the instructions on how to do this here.
This is the directory structure of the downloaded dataset. A similar structure is recommended, because transmart-batch works best when the params file given as an argument sits in a subdirectory of the directory that contains study.params. We will go over some key concepts below.
TraIT-Cell-lines
│   study.params
├── acgh
│   ├── genes
│   │   ├── 180k
│   │   │       cnv.params
│   │   │       Cell-line_samples.txt
│   │   │       Cell-line_subjectmapping.txt
│   │   ├── 244k
│   │   └── annotation
│   │           cnv_annotation.params
│   │           GPL8687-gene_platform.txt
│   ├── qdnaseq
│   └── regions
├── clinical
│       clinical.params
│       Cell-line_columns.txt
│       Cell-line_wordmap.txt
│       Cell-line_NHTMP.txt
├── tags
│       tags.params
│       tags.txt
├── expression
├── mirna
├── proteomics
└── rnaseq
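To check that a params file sits somewhere below a study.params, a small shell helper can walk up the directory tree. This is only a sketch: the function name `find_study_dir` is hypothetical, and it assumes that any ancestor directory holding study.params counts, which matches the layout above but is not taken from the transmart-batch documentation.

```shell
# Hypothetical helper: print the closest ancestor directory of a params
# file that contains study.params; return non-zero if there is none.
find_study_dir() {
  local dir
  # Resolve the directory of the params file to an absolute path.
  dir=$(cd "$(dirname "$1")" && pwd) || return 1
  while [ "$dir" != "/" ]; do
    if [ -f "$dir/study.params" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    dir=$(dirname "$dir")
  done
  return 1
}
```

For example, `find_study_dir acgh/genes/180k/cnv.params` run from the dataset directory would print the path of the TraIT-Cell-lines directory.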
Below I will briefly highlight four classes of parameter files you can find in the CLUC. I will not go into too much detail; if you want to know more, you can follow the provided links.
STUDY_ID=TRAIT-CELL-LINE
SECURITY_REQUIRED=N
TOP_NODE=\Public Studies\TraIT-Cell-line
For more info and additional options see docs.
COLUMN_MAP_FILE=Cell-line_columns.txt
WORD_MAP_FILE=Cell-line_wordmap.txt
For more info and additional options see docs.
ANNOTATIONS_FILE=GPL8687-gene_platform.txt
GENOME_RELEASE=hg19
ORGANISM=Homo sapiens
TITLE=GPL8687_acgh
For more info and additional options see docs.
DATA_FILE=Cell-line_samples.txt
MAP_FILENAME=Cell-line_subjectmapping.txt
DATA_TYPE=R
ALLOW_MISSING_ANNOTATIONS=Y
For more info and additional options see docs.
TAGS_FILE=tags.txt
For more info and additional options see docs.
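Since every params file is a plain list of KEY=VALUE lines, it is easy to inspect one from the shell before loading, for example to confirm the STUDY_ID. The helper below is a minimal sketch; the name `get_param` is hypothetical and it assumes one KEY=VALUE pair per line, as in the files shown above.

```shell
# Hypothetical helper: print the value of KEY from a params file.
# usage: get_param FILE KEY
get_param() {
  awk -v key="$2" '
    index($0, "=") == 0 { next }            # skip lines without "="
    {
      k = substr($0, 1, index($0, "=") - 1) # text before the first "="
      v = substr($0, index($0, "=") + 1)    # text after the first "="
      gsub(/^[ \t]+|[ \t]+$/, "", k)        # trim surrounding blanks
      gsub(/^[ \t]+|[ \t]+$/, "", v)
      if (k == key) { print v; found = 1; exit }
    }
    END { exit (found ? 0 : 1) }            # non-zero if KEY is absent
  ' "$1"
}
```

With the study.params shown above, `get_param study.params STUDY_ID` would print TRAIT-CELL-LINE.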
Because all data cleanup and preparation has already been done here, you should not have to worry about this for now. Let's load this dataset into your transmart database. First make sure you have unpacked the dataset and are able to browse the different files present. Here I will put them in ~/data/trait-cell-lines/, but you can put them anywhere you wish.
When you have prepared your transmart-batch setup, have the executable somewhere you can access it (I have put it in ~/transmart-batch/batch-release.jar), and have set up the properties file appropriately (i.e. at ~/transmart-batch/batchdb.properties), you are ready to go.
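A quick pre-flight check can catch a missing or non-executable file before the first load. This is just a sketch; `check_batch_setup` is a hypothetical name, and the executable-bit test assumes you call the jar directly, as in the commands on this page.

```shell
# Hypothetical pre-flight check for the two files transmart-batch needs.
# usage: check_batch_setup JAR PROPERTIES_FILE
check_batch_setup() {
  local jar=$1 props=$2
  [ -f "$jar" ]   || { echo "missing executable: $jar" >&2; return 1; }
  [ -x "$jar" ]   || { echo "not executable: $jar" >&2; return 1; }
  [ -f "$props" ] || { echo "missing properties file: $props" >&2; return 1; }
}
```

With the locations used here, that would be `check_batch_setup ~/transmart-batch/batch-release.jar ~/transmart-batch/batchdb.properties`.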
It is important to know that in tranSMART you will always want to load the clinical data first. So I navigated to the dataset directory and called transmart-batch from there.
$ cd ~/data/trait-cell-lines/
$ ~/transmart-batch/batch-release.jar -c ~/transmart-batch/batchdb.properties -p clinical/clinical.params
The next step is to upload all annotation platform data. This is required so that the data in the data files can be verified before loading.
As with clinical data, you load the annotation data by calling transmart-batch with the -p parameter pointing to an annotation params file, e.g. for cnv annotations:
$ ~/transmart-batch/batch-release.jar -c ~/transmart-batch/batchdb.properties -p acgh/genes/annotation/cnv_annotation.params
Only after you have finished uploading the annotations for a datatype can you upload the data itself. My advice is to first upload the annotations of all the datatypes you want to load before proceeding.
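To see which annotation params files there are to load, you could list them with find. This sketch assumes that every datatype keeps its annotation params in a directory called annotation, as acgh/genes does; I have not checked that for the other datatypes, and `list_annotation_params` is a hypothetical name.

```shell
# Hypothetical helper: list all *.params files that live inside a
# directory named "annotation", below the given directory (default: .).
list_annotation_params() {
  find "${1:-.}" -type f -path '*/annotation/*' -name '*.params' | sort
}
```

Each path it prints can then be passed to transmart-batch with -p, one call per file.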
Next you will want to load the data files.
$ ~/transmart-batch/batch-release.jar -c ~/transmart-batch/batchdb.properties -p acgh/genes/180k/cnv.params
As a last step you can load the metadata tags.
$ ~/transmart-batch/batch-release.jar -c ~/transmart-batch/batchdb.properties -p tags/tags.params
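The four steps can also be chained in a small script so the order is always respected. This is a sketch, not part of transmart-batch: the names `BATCH_JAR`, `BATCH_PROPS`, `DRY_RUN`, `run_batch` and `load_cluc_acgh` are hypothetical, only the 180k aCGH data is covered, and stopping on failure assumes the executable returns a non-zero exit status when a load fails.

```shell
# Locations used in this walkthrough; override them in your environment.
BATCH_JAR=${BATCH_JAR:-"$HOME/transmart-batch/batch-release.jar"}
BATCH_PROPS=${BATCH_PROPS:-"$HOME/transmart-batch/batchdb.properties"}

# Run transmart-batch for one params file.
# With DRY_RUN=1 the command is only printed, not executed.
run_batch() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$BATCH_JAR -c $BATCH_PROPS -p $1"
  else
    "$BATCH_JAR" -c "$BATCH_PROPS" -p "$1"
  fi
}

# Clinical first, then annotations, then data, then tags.
# "&&" stops the chain at the first step that fails.
load_cluc_acgh() {
  run_batch clinical/clinical.params &&
  run_batch acgh/genes/annotation/cnv_annotation.params &&
  run_batch acgh/genes/180k/cnv.params &&
  run_batch tags/tags.params
}
```

Calling `DRY_RUN=1 load_cluc_acgh` from ~/data/trait-cell-lines/ prints the four commands without touching the database, which is a cheap way to double-check the paths.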