ryotayamanaka edited this page Feb 24, 2019 · 25 revisions

This project generates RDF data from the tabular data published on the ICGC Data Portal, to improve its reusability and interoperability.

How to generate RDF data

  • OS: Ubuntu 18.04
  • On AWS, an EC2 t2.medium instance (4GB memory) is recommended.
  • Generating the whole dataset requires 250GB of disk space.
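
The disk requirement can be checked before starting. The following is a minimal sketch, not part of the repository: the function names `free_disk_gb` and `require_disk_gb` are made up for illustration, and it assumes the data will be written under the directory you pass in.

```shell
# Print the free space, in whole GB, on the filesystem that holds directory $1.
# "df -Pk" gives POSIX-format output in 1024-byte blocks; column 4 is "Available".
free_disk_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Return 0 if directory $1 has at least $2 GB free, otherwise print a message
# to stderr and return 1.
require_disk_gb() {
  have=$(free_disk_gb "$1")
  if [ "$have" -lt "$2" ]; then
    echo "Only ${have}GB free under $1; ${2}GB needed" >&2
    return 1
  fi
  return 0
}
```

For example, `require_disk_gb "$HOME" 250 && echo "enough space"` checks the home directory against the 250GB figure above.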

Get scripts

Clone this repository.

$ git clone https://github.com/med2rdf/icgc.git

Install Docker

Add the OS user to the docker group, then log out and log in again so that the new group membership takes effect.

$ sudo groupadd docker
$ sudo usermod -aG docker $USER
$ exit

Install Docker via apt.

$ sudo apt-get update
$ sudo apt install docker.io
$ sudo systemctl start docker
$ docker --version
Docker version 18.06.1-ce, build e68fc7a
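
Group membership only applies to sessions started after the `usermod` call, so it can be worth confirming it before running Docker without sudo. A small sketch follows; the helper name `in_group` is hypothetical.

```shell
# Return 0 if the current session belongs to group $1.
# "id -nG" without a user name lists the groups of the running session,
# which is what matters here: a stale login will not show the new group.
in_group() {
  id -nG | tr ' ' '\n' | grep -qx "$1"
}
```

Usage: `in_group docker || echo "log out and log in again"`.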

Install Oracle Database

Get the Dockerfile used to build the Docker image of Oracle Database.

$ mkdir oracle
$ cd oracle
$ git clone https://github.com/oracle/docker-images.git

Download the Oracle Database 18.3.0 installer (LINUX.X64_180000_db_home.zip) from Oracle's download page, and move it into the 18.3.0 directory of the cloned repository.

$ mv LINUX.X64_180000_db_home.zip \
  ~/oracle/docker-images/OracleDatabase/SingleInstance/dockerfiles/18.3.0/

Build the Docker image (needs 4GB memory).

$ cd ~/oracle/docker-images/OracleDatabase/SingleInstance/dockerfiles/
$ bash buildDockerImage.sh -v 18.3.0 -e

Launch Oracle Database in a Docker container.

$ docker run --name oracle \
  -p 1521:1521 -e ORACLE_PWD=Welcome1 \
  -v $HOME:/host-home \
  oracle/database:18.3.0-ee
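
The first start takes a while because the image creates the database before it accepts connections. The sketch below shows one way to wait for it from a second terminal (or after starting the container with `-d`). It assumes the container log contains the banner `DATABASE IS READY TO USE!`, which is what Oracle's image scripts print as far as we know; the function names `db_is_ready` and `wait_for_oracle` are hypothetical.

```shell
# Return 0 if the log text on stdin contains the readiness banner.
# The banner text is an assumption about the Oracle image's output.
db_is_ready() {
  grep -q 'DATABASE IS READY TO USE!'
}

# Poll the log of container $1 (default: oracle) every 30 seconds
# until the readiness banner appears.
wait_for_oracle() {
  container=${1:-oracle}
  until docker logs "$container" 2>&1 | db_is_ready; do
    sleep 30
  done
}
```

Running `wait_for_oracle oracle` before the `setup.sql` step avoids connecting to a database that is still being created.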

Configure the database as a triplestore.

$ docker exec -it oracle \
  sqlplus sys/Welcome1@ORCLPDB1 as sysdba @/host-home/icgc/scripts/setup.sql

Download Data

To download the latest project list, open the Data Portal, click Available Data Type > SSM, and then click the "Export Table as TSV" icon.

Create the project list: projects.tsv

$ cd scripts/download/
$ bash 01_projects.sh projects_2018_02_14_10_50_42.tsv > projects.tsv

Download all files. For a test run, use projects_test.tsv instead of projects.tsv.

$ bash 02_download_all.sh projects.tsv

Generate RDF Data

For a test run, pass download/projects_test.tsv instead of download/projects.tsv.

$ chmod 777 ~/icgc/log ~/icgc/output
$ docker start oracle
$ docker exec -it oracle \
  sqlplus sys/Welcome1@ORCLPDB1 as sysdba @/host-home/icgc/scripts/00_user.sql
$ docker exec -it oracle \
  sh /host-home/icgc/scripts/00_run.sh download/projects.tsv \
  > ~/icgc/log/main.log
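
Because the output is redirected to ~/icgc/log/main.log, nothing appears on the console while the job runs. The sketch below counts lines that carry an Oracle error code (the `ORA-` prefix followed by digits). Whether the scripts write such errors to this log is an assumption, and `count_ora_errors` is a hypothetical helper name.

```shell
# Print the number of lines in log file $1 that contain an Oracle error code.
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
count_ora_errors() {
  grep -c 'ORA-[0-9][0-9]*' "$1" || true
}
```

Usage: `count_ora_errors ~/icgc/log/main.log`. To follow progress live, `tail -f ~/icgc/log/main.log` works from another terminal.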

Ontologies

Ontologies referenced

Guideline

Data conversion procedure

  • The tabular data from the ICGC Data Portal is stored in the database.
  • The data is converted to RDF according to the mapping definitions (written in R2RML).
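
To illustrate the idea of turning table rows into triples, here is a small sketch that is not the project's method: the project uses R2RML mappings, whereas this uses awk as a stand-in. The namespace `http://example.org/icgc/`, the column names, and the function name `tsv_to_ntriples` are all hypothetical. It treats the first column as the row identifier and every other non-empty cell as a plain literal, and it assumes values contain no double quotes or backslashes.

```shell
# Hypothetical namespace, for illustration only.
BASE='http://example.org/icgc/'

# Read a TSV table with a header row on stdin and print N-Triples:
# one subject per row (built from column 1), one predicate per remaining column.
tsv_to_ntriples() {
  awk -F '\t' -v base="$BASE" '
    NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }
    {
      subject = "<" base "row/" $1 ">"
      for (i = 2; i <= NF; i++) {
        if ($i == "") continue   # skip empty cells
        printf "%s <%s%s> \"%s\" .\n", subject, base, header[i], $i
      }
    }
  '
}
```

For example, `printf 'id\tgene\nMU1\tTP53\n' | tsv_to_ntriples` emits a single triple whose subject is derived from `MU1` and whose object is the literal `"TP53"`. An R2RML mapping expresses the same row-to-subject and column-to-predicate rules declaratively.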

See the following pages for details.

RDF data definition policy

  • External RDF resources referenced: UniProt (gene names)