cgp-dss-data-loader

Simple data loader for CGP HCA Data Store

Common Setup

  1. (optional) We recommend using a Python 3 virtual environment (a minimal sketch follows this list).

  2. Run:

    pip3 install cgp-dss-data-loader
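
If you opt for the virtual environment recommended in step 1, a minimal sketch (the environment name is arbitrary):

    python3 -m venv loader-env           # create a Python 3 virtual environment
    source loader-env/bin/activate       # activate it in the current shell
    pip3 install cgp-dss-data-loader     # then install the loader as in step 2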

Setup for Development

  1. Clone the repo:

    git clone https://github.com/DataBiosphere/cgp-dss-data-loader.git

  2. Go to the root directory of the cloned project:

    cd cgp-dss-data-loader

  3. Make sure you are on the develop branch (the consolidated sketch after this list shows the command).

  4. Run (ideally in a new virtual environment):

    make develop
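
Taken together, and assuming a standard git workflow, the development setup looks like:

    git clone https://github.com/DataBiosphere/cgp-dss-data-loader.git
    cd cgp-dss-data-loader
    git checkout develop    # step 3: switch to the develop branch
    make develop            # step 4: ideally in a fresh virtual environment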

Cloud Credentials Setup

Because this program uses Amazon Web Services and Google Cloud Platform, you will need to set up credentials for both before you can run it.

AWS Credentials

  1. If you haven't already, you will need to create an IAM user and a new access key. Instructions are here.

  2. Next you will need to store your credentials so that Boto can access them. Instructions are here; a typical credentials file is sketched below.
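
As a reminder of what step 2 sets up (this is standard Boto/AWS CLI configuration, not specific to this loader), the credentials file usually lives at ~/.aws/credentials and looks like:

    [default]
    aws_access_key_id = YOUR_ACCESS_KEY_ID
    aws_secret_access_key = YOUR_SECRET_ACCESS_KEY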

GCP Credentials

  1. Follow the steps here to set up your Google credentials (a sketch of the usual end result follows).
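
Those steps commonly end with a service account key file and the GOOGLE_APPLICATION_CREDENTIALS environment variable pointing at it; a sketch under that assumption (the path is only an example):

    export GOOGLE_APPLICATION_CREDENTIALS=/home/<user>/my-service-account-key.json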

(Optional) Cloud Metadata Credentials Setup

When the loader submits data, it needs access to the referenced files, which may live in another account, in order to obtain metadata about them (e.g. hash and size).

If the data is public, this is unnecessary. However, if access is controlled, additional credentials must be provided.

If using metadata credentials, it's strongly encouraged to perform a dry run first as a test. This will ensure your credentials are correct.

(Optional) GCP Metadata Credentials

If GCP files are being loaded that specifically require Google user credentials (rather than Google Service Account credentials), perform the following steps:

  1. Make sure you have gcloud installed.

  2. Run

    gcloud auth application-default login

  3. Follow the link and sign in to the account that has access to the files.

  4. This will generate a JSON file containing your user credentials at a path similar to:

    /home/<user>/.config/gcloud/application_default_credentials.json

  5. Copy this JSON file to another location so that it is not accidentally used as the default by the main application (steps 5 and 6 are sketched together after this list).

  6. This file can then be passed to the loader by specifying (as an example):

    --gcp-metadata-cred /home/<user>/metadata_credentials/my_user_credentials.json
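
A sketch of steps 5 and 6 together, using the example paths from above (adjust to your own layout):

    # copy the generated user credentials somewhere they will not be picked up by default
    mkdir -p /home/<user>/metadata_credentials
    cp /home/<user>/.config/gcloud/application_default_credentials.json \
       /home/<user>/metadata_credentials/my_user_credentials.json
    # then pass that copy to the loader with --gcp-metadata-cred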

(Optional) AWS Metadata Credentials

Use this when the AWS files being loaded require assuming a role for access.

One caveat: AWS allows a maximum of 12 hours under an assumed role for a single session, so if loading takes longer than that, it may break.

This involves setting up an AssumedRole on an account that your main AWS credentials have access to. If this has already been done, all you need to do is supply a file containing the AWS ARN of that role; the loader will then assume the role on your behalf when gathering the metadata.

Additional information on setting up an AssumedRole is available in the AWS documentation.

If AWS files are being loaded that require assuming a role for access, perform the following steps:

  1. Write a file containing the ARN of the role, for example (both steps are sketched together after this list):

    arn:aws:iam::************:role/ROLE_NAME_HERE

  2. This file can then be passed to the loader by specifying (as an example):

    --aws-metadata-cred /home/<user>/aws_credentials.config
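
A sketch of both steps together, using the placeholders from above:

    # write the role ARN to a file (the path matches the example in step 2)
    echo 'arn:aws:iam::************:role/ROLE_NAME_HERE' > /home/<user>/aws_credentials.config
    # then point the loader at it with --aws-metadata-cred /home/<user>/aws_credentials.config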

Running Tests

Run:

make test

Getting Data from Gen3 and Loading it

  1. The first step is to extract the Gen3 data you want using the sheepdog exporter. The TopMed public data extracted from sheepdog is available on the release page under Assets. Assuming you use this data, you will now have a file called topmed-public.json.

  2. Make sure you are using the virtual environment you set up in the Setup instructions.

  3. Now you will need to transform the data into the 'standard' loader format. Do this using the newt-transformer. You can follow the common setup, then the section for transforming data from sheepdog.

  4. Now that we have the transformed output, we can run the loader on it.

    If accessing public data, use the command (a dry-run variant is sketched after this list):

    dssload --no-dry-run --dss-endpoint MY_DSS_ENDPOINT --staging-bucket NAME_OF_MY_S3_BUCKET transformed-topmed-public.json
    

    Alternatively, if supplying additional credentials for private data:

    dssload --no-dry-run --dss-endpoint MY_DSS_ENDPOINT --staging-bucket NAME_OF_MY_S3_BUCKET -p GOOGLE_PROJECT_ID --gcp-metadata-cred gs_credentials.json --aws-metadata-cred aws_credentials.config gtex.json
    
  5. You did it!
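
The Cloud Metadata Credentials section recommends a dry run before a real load. This README only shows the --no-dry-run flag; assuming that omitting it performs a dry run (an assumption, not confirmed here), a test invocation might look like:

    # assumption: leaving out --no-dry-run performs a dry run rather than a real load
    dssload --dss-endpoint MY_DSS_ENDPOINT --staging-bucket NAME_OF_MY_S3_BUCKET transformed-topmed-public.json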