Explore UniProtKB with Amazon Neptune

Introduction

The Universal Protein Knowledge Base (UniProtKB, https://www.uniprot.org/) is a widely used protein data source that is now available through the Registry of Open Data on AWS. UniProt data is highly structured, with many relationships between protein sequences, annotations, ontologies, and other related data sources. UniProtKB can be accessed directly via the UniProt website and is available for bulk download in several formats, including RDF, which is particularly well suited to representing the complex and connected nature of the data as a graph. Creating a custom knowledge base enables more advanced use cases, such as joining with other data sources, augmenting the data with custom annotations and relationships, or inferring new relationships with analytics or machine learning.

In this example, we will demonstrate the step-by-step process to create and use your own protein knowledge base using UniProt RDF data. We will show how to ingest a subset of UniProtKB data into your own Amazon Neptune database directly from the Registry of Open Data on AWS. We will then show how to query the data with SPARQL, create new relationships in the data and visualise the data as a graph.
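As a rough illustration of the ingest and query steps described above, the following sketch builds the two HTTP requests involved: one for Neptune's bulk loader endpoint and one for its SPARQL endpoint. It only constructs the requests and does not send them. The endpoint, role ARN, and S3 source are hypothetical placeholders, and the RDF/XML format is an assumption about the source files. The notebook in this example may perform these steps differently, for instance through notebook magics.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholders: substitute your own cluster endpoint,
# IAM role ARN, and the S3 location of the RDF files you want to load.
NEPTUNE_ENDPOINT = "https://your-neptune-endpoint:8182"
LOAD_ROLE_ARN = "arn:aws:iam::123456789012:role/YourNeptuneLoadRole"
SOURCE_URI = "s3://your-open-data-bucket/path/to/uniprot-rdf/"


def build_load_request(source, role_arn, region, rdf_format="rdfxml"):
    """Build (but do not send) a bulk-load request for the Neptune loader."""
    payload = {
        "source": source,          # S3 URI of a file or folder
        "format": rdf_format,      # assumed RDF/XML; adjust to your files
        "iamRoleArn": role_arn,    # role Neptune assumes to read from S3
        "region": region,          # region of the source bucket
        "failOnError": "FALSE",    # keep loading if one file fails
    }
    return urllib.request.Request(
        NEPTUNE_ENDPOINT + "/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_sparql_request(query):
    """Build (but do not send) a SPARQL query request as a form-encoded POST."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        NEPTUNE_ENDPOINT + "/sparql",
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


# Illustrative query against the UniProt core vocabulary:
# list ten proteins together with their mnemonics.
EXAMPLE_QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein ?mnemonic
WHERE {
  ?protein a up:Protein ;
           up:mnemonic ?mnemonic .
}
LIMIT 10
"""

load_request = build_load_request(SOURCE_URI, LOAD_ROLE_ARN, "eu-west-3")
query_request = build_sparql_request(EXAMPLE_QUERY)
```

Passing either request to `urllib.request.urlopen` from inside the VPC would submit it. If IAM database authentication is enabled on the cluster, the requests would additionally need to be signed, which this sketch does not cover.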

How to run this example

If you would like to try this example yourself, you can launch a CloudFormation stack in your own AWS account. If your region is not listed, modify the launch link and replace the region in it with your own.
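Adapting a launch link to another region can be sketched as follows. This assumes the link carries the region in a `region` query parameter, as CloudFormation console launch links commonly do. The example URL, stack name, and template location are hypothetical and are not the actual links for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_launch_region(launch_url, region):
    """Return launch_url with its `region` query parameter set to region."""
    parts = urlsplit(launch_url)
    # Drop any existing region parameter, keep everything else untouched.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "region"
    ]
    query.insert(0, ("region", region))
    # The fragment (everything after '#') is preserved as-is.
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical launch link, for illustration only.
example_link = (
    "https://console.aws.amazon.com/cloudformation/home"
    "?region=eu-west-3#/stacks/new"
    "?stackName=uniprot-example"
    "&templateURL=https://example.com/UniProtKB.template"
)

print(set_launch_region(example_link, "ap-northeast-1"))
```

Before launching, check that every service used by the stack, including Neptune and SageMaker notebooks, is available in the target region.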

Region	Launch Template
Europe (Paris) eu-west-3
US East (N. Virginia) us-east-1
Asia Pacific (Singapore) ap-southeast-1
Asia Pacific (Sydney) ap-southeast-2

The CloudFormation stack will create the following resources:

A VPC with Subnets, Internet Gateway, NAT Gateway, Routing Tables, and Security Groups
The Neptune DB cluster and instance
An IAM role used to load data from the Registry of Open Data on AWS
An S3 VPC endpoint for Neptune to access the open data bucket
An Amazon SageMaker notebook to load the data and query the database

The resources provisioned by the CloudFormation stack are shown in the diagram below:

The total time to run the lab is approximately one hour. To load the dataset into the Neptune instance as quickly as possible, we use a large writer instance (db.r5.8xlarge), which is not covered by the AWS Free Tier. This will cost approximately 9 USD. Once the data is loaded, we should switch to a smaller instance to minimize costs. Here is a detailed cost estimate for loading data in the eu-west-3 region:

Item	Unit Price (USD)	Quantity	Cost (USD)
Instance hour (db.r5.8xlarge)	6.456 per hour	1	6.456
Storage IO Usage	0.22 per million IO	10	2.2
Storage Usage	0.11 per GB per month	0.1	0.01

The actual costs will vary depending on region, but will be comparable. We will look more into loading times below. For more information on the costs of running Neptune, see the Amazon Neptune pricing page.
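The estimate above is simply the sum of unit price times quantity for each line item. A minimal sketch of that arithmetic, using the eu-west-3 figures from the table (prices for other regions would need to be substituted):

```python
from decimal import Decimal

# (item, unit price in USD, quantity) taken from the cost table above.
COST_ITEMS = [
    ("Instance hour (db.r5.8xlarge)", Decimal("6.456"), Decimal("1")),
    ("Storage IO usage (per million IO)", Decimal("0.22"), Decimal("10")),
    ("Storage usage (per GB per month)", Decimal("0.11"), Decimal("0.1")),
]


def estimate_total(items):
    """Sum unit price times quantity over all line items."""
    return sum((price * quantity for _, price, quantity in items), Decimal("0"))


total = estimate_total(COST_ITEMS)
print(f"Estimated load cost: {total.quantize(Decimal('0.01'))} USD")
# prints "Estimated load cost: 8.67 USD"
```

The result, about 8.67 USD, is where the figure of approximately 9 USD comes from. The instance hour dominates, so the total scales roughly with how long the large writer instance stays running.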

Open the Neptune Workbench Notebook

Once we have run the CloudFormation template, we can open the Neptune Workbench notebook that was created. From the Neptune console, click Notebooks, select the notebook, and then click the Open Notebook button.
