CoronaWhy/coronawhy-infrastructure

 
 


CoronaWhy Common Research and Data Infrastructure

What is CoronaWhy?

CoronaWhy.org is a global volunteer organization dedicated to driving actionable insights into significant world issues using industry-leading data science, artificial intelligence and knowledge sharing. CoronaWhy was founded during the 2020 COVID-19 crisis, following a White House call to help extract valuable data from more than 50,000 coronavirus-related scholarly articles, dating back decades. Currently at over 1000 volunteers, CoronaWhy is composed of data scientists, doctors, epidemiologists, students, and various subject matter experts on everything from technology and engineering to communications and program management.

What has CoronaWhy produced so far?

Read about our creations before you start.

CoronaWhy infrastructure setup

The infrastructure can be set up locally and exposed as a number of CoronaWhy services using the Traefik reverse proxy.

You need to set the value of "traefikhost" before you start deploying the infrastructure:

export traefikhost=apps.coronawhy.org

or

export traefikhost=localhost

Then create a Docker network for all the containers you want to expose on the web:

docker network create traefik

Download all CoronaWhy notebooks:

./build-coronawhy-infra.sh 

and start the services:

docker-compose up

After that, the CoronaWhy services will be exposed on the host you configured.

To run Apache Airflow at http://airflow.apps.coronawhy.org:

docker-compose -f docker-compose-airflow.yml up

To run Portainer at http://portainer.apps.coronawhy.org:

docker-compose -f docker-compose-portainer.yml up

Warning: in this example all infrastructure components are deployed on *.apps.coronawhy.org. A local deployment should be reachable on *.localhost (doccano.localhost, etc.) or *.lab.coronawhy.org.
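As a rough illustration of the naming scheme described in the warning above, the following Python sketch derives per-service URLs from the value of traefikhost. The helper name service_url is hypothetical, and the assumption that every service is routed as <service>.<traefikhost> over plain HTTP is inferred from the Airflow and Portainer examples; the Traefik labels in the compose files are the authoritative source for the actual routing rules.

```python
import os
from typing import Optional


def service_url(service: str, traefikhost: Optional[str] = None) -> str:
    """Return the assumed URL of a service routed by Traefik.

    Assumption: each service is published as http://<service>.<traefikhost>,
    as in http://airflow.apps.coronawhy.org from this README.
    """
    # Fall back to the exported environment variable, then to localhost.
    host = traefikhost or os.environ.get("traefikhost", "localhost")
    return f"http://{service}.{host}"


if __name__ == "__main__":
    for name in ("airflow", "portainer", "doccano"):
        print(service_url(name))
```

With traefikhost=localhost exported, the loop would print addresses such as http://doccano.localhost, which is the form the warning mentions for local deployments.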

CoronaWhy datasets

The CoronaWhy community is building an infrastructure for Open Science that can be distributed and scaled up in the future and reused for other important tasks, such as cancer research. The vision of the community is to build it entirely from open source components, to publish all data in a FAIR way, and to keep all available provenance information.

We use Harvard Data Commons as a foundation that allows all CoronaWhy members to work together. We are building a range of services and running experimental Labs, and our data infrastructure is common and reusable: a place where all research groups share the same resources. It is built on top of the Dataverse data repository developed by Harvard University and is available at datasets.coronawhy.org.

You can access dataset content uploaded as tabular files, for example: http://datasets.coronawhy.org/dataset.xhtml?persistentId=doi:10.5072/FK2/3OZLV6

To get an overview of all files in that dataset:

curl -X GET "http://api.apps.coronawhy.org/dataverse/showfiles?doi=doi:10.5072/FK2/3OZLV6" -H  "accept: application/json"

To read a specific file from Dataverse through the API and return it as JSON:

curl -X GET "http://api.apps.coronawhy.org/dataverse/getfile?fileid=61" -H  "accept: application/json"
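The two curl calls above can be sketched in Python for use from a notebook. This is a minimal sketch using only the standard library; the function names are hypothetical, the endpoints and parameters are copied from the curl examples, and the structure of the returned JSON is not documented here, so the code does not assume any particular fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base URL taken from the curl examples in this README.
API_BASE = "http://api.apps.coronawhy.org"


def showfiles_url(doi: str) -> str:
    """Build the URL that lists all files of a dataset identified by its DOI."""
    # Keep ':' and '/' unescaped so the URL matches the curl example.
    return f"{API_BASE}/dataverse/showfiles?" + urlencode({"doi": doi}, safe=":/")


def getfile_url(file_id: int) -> str:
    """Build the URL that returns one Dataverse file as JSON."""
    return f"{API_BASE}/dataverse/getfile?" + urlencode({"fileid": file_id})


def fetch_json(url: str, timeout: float = 30.0):
    """Send a GET request with the same accept header as the curl examples."""
    request = Request(url, headers={"accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, fetch_json(showfiles_url("doi:10.5072/FK2/3OZLV6")) would issue the same request as the first curl command, provided the service is reachable.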

CoronaWhy also maintains various APIs to integrate COVID-19 datasets from different sources; the documentation is available at http://api.apps.coronawhy.org/docs.

You can access the aggregated COVID-19 data by querying the CoronaWhy Data API with a country code, for example FRA for France: http://api.apps.coronawhy.org/country/FRA
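A small Python sketch of the country lookup follows. The helper name country_url is hypothetical, and the validation assumes that the API expects three-letter codes in upper case, as in the FRA example; the API documentation linked above is the place to confirm the accepted codes.

```python
from urllib.parse import quote

# Base URL taken from the example in this README.
API_BASE = "http://api.apps.coronawhy.org"


def country_url(code: str) -> str:
    """Build the Data API URL for one country.

    Assumption: codes are three ASCII letters, upper case, like "FRA".
    """
    code = code.strip().upper()
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"expected a three-letter country code, got {code!r}")
    return f"{API_BASE}/country/{quote(code)}"


if __name__ == "__main__":
    print(country_url("fra"))
```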

CoronaWhy dashboards

  1. Task-Risk helps to identify risk factors that can increase the chance of being infected, or affect the severity or survival outcome of the infection

  2. Task-Ties explores transmission, incubation and environmental stability

  3. Named Entity Recognition across the entire corpus of CORD-19 papers with full text

  4. Match Clinical Trials allows exploration of the results from the COVID-19 International Clinical Trials dataset

  5. COVID-19 Literature Visualization helps to explore the data behind the AI-powered literature review

  6. AI-Powered Literature Review: Task-TIES contributions from the CoronaWhy team to the AI-powered literature review

More detailed information about every dashboard is published on Kaggle.

CORD-19 preprocessing pipeline

Download the COVID-19 Open Research Dataset Challenge (CORD-19) data from Kaggle:

bash ./download_dataset.sh

Start the NLP pipeline manually by executing

docker run -v /data/distrib/covid-19-infrastructure/data/original:/data -it coronawhy/pipeline /bin/bash

or automatically with

docker-compose -f ./docker-compose-pipeline.yml up
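If the dataset was downloaded to a different location, the manual docker run command needs a different host path. The sketch below, which assumes Docker is installed and uses hypothetical helper names, builds the same command for an arbitrary data directory. The image name and the /data mount point are taken from the command above.

```python
import subprocess
from pathlib import PurePosixPath
from typing import List

# Image name taken from the docker run example in this README.
IMAGE = "coronawhy/pipeline"


def pipeline_command(data_dir: str) -> List[str]:
    """Return the docker run argument list that mounts data_dir at /data."""
    # Docker treats a non-absolute -v source as a named volume, not a path.
    if not PurePosixPath(data_dir).is_absolute():
        raise ValueError("data_dir must be an absolute path")
    return ["docker", "run", "-v", f"{data_dir}:/data", "-it", IMAGE, "/bin/bash"]


def run_pipeline_shell(data_dir: str) -> None:
    """Open an interactive shell in the pipeline container (needs a terminal)."""
    subprocess.run(pipeline_command(data_dir), check=True)
```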

Follow all updates on our YouTube channel and the CoronaWhy GitHub.

Getting Started with CoronaWhy Common infrastructure

How to access Elasticsearch and Dataverse, notebook

CoronaWhy Elasticsearch Tutorial notebook

How to Create Knowledge Graph, notebook

Dataverse Colab Connect, notebook

GitHub dataset sync with Dataverse, notebook

CoronaWhy Services

You can connect your notebooks to any of the services listed below. All services coming from CoronaWhy Labs have experimental status. Join the fight against COVID-19 if you want to help us!

Data repository

Dataverse is deployed as a data service at https://datasets.coronawhy.org. Dataverse is an open source web application to share, preserve, cite, explore, and analyze research data. It facilitates making data available to others.

Elasticsearch

CoronaWhy Elasticsearch holds CORD-19 indexes down to the sentence level and is available through CoronaWhy Search.

Available indexes:

  1. CORD-19 sentences
  2. CORD-19 sections
  3. GRID Affiliations
  4. MeSH
  5. Geonames

MongoDB

The MongoDB service is deployed on mongodb.coronawhy.org and is available from CoronaWhy Labs virtual machines. Please contact our administrators if you want to use it.

Hypothesis

Our Hypothesis annotation service runs on hypothesis.labs.coronawhy.org and allows you to annotate CORD-19 papers manually. Please try our Hypothesis Demo if you're interested.

OpenLink Virtuoso triplestore

We provide Virtuoso as a service with a public SPARQL endpoint. It offers an HTTP-based query service that operates on entity relationship types (relations) represented as RDF sentence collections, using the SPARQL query language. https://virtuoso.openlinksw.com

You can run a simple SPARQL query to get an overview of the triples in the CoronaWhy Knowledge Graph.
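A simple overview query of this kind might look like the sketch below. The endpoint URL is a hypothetical placeholder, because this README does not state the address of the public endpoint, and the query is only a generic "first ten triples" example rather than the query the project links to. Passing the query in the query parameter of a GET request follows the SPARQL 1.1 Protocol.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholder: replace with the actual CoronaWhy SPARQL endpoint.
SPARQL_ENDPOINT = "https://example.org/sparql"

# Generic overview query: any ten triples from the default graph.
OVERVIEW_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def sparql_request(query: str, endpoint: str = SPARQL_ENDPOINT) -> Request:
    """Build a GET request that asks for results in SPARQL JSON format."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


if __name__ == "__main__":
    # Only prints the request URL; send it with urllib.request.urlopen.
    print(sparql_request(OVERVIEW_QUERY).full_url)
```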

Kibana

Kibana is deployed as a community service connected to CoronaWhy Elasticsearch at https://kibana.labs.coronawhy.org. It allows you to visualize Elasticsearch data and navigate the Elastic Stack, so you can do anything from tracking query load to understanding the way requests flow through your apps. https://www.elastic.co/kibana

BEL

BEL Commons 3.0 is available as a service at https://bel.labs.coronawhy.org

An environment for curating, validating, and exploring knowledge assemblies encoded in Biological Expression Language (BEL) to support elucidating disease-specific, mechanistic insight.

You can watch the introduction video and read the Corona BEL Tutorial if you want to know more.

INDRA

INDRA is deployed as a service at https://indra.labs.coronawhy.org/indra.

INDRA (Integrated Network and Dynamical Reasoning Assembler) generates executable models of pathway dynamics from natural language (using the TRIPS and REACH parsers), and from BioPAX and BEL sources (including the Pathway Commons database and NDEx).

You can quickly test the service by running:

curl -X POST "https://indra.labs.coronawhy.org/bel/process_pybel_neighborhood" -H "accept: application/json" -H "content-type: application/json" -d "{ \"genes\": [ \"MAP2K1\" ]}" -l -o test_coronawhy_map2k1.json
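The same test can be sketched from Python with the standard library. The URL, headers and JSON body are copied from the curl command above; the function names are hypothetical, and the shape of the response is not described in this README, so the sketch simply writes the raw response to a file.

```python
import json
from typing import Iterable
from urllib.request import Request, urlopen

# Endpoint taken from the curl example in this README.
INDRA_URL = "https://indra.labs.coronawhy.org/bel/process_pybel_neighborhood"


def neighborhood_request(genes: Iterable[str]) -> Request:
    """Build the POST request with the same headers and body as the curl call."""
    body = json.dumps({"genes": list(genes)}).encode("utf-8")
    headers = {"accept": "application/json", "content-type": "application/json"}
    return Request(INDRA_URL, data=body, headers=headers, method="POST")


def save_neighborhood(genes: Iterable[str], path: str, timeout: float = 60.0) -> None:
    """Send the request and store the raw response, like curl -o."""
    with urlopen(neighborhood_request(genes), timeout=timeout) as response:
        with open(path, "wb") as out:
            out.write(response.read())
```

Calling save_neighborhood(["MAP2K1"], "test_coronawhy_map2k1.json") would mirror the curl command, assuming the service is reachable.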

Geoparser

Geoparser is available as a service at https://geoparser.labs.coronawhy.org

The Geoparser is a software tool that can process information from any type of file, extract geographic coordinates, and visualize locations on a map. Users who are interested in seeing a geographical representation of information or data can choose to search for locations using the Geoparser, through a search index or by uploading files from their computer. https://github.com/nasa-jpl-memex/GeoParser

Tabula

Tabula allows you to extract data from PDF files into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. We deployed it as a CoronaWhy service available to all community members. More information is available on the Tabula website.

Teamchatviz

We use Teamchatviz to explore how communication works in our distributed team and to learn how communication shapes culture in the CoronaWhy community. https://moovel.github.io/teamchatviz/

In progress

We are working on deploying a Neo4j graph database.

Articles produced by CoronaWhy people

I’m an AI researcher and here’s how I fight corona by Artur Kiulian

Exploration of Document Clustering with SPECTER Embeddings by Brandon Eychaner

COVID-19 Research Papers Geolocation by Ishan Sharma

Sweeping Towards Better Coronavirus Forecasting by Isaac Godfried

Transferring Knowledge on Time Series with the Transformer by Isaac Godfried
