NodeNormalization

Introduction

Node normalization takes a CURIE, and returns:

  • The preferred CURIE for this entity
  • All other known equivalent identifiers for the entity
  • Semantic types for the entity as defined by the Biolink Model

The data currently served by Node Normalization is created by the prototype project Babel, which attempts to find identifier equivalences and ensures that CURIE prefixes are Biolink Model compliant. The NodeNormalization service itself is independent of Babel, so results from improved identifier equivalence tools can be incorporated as they are developed.

To determine whether Node Normalization is likely to be useful, check /get_semantic_types, which lists the Biolink semantic types for which normalization has been attempted, and /get_curie_prefixes, which lists the number of times each prefix is used for a semantic type.
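As a rough illustration of querying these two endpoints, the sketch below builds their URLs and fetches them with the standard library. The base URL (a local instance on port 8000), the helper names `endpoint_url` and `fetch_json`, and the assumption that both endpoints answer plain GET requests with JSON are illustrative choices, not details confirmed by this README.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumption: a locally running instance; replace with your own server root.
BASE_URL = "http://localhost:8000/"


def endpoint_url(base_url: str, endpoint: str) -> str:
    """Join a server root and an endpoint name into a full URL."""
    root = base_url if base_url.endswith("/") else base_url + "/"
    return urljoin(root, endpoint.lstrip("/"))


def fetch_json(url: str, timeout: float = 30.0):
    """GET a URL and decode its JSON body (assumes the endpoint returns JSON)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example (requires a running service):
#   types = fetch_json(endpoint_url(BASE_URL, "get_semantic_types"))
#   prefixes = fetch_json(endpoint_url(BASE_URL, "get_curie_prefixes"))
```

The interactive documentation page of a running instance is the authoritative place to check the exact request and response formats.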

For examples of service usage, see the example notebook.

The Node normalization website leverages the R3 (Redis-REST with referencing) Redis data design and configuration.

Users can find the publicly available website at service.

Installation

Create a virtual environment

    python -m venv nodeNormalization-env

Activate the virtual environment

    source nodeNormalization-env/bin/activate

Install requirements

    pip install -r requirements.txt

Generating equivalence data

The equivalence data can be generated by running Babel. An example of the contents of a compendia file is shown below:

    {"id": {"identifier": "PUBCHEM:50986940"}, "equivalent_identifiers": [{"identifier": "PUBCHEM:50986940"}, {"identifier": "INCHIKEY:CYMOSKLLKPIPCD-UHFFFAOYSA-N"}], "type": ["chemical_substance", "named_thing", "biological_entity", "molecular_entity"]}
    {"id": {"identifier": "CHEMBL.COMPOUND:CHEMBL1546789", "label": "CHEMBL1546789"}, "equivalent_identifiers": [{"identifier": "CHEMBL.COMPOUND:CHEMBL1546789", "label": "CHEMBL1546789"}, {"identifier": "PUBCHEM:4879549"}, {"identifier": "INCHIKEY:FUIYIXDZTPMQEH-UHFFFAOYSA-N"}], "type": ["chemical_substance", "named_thing", "biological_entity", "molecular_entity"]}
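Each line of a compendia file is a standalone JSON object, so it can be read one line at a time. The following is a minimal sketch of pulling out the preferred identifier, the equivalent identifiers, and the types from such a line; the function names are hypothetical and this is not necessarily how the loader itself parses the files.

```python
import json


def parse_compendium_line(line: str) -> dict:
    """Extract the main fields from one JSON line of a compendia file."""
    record = json.loads(line)
    return {
        "preferred": record["id"]["identifier"],
        # Labels are optional in the example data, so fall back to None.
        "label": record["id"].get("label"),
        "equivalents": [e["identifier"] for e in record["equivalent_identifiers"]],
        "types": record["type"],
    }


def iter_compendium(lines):
    """Yield parsed records, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield parse_compendium_line(line)


# First example line from above.
sample = (
    '{"id": {"identifier": "PUBCHEM:50986940"}, '
    '"equivalent_identifiers": [{"identifier": "PUBCHEM:50986940"}, '
    '{"identifier": "INCHIKEY:CYMOSKLLKPIPCD-UHFFFAOYSA-N"}], '
    '"type": ["chemical_substance", "named_thing", '
    '"biological_entity", "molecular_entity"]}'
)
entry = parse_compendium_line(sample)
print(entry["preferred"], entry["equivalents"], entry["types"])
```

To process a whole file, pass an open file handle to `iter_compendium`, which reads it lazily rather than loading everything into memory.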

Creating and loading a Redis container with data

A running instance of Redis is needed to house the node normalization data. A Redis Docker container image can be downloaded from Docker Hub. The Redis container can be started with the following docker command:

    docker run --name node-norm-redis -p 6379:6379 -d redis redis-server --appendonly yes

Note that the dataset for Node normalization is quite large: 256 GB of memory and disk space should be made available to the Redis instance to ensure proper loading of the complete compendia.

Configuration

Ensure that the ./config.json file is created and contains the parameters for the node normalization load specific to your environment.

The configuration parameters compendium_directory and data_files specify the location of the compendia files. An example of the file's contents is shown below:

    {
        "compendium_directory": "<path to files>",
        "data_files": "anatomy.txt,BiologicalProcess.txt,cell.txt,cellular_component.txt,disease.txt,gene_compendium.txt,gene_family_compendium.txt,MolecularActivity.txt,pathways.txt,phenotypes.txt,taxon_compendium.txt",
        "redis_host": "<Redis host server name>",
        "redis_port": <Redis connection port>,
        "redis_password": "<Redis password>",
        "test_mode": 1,
        "debug_messages": 0
    }
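Because data_files is a single comma-separated string, a loader has to split it and join each name with compendium_directory. The sketch below shows one way to do that; the `compendium_paths` helper and the filled-in values (`/data/compendia`, `localhost`, `6379`) are placeholders for illustration and may differ from what load.py actually does.

```python
import json
from pathlib import Path


def compendium_paths(config: dict) -> list:
    """Resolve the comma-separated data_files against compendium_directory."""
    directory = Path(config["compendium_directory"])
    names = [name.strip() for name in config["data_files"].split(",") if name.strip()]
    return [directory / name for name in names]


# Hypothetical filled-in configuration; substitute your own values.
example_config = json.loads("""
{
    "compendium_directory": "/data/compendia",
    "data_files": "anatomy.txt,cell.txt,disease.txt",
    "redis_host": "localhost",
    "redis_port": 6379,
    "redis_password": "",
    "test_mode": 1,
    "debug_messages": 0
}
""")

for path in compendium_paths(example_config):
    print(path)
```

Note that redis_port must be a bare number once the placeholder is filled in, otherwise the file is not valid JSON.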

Loading of the Redis server with compendia data

The load.py script reads the configuration file for load parameters and then loads the compendia data into the Redis instance.

The Redis command line can be used to monitor various aspects of the load.

To observe the progress of the load, open a command line within the container and issue Redis commands.

View the number of keys loaded so far.

    redis-cli info keyspace
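When the key count needs to be tracked from a script, the text printed by the keyspace command can be parsed directly. This is a small sketch that assumes the usual `dbN:keys=...,expires=...` line format; the `parse_keyspace` name and the sample numbers are made up for illustration.

```python
def parse_keyspace(info_text: str) -> dict:
    """Map each database name (e.g. 'db0') to its key count."""
    counts = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as '# Keyspace'.
        if not line or line.startswith("#") or ":" not in line:
            continue
        db_name, _, stats = line.partition(":")
        fields = dict(item.split("=", 1) for item in stats.split(",") if "=" in item)
        if "keys" in fields:
            counts[db_name] = int(fields["keys"])
    return counts


# Illustrative output only; real counts depend on the load.
sample_output = "# Keyspace\ndb0:keys=1200,expires=0,avg_ttl=0\ndb1:keys=350,expires=0,avg_ttl=0\n"
counts = parse_keyspace(sample_output)
print(counts, "total:", sum(counts.values()))
```

Running the command repeatedly and comparing the totals gives a rough view of load progress.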

Once the database has completed loading it is recommended that the Redis database be persisted to disk.

    redis-cli save

Monitor the database to determine if the save has completed.

    redis-cli info persistence
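The persistence report is a list of `field:value` lines. The sketch below shows one way to decide from that text whether a save has finished, by checking the `rdb_bgsave_in_progress` and `rdb_last_bgsave_status` fields. The helper names and the abbreviated sample are illustrative, and treating these two fields as the completion test is an assumption rather than a documented procedure of this project.

```python
def parse_info(info_text: str) -> dict:
    """Turn 'field:value' lines from an INFO section into a dictionary."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition(":")
        if separator:
            fields[key] = value
    return fields


def save_completed(info_text: str) -> bool:
    """True when no background save is running and the last save succeeded."""
    fields = parse_info(info_text)
    return (
        fields.get("rdb_bgsave_in_progress") == "0"
        and fields.get("rdb_last_bgsave_status") == "ok"
    )


# Abbreviated, illustrative sample of the persistence section.
sample_output = """# Persistence
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_bgsave_status:ok
"""
print(save_completed(sample_output))
```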

Starting the FastAPI webserver from the command line

The web server can be started after successful completion of the load.

    cd <Node normalization code root>

    pip install -r requirements.txt
   
    uvicorn --host 0.0.0.0 --port 8000 --workers 1 node_normalizer.server:app

Then navigate to http://localhost:8000/docs to run the application

Webserver Docker container creation and execution

Much like the Redis Docker container noted above, a Docker container can also be created and executed to run the webserver.

Build the webserver Docker image

    cd <Node normalization code root>

    docker build --tag <image_tag> .

Start the container:

Note the Dockerfile specifies port 6380 for the webservice container.

    docker run --name Node-normalization -p 8000:6380 <image_tag>

Then navigate to: http://localhost:8000/docs to run the application

Kubernetes configurations

Kubernetes configurations and helm charts for this project can be found at:

    https://github.com/helxplatform/translator-devops/helm/r3

Configuration

NodeNorm can be configured by setting environmental variables:

  • SERVER_NAME: The name of this server (defaults to infores:sri-node-normalizer)
  • SERVER_ROOT: The server root (defaults to /)
  • LOG_LEVEL: The log level (defaults to ERROR)
  • TRAPI_VERSION: The TRAPI version this version of NodeNorm supports.
  • MATURITY_VALUE: The maturity of this NodeNorm instance (defaults to maturity, e.g. development)
  • LOCATION_VALUE: Where this NodeNorm instance is set up (defaults to location, e.g. RENCI)
  • EQ_BATCH_SIZE: The size of the get_eqids_and_types() batch size (defaults to 2500)
  • OTEL_ENABLED: Turn on OpenTelemetry (default: 'false'); only 'true' will turn this on.
    • JAEGER_HOST and JAEGER_PORT: Hostname and port for the Jaeger instance to provide telemetry to.
    • JAEGER_SERVICE_NAME: The name of this service (defaults to the value of SERVER_NAME)
    • JAEGER_SERVICE_NAME: The name of this service (defaults to the value of SERVER_NAME)
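To make the defaults above concrete, here is a minimal sketch of reading a few of these variables with the documented fallbacks. The `load_settings` function is hypothetical and is not the service's own configuration code; in particular, comparing OTEL_ENABLED against the exact lowercase string 'true' is an assumption based on the description above.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read selected NodeNorm settings, applying the documented defaults."""
    env = os.environ if env is None else env
    server_name = env.get("SERVER_NAME", "infores:sri-node-normalizer")
    return {
        "server_name": server_name,
        "server_root": env.get("SERVER_ROOT", "/"),
        "log_level": env.get("LOG_LEVEL", "ERROR"),
        "eq_batch_size": int(env.get("EQ_BATCH_SIZE", "2500")),
        # Assumption: only the exact string 'true' enables telemetry.
        "otel_enabled": env.get("OTEL_ENABLED", "false") == "true",
        # Falls back to SERVER_NAME when not set explicitly.
        "jaeger_service_name": env.get("JAEGER_SERVICE_NAME", server_name),
    }


print(load_settings({"LOG_LEVEL": "DEBUG", "OTEL_ENABLED": "true"}))
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to exercise without changing the process environment.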