Filter sequences by lineage (#501)

* Add group (lineage) filter serverside
* simplify, break out common components
* Add UI components
* working query
* Add selected group fields to status box
* fix duplicate URL params
* basic API docs
* Add metadata map docs
* Add TOC and citing
* move citing section to front
* Amend final data integration section
vector-engineering · Jan 19, 2022 · 8f6cfe9
1 parent e8aa1ee
Showing 17 changed files with 882 additions and 339 deletions.
diff --git a/API.md b/API.md
@@ -0,0 +1,189 @@
+# COVID CG API
+
+- [COVID CG API](#covid-cg-api)
+- [Data enabling COVID CG](#data-enabling-covid-cg)
+- [Metadata mappings](#metadata-mappings)
+- [Aggregate data](#aggregate-data)
+  - [Mutation mode](#mutation-mode)
+  - [Group mode (lineage mode)](#group-mode-lineage-mode)
+- [Lineage mutation frequencies](#lineage-mutation-frequencies)
+
+## Data enabling COVID CG
+
+We are extremely grateful to the [GISAID Initiative](https://www.gisaid.org/) and all its data contributors, i.e. the Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based.
+
+Elbe, S., and Buckland-Merrett, G. (2017) Data, disease and diplomacy: GISAID’s innovative contribution to global health. _Global Challenges_, 1:33-46. DOI:[10.1002/gch2.1018](https://doi.org/10.1002/gch2.1018) PMCID: [PMC6607375](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6607375/)
+
+Note: When using results from these analyses in your manuscript, ensure that you acknowledge the contributors of data, i.e. _We gratefully acknowledge all the Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based_.
+
+## Metadata mappings
+
+Much of the API's input and output is encoded as integers for faster query times. For example, mutations from the aggregate data API are returned as integer IDs instead of human-readable mutation names. To map these IDs back to names, first get the metadata map with:
+
+```
+curl https://covidcg.org/init
+```
+
+This map contains integer mappings that the server is currently using, for mutations and all other metadata.
+
+**Update your metadata map before every real query. The metadata mappings change every day.**
+
+## Aggregate data
+
+### Mutation mode
+
+```
+curl --header "Content-Type: application/json" --request POST --data '{
+  "group_key": "mutation",
+  "dna_or_aa": "AA",
+  "coordinate_mode": "gene",
+  "coordinate_ranges": [[21563, 25384]],
+  "selected_gene": "S",
+  "region": [0, 1, 2],
+  "country": [10, 11, 12],
+  "division": [3, 4, 5],
+  "location": [19, 20, 21],
+  "selected_group_fields": { "lineage": ["AY.4", "BA.1"] },
+  "selected_metadata_fields": { "host": [0, 1] },
+  "start_date": "2021-01-01",
+  "end_date": "2021-06-01",
+  "subm_start_date": "2021-03-01",
+  "subm_end_date": "2021-05-01"
+}' https://covidcg.org/data
+```
+
+#### Parameters
+
+- `group_key`: string - required
+  - "mutation" to aggregate over sequence mutations
+  - "lineage" to aggregate over sequence lineage (see section below)
+- `dna_or_aa`: string - required
+  - "DNA": Mutations are on the nucleotide level
+  - "AA": Mutations are on the amino acid level
+- `coordinate_mode`: string - required only for AA mode
+  - "gene": Amino acid mutations are relative to canonical genes, which may not include every ORF within a gene; e.g., mutations are returned relative to the Orf1a coding frame instead of relative to nsp1 or nsp2
+  - "protein": Amino acid mutations are relative to all ORFs
+- `coordinate_ranges`: array of array of ints - required
+  - Only extracts mutations within this set of ranges. Each range is an array of two integers, `[a, b]`, where a and b are both nucleotide positions relative to the WIV04 reference sequence MN996528.1.
+- `selected_gene`: string - required only for gene coordinate mode
+  - Specify the name of the gene to extract mutations from. A list of gene names is in `static_data/genes.json`
+- `selected_protein`: string - required only for protein coordinate mode
+  - Specify the name of the protein to extract mutations from. A list of protein names is in `static_data/proteins.json`
+- `region`, `country`, `division`, `location`: array of ints - at least one is required
+  - Geographical IDs by which to filter sequences. _These are integers representing IDs of geographical locations, not strings!_
+  - To acquire mappings of geographical IDs -> names, see the previous [Metadata mappings](#metadata-mappings) section.
+- `selected_group_fields`: object
+  - Has the form: `{ group: [group names...], ... }`
+  - e.g., to filter for specific PANGO lineages corresponding to Omicron: `{ "lineage": ["BA.1", "BA.2", "BA.3"] }`
+- `selected_metadata_fields`: object
+  - Has the form: `{ metadata_field: [metadata_value_IDs...], ... }`
+  - Available metadata fields are defined in the relevant config YAML file in `config/`. These can also be obtained from the [metadata map](#metadata-mappings)
+  - Metadata values are integers, _not strings_, and are defined by the [metadata map](#metadata-mappings).
+- `start_date`, `end_date`: string - required
+  - Start and end date of sample collection. Dates are strings in ISO format (YYYY-MM-DD)
+- `subm_start_date`, `subm_end_date`: string
+  - Start and end date of sample submission. Dates are strings in ISO format (YYYY-MM-DD)
+
+#### Returns
+
+Returns a JSON array of objects, with the format:
+
+```
+[
+  {
+    "location": string
+      - Name of the specified location. e.g., "North America"
+    "collection_date": integer
+      - Collection date, in JavaScript time (milliseconds since Unix epoch)
+    "group_id": array of integers or null
+      - Represents a group of co-occurring mutation IDs. These mutation IDs can be mapped with the metadata map
+      - null value represents no mutations within the specified genomic region
+    "counts": integer
+      - Occurrences of this group of mutations, at the location, at the given collection date
+  },
+  ...
+]
+```
+
+Important to note:
+
+- Mutations are only reported within the genomic coordinates specified in the request parameters. For example, if requesting AA mutations within the spike gene, mutations in the N gene will not be reported.
+- Mutations can be double-counted if the request contains overlapping locations. For example, if requesting mutations in "North America" and "United States", the result will contain separate entries for both locations.
+- To count individual mutations, unfold the list of mutation IDs in `group_id` and then re-aggregate the counts.
+
+### Group mode (lineage mode)
+
+```
+curl --header "Content-Type: application/json" --request POST --data '{
+  "group_key": "lineage",
+  "region": [0, 1, 2],
+  "country": [10, 11, 12],
+  "division": [3, 4, 5],
+  "location": [19, 20, 21],
+  "selected_group_fields": { "lineage": ["AY.4", "BA.1"] },
+  "selected_metadata_fields": { "host": [0, 1] },
+  "start_date": "2021-01-01",
+  "end_date": "2021-06-01",
+  "subm_start_date": "2021-03-01",
+  "subm_end_date": "2021-05-01"
+}' https://covidcg.org/data
+```
+
+Same as mutation mode, except `group_key` is set to `lineage` (PANGO lineage designation) or another grouping as defined by the `group_cols` setting in the config YAML file. The GISAID site has an additional `clade` phylogenetic grouping. Any fields relating to mutations or genomic coordinates can be omitted in group mode.
+
+## Lineage mutation frequencies
+
+```
+curl --header "Content-Type: application/json" --request POST --data '{"group": "lineage", "mutation_type": "gene_aa", "consensus_threshold": 0.9}' https://covidcg.org/group_mutation_frequencies
+```
+
+#### Parameters
+
+`group` determines the field with which sequences are grouped. Typically this is set to `lineage`.
+
+Valid choices for `group` are:
+
+- `lineage`: PANGO lineage designation
+- `clade`: GISAID clade designation (GISAID site only, not available on the GenBank site)
+
+`mutation_type` determines the format of mutations to return.
+
+Valid choices for `mutation_type` are:
+
+- `dna`: Mutations on the nucleotide level.
+- `gene_aa`: Mutations on the AA level, assigned to genes (not all ORFs; e.g., mutations are returned relative to the Orf1a coding frame instead of relative to nsp1 or nsp2)
+- `protein_aa`: Mutations on the AA level, assigned to proteins (all ORFs)
+
+`consensus_threshold` determines the cutoff for reporting mutations: mutations that occur at less than this frequency are excluded from the results. Set this to `0.0` to include all mutations associated with the particular grouping.
+
+#### Returns
+
+Returns a JSON array of objects, with the format:
+
+```
+[
+  {
+    "name": string
+        - Name of the grouping, e.g., for `group` of `lineage`, this will be the lineage name
+    "count": integer
+        - Number of occurrences of this mutation within the group
+    "fraction": float
+        - Fraction of occurrences of this mutation within the group
+    "gene": string
+        - Name of gene - only for `mutation_type` of `gene_aa`
+    "protein": string
+        - Name of protein/ORF - only for `mutation_type` of `protein_aa`
+    "mutation_id": integer
+        - Mutation ID in our database - these IDs change frequently
+    "pos": integer
+        - Position of the mutation. With `mutation_type` of `dna`, this is the position in nucleotides relative to the WIV04 reference genome MN996528.1
+    "ref": string
+        - The reference base (for `dna` mode) or amino acid (for `gene_aa` or `protein_aa` mode). A `_` character designates a stop codon, and a `-` character represents a gap, i.e., an insertion when ref = `-`
+    "alt": string
+        - The alternate base (for `dna` mode) or amino acid (for `gene_aa` or `protein_aa` mode). A `_` character designates a stop codon, and a `-` character represents a gap, i.e., a deletion when alt = `-`
+    "mutation_name": string
+        - Human readable name of the mutation, in the form ref:pos:alt if in `dna` mode, or gene/protein:ref:pos:alt if in `gene_aa` or `protein_aa` mode.
+  },
+  ...,
+]
+```
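
The "unfold and re-aggregate" step from the API.md notes above can be sketched in Python as follows. This is a minimal illustration, assuming the response has already been parsed into a list of dicts in the documented shape; the helper names `count_individual_mutations` and `to_utc_date` and the sample records are hypothetical and not part of this commit.

```python
from collections import Counter
from datetime import datetime, timezone


def count_individual_mutations(records):
    """Unfold each record's `group_id` list and sum `counts` per mutation ID.

    Returns a Counter keyed by (location, collection_date, mutation_id).
    Records whose `group_id` is null (None) carry no mutations and are skipped.
    """
    totals = Counter()
    for rec in records:
        if rec["group_id"] is None:
            continue
        for mutation_id in rec["group_id"]:
            key = (rec["location"], rec["collection_date"], mutation_id)
            totals[key] += rec["counts"]
    return totals


def to_utc_date(ms_since_epoch):
    """Convert a JavaScript timestamp (milliseconds since Unix epoch) to a UTC date."""
    return datetime.fromtimestamp(ms_since_epoch / 1000, tz=timezone.utc).date()


# Made-up records in the documented response shape (not real data)
example = [
    {"location": "North America", "collection_date": 1609459200000,
     "group_id": [5, 7], "counts": 3},
    {"location": "North America", "collection_date": 1609459200000,
     "group_id": [7], "counts": 2},
    {"location": "North America", "collection_date": 1609459200000,
     "group_id": None, "counts": 10},
]
totals = count_individual_mutations(example)
```

The integer mutation IDs in the result still need to be translated with a freshly fetched metadata map, and overlapping locations in the request would still be counted once per location.
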
diff --git a/README.md b/README.md
@@ -5,6 +5,7 @@
 Table of Contents
 
 - [COVID-19 CG (CoV Genetics)](#covid-19-cg-cov-genetics)
+- [Data enabling COVID CG](#data-enabling-covid-cg)
 - [Installation](#installation)
   - [Dependency changes](#dependency-changes)
   - [Database refresh](#database-refresh)
@@ -17,11 +18,18 @@ Table of Contents
   - [Ingestion](#ingestion)
   - [Main Analysis](#main-analysis)
 - [About the project](#about-the-project)
-- [Data enabling COVID CG](#data-enabling-covid-cg)
 - [Citing COVID CG](#citing-covid-cg)
   - [License](#license)
   - [Contributing](#contributing)
 
+## Data enabling COVID CG
+
+We are extremely grateful to the [GISAID Initiative](https://www.gisaid.org/) and all its data contributors, i.e. the Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based.
+
+Elbe, S., and Buckland-Merrett, G. (2017) Data, disease and diplomacy: GISAID’s innovative contribution to global health. _Global Challenges_, 1:33-46. DOI:[10.1002/gch2.1018](https://doi.org/10.1002/gch2.1018) PMCID: [PMC6607375](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6607375/)
+
+Note: When using results from these analyses in your manuscript, ensure that you acknowledge the contributors of data, i.e. _We gratefully acknowledge all the Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based_.
+
 ## Installation
 
 The COVID-19 CG website comprises of 3 services (PostgreSQL database, Flask server, React frontend). These can be run separately (see detailed instructions at [per-service installation](#per-service-installation)) but we recommend using Docker to manage these services.
@@ -215,7 +223,7 @@ snakemake --configfile ../config/config_genbank.yaml
 
 This pipeline will align sequences to the reference sequence with `minimap2`, extract mutations on both the NT and AA level, and combine all metadata and mutation information into one file: `data_package.json.gz`.
 
-To pass this data onto the front-end application, host the `data_package.json.gz` file on an accessible endpoint, then specify that endpoint in the `data_package_url` field in the `config/config_[workflow]` file that you are using.
+The output data can be uploaded to a PostgreSQL database with `workflow_main/scripts/push_to_database.py`. Alternatively, you can use the output files directly for your own analyses.
 
 ---
 
@@ -233,12 +241,6 @@ Contact the authors by email: [[email protected]](mailto:covidcg@broadi
 
 Python/snakemake scripts were run and tested on MacOS 10.15.4 (8 threads, 16 GB RAM), Google Cloud Debian 10 (buster), (64 threads, 412 GB RAM), and Windows 10/Ubuntu 20.04 via. WSL2 (48 threads, 128 GB RAM)
 
-## Data enabling COVID CG
-
-We are extremely grateful to the [GISAID Initiative](https://www.gisaid.org/) and all its data contributors, i.e. the Authors from the Originating laboratories responsible for obtaining the speciments and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based.
-
-Elbe, S., and Buckland-Merrett, G. (2017) Data, disease and diplomacy: GISAID’s innovative contribution to global health. _Global Challenges_, 1:33-46. DOI:[10.1002/gch2.1018](https://doi.org/10.1002/gch2.1018) PMCID: [31565258](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6607375/)
-
 ## Citing COVID CG
 
 Users are encouraged to share, download, and further analyze data from this site. Plots can be downloaded as PNG or SVG files, and the data powering the plots and tables can be downloaded as well. Please attribute any data/images to [covidcg.org](https://covidcg.org/), or cite our manuscript:

diff --git a/services/server/cg_server/query/selection.py b/services/server/cg_server/query/selection.py
@@ -7,6 +7,7 @@
 
 import pandas as pd
 from psycopg2 import sql
+from cg_server.config import config
 from cg_server.constants import constants
 
 
@@ -113,6 +114,10 @@ def build_sequence_where_filter(req):
         - Structured as { metadata_field: [metadata_values] }
         - Keys are a metadata field, as a string
         - Values are a list of metadata value IDs (integers)
+    selected_group_fields: dict
+        - Structured as { group_key: [group_vals] }
+        - Keys are group types, e.g., "lineage"
+        - Values are a list of group values, e.g., ["B.1.617.2", "BA.1"]
 
     Returns
     -------
@@ -130,7 +135,8 @@ def build_sequence_where_filter(req):
     subm_start_date = None if subm_start_date == "" else pd.to_datetime(subm_start_date)
     subm_end_date = None if subm_end_date == "" else pd.to_datetime(subm_end_date)
 
-    selected_metadata_fields = req.get("selected_metadata_fields", None)
+    selected_metadata_fields = req.get("selected_metadata_fields", {})
+    selected_group_fields = req.get("selected_group_fields", {})
 
     # Construct submission date filters
     if subm_start_date is None and subm_end_date is None:
@@ -170,14 +176,39 @@ def build_sequence_where_filter(req):
     else:
         metadata_filters = sql.SQL("")
 
+    group_filters = []
+    for group_key, group_vals in selected_group_fields.items():
+        # Skip if no group values provided
+        if not group_vals:
+            continue
+
+        # Skip if group key is not valid
+        if group_key not in config["group_cols"].keys():
+            continue
+
+        group_filters.append(
+            sql.SQL("{field} IN {vals}").format(
+                field=sql.Identifier(group_key),
+                vals=sql.Literal(tuple([str(val) for val in group_vals])),
+            )
+        )
+
+    if group_filters:
+        group_filters = sql.SQL(" AND ").join(group_filters)
+        group_filters = sql.Composed([group_filters, sql.SQL(" AND ")])
+    else:
+        group_filters = sql.SQL("")
+
     sequence_where_filter = sql.SQL(
         """
         {metadata_filters}
+        {group_filters}
         "collection_date" >= {start_date} AND "collection_date" <= {end_date}
         {submission_date_filter}
         """
     ).format(
         metadata_filters=metadata_filters,
+        group_filters=group_filters,
         start_date=sql.Literal(start_date),
         end_date=sql.Literal(end_date),
         submission_date_filter=submission_date_filter,
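
The skip rules in the group-filter loop above can be illustrated without a database. The sketch below is a standalone approximation, not the server code: `clean_group_fields` and `valid_group_cols` are hypothetical names, with `valid_group_cols` standing in for the keys of `config["group_cols"]`.

```python
def clean_group_fields(selected_group_fields, valid_group_cols):
    """Keep only the group filters that would contribute a SQL condition.

    Mirrors the two skip rules in the hunk above: entries with no values are
    dropped, and keys that are not configured group columns are dropped.
    Remaining values are stringified, as in the original loop.
    """
    cleaned = {}
    for group_key, group_vals in selected_group_fields.items():
        # Skip if no group values were provided
        if not group_vals:
            continue
        # Skip if the group key is not a configured group column
        if group_key not in valid_group_cols:
            continue
        cleaned[group_key] = tuple(str(val) for val in group_vals)
    return cleaned


# Hypothetical request fragment: one valid filter, one empty, one unknown key
requested = {"lineage": ["AY.4", "BA.1"], "clade": [], "not_a_column": ["x"]}
cleaned = clean_group_fields(requested, valid_group_cols={"lineage", "clade"})
```

Checking keys against the configured group columns before building `sql.Identifier(group_key)` means that unrecognized keys from the request are silently ignored rather than reaching the query.
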

diff --git a/src/components/Buttons/DeselectButton.js b/src/components/Buttons/DeselectButton.js
@@ -0,0 +1,20 @@
+import styled from 'styled-components';
+import Button from './Button';
+
+const DeselectButton = styled(Button)`
+  background-color: #fff;
+  background-image: none;
+  color: #888;
+  padding: 0px 5px;
+  border: none;
+  cursor: pointer;
+  transition: 0.1s all ease-in-out;
+
+  &:hover,
+  &:active {
+    background-color: #f8f8f8;
+    color: #ff5555;
+  }
+`;
+
+export default DeselectButton;