Add Bulk-FHIR Connectathon examples to Jupyter Notebook #24

ianfore · 2021-01-28T18:28:16Z

Jonathan worked up the Search examples in the attachment from queries Brian Walsh had done as
https://colab.research.google.com/drive/1HhEEB3MJ8LbMP2ta946s8OARPc5RflHu?usp=sharing#scrollTo=nM-GHd3IWeqF

Jonathan wrote

A Search implementation with an appropriate connection to FHIR could take care of everything before "Clinical / Care Management Example Query" step, leaving just the interesting query parts up to the data consumer. Those queries were written for sqlite. Here's the same query adapted to PrestoSQL (now Trino), with table name and filters adjusted so it returns some rows from the kidsfirst FHIR data at http://ga4gh-search-adapter-presto-public.staging.dnastack.com/search:

These would be useful to have in a Jupyter notebook. Have attached the queries Jonathan created as an attachment.

fhir_query_examples.txt

jfuerth · 2021-01-29T14:24:04Z

Update: I found a workaround for the performance issue with the transitive closure query. If we first run the transitive closure to get the list of applicable terms, then run the FHIR query with those literal array values in it (rather than the subquery) then it completes much faster.

First, get the subtypes of HP:0410030:

 WITH hpo_subclassof(node) as (
   select array_agg(array [REPLACE(from_term,'_',':'), REPLACE(to_term,'_',':')]) from search_postgres_pgpc.ontology.axiom where ontology_version = 'http://purl.obolibrary.org/obo/hp/releases/2019-04-15' and relation='SubClassOf'
 )
 select transitive_closure((SELECT node FROM hpo_subclassof), array ['HP:0410030']);

Returns:

[HP:0000153, HP:0000163, HP:0000152, HP:0000001, HP:0031816, HP:0410030, HP:0000271, HP:0000118, HP:0000234, HP:0000202]

Then, run the original query with the subtypes as an array literal:

WITH hpo_subclassof(node) as (
  select array_agg(array [REPLACE(from_term,'_',':'), REPLACE(to_term,'_',':')]) from search_postgres_pgpc.ontology.axiom where ontology_version = 'http://purl.obolibrary.org/obo/hp/releases/2019-04-15' and relation='SubClassOf'
)
SELECT * FROM (
    SELECT 
      json_extract_scalar(Observation.observation, '$.subject.reference') patient,
      json_extract_scalar(Observation.observation, '$.extension[0].valueAge.value') ageAtEvent,
      json_extract_scalar(Observation.observation, '$.code.text') phenotype,
      json_extract_scalar(Observation.observation, '$.interpretation[0].coding[0].code') observed,
      row_number() OVER (
        PARTITION BY    
          json_extract_scalar(Observation.observation, '$.subject.reference')
        ORDER BY
          json_extract_scalar(Observation.observation, '$.extension[0].valueAge.value') DESC
      ) as rank
    FROM
      kidsfirst.ga4gh_tables.observation,
      UNNEST (cast(json_extract(Observation.observation, '$.code.coding') as array(json))) AS coding(value)
      WHERE
        json_extract_scalar(coding.value, '$.system') = 'http://purl.obolibrary.org/obo/hp.owl'
        AND contains(array ['HP:0410030', 'HP:0000153', 'HP:0000163', 'HP:0000152', 'HP:0000001', 'HP:0031816', 'HP:0410030', 'HP:0000271', 'HP:0000118', 'HP:0000234', 'HP:0000202'], json_extract_scalar(coding.value, '$.code'))
  )
  WHERE rank = 1 AND observed = 'POS';

Returns the same 42 rows as before. Note that I've plugged the results of the first query into the second query as array ['HP:0410030', 'HP:0000153', 'HP:0000163', 'HP:0000152', 'HP:0000001', 'HP:0031816', 'HP:0410030', 'HP:0000271', 'HP:0000118', 'HP:0000234', 'HP:0000202']. Essentially, I "factored out" the transitive_closure subquery and ran it ahead of time. This makes me think the performance problem in the original, not-factored-out query was that the query planner decided to re-run the subquery for every row. This is a performance bug. Both queries should take the same amount of time.

jfuerth · 2021-01-29T23:47:52Z

Here's a colab notebook that does that above: https://colab.research.google.com/drive/1Fr6l02-1clmouxVuhUgtsRVe7cVphjQA?usp=sharing

ianfore added FASPHackathon2 January 2021 FASP Hackathon search Related to GA4GH Search labels Jan 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Bulk-FHIR Connectathon examples to Jupyter Notebook #24

Add Bulk-FHIR Connectathon examples to Jupyter Notebook #24

ianfore commented Jan 28, 2021

jfuerth commented Jan 29, 2021 •

edited

Loading

jfuerth commented Jan 29, 2021

Add Bulk-FHIR Connectathon examples to Jupyter Notebook #24

Add Bulk-FHIR Connectathon examples to Jupyter Notebook #24

Comments

ianfore commented Jan 28, 2021

jfuerth commented Jan 29, 2021 • edited Loading

jfuerth commented Jan 29, 2021

jfuerth commented Jan 29, 2021 •

edited

Loading