Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Bulk-FHIR Connectathon examples to Jupyter Notebook #24

Open
ianfore opened this issue Jan 28, 2021 · 2 comments
Open

Add Bulk-FHIR Connectathon examples to Jupyter Notebook #24

ianfore opened this issue Jan 28, 2021 · 2 comments
Labels
FASPHackathon2 January 2021 FASP Hackathon search Related to GA4GH Search

Comments

@ianfore
Copy link
Collaborator

ianfore commented Jan 28, 2021

Jonathan worked up the Search examples in the attachment from queries Brian Walsh had done as
https://colab.research.google.com/drive/1HhEEB3MJ8LbMP2ta946s8OARPc5RflHu?usp=sharing#scrollTo=nM-GHd3IWeqF

Jonathan wrote

A Search implementation with an appropriate connection to FHIR could take care of everything before "Clinical / Care Management Example Query" step, leaving just the interesting query parts up to the data consumer. Those queries were written for sqlite. Here's the same query adapted to PrestoSQL (now Trino), with table name and filters adjusted so it returns some rows from the kidsfirst FHIR data at http://ga4gh-search-adapter-presto-public.staging.dnastack.com/search:

These would be useful to have in a Jupyter notebook. Have attached the queries Jonathan created as an attachment.

fhir_query_examples.txt

@ianfore ianfore added FASPHackathon2 January 2021 FASP Hackathon search Related to GA4GH Search labels Jan 28, 2021
@jfuerth
Copy link

jfuerth commented Jan 29, 2021

Update: I found a workaround for the performance issue with the transitive closure query. If we first run the transitive closure to get the list of applicable terms, then run the FHIR query with those literal array values in it (rather than the subquery) then it completes much faster.

First, get the subtypes of HP:0410030:

 WITH hpo_subclassof(node) as (
   select array_agg(array [REPLACE(from_term,'_',':'), REPLACE(to_term,'_',':')]) from search_postgres_pgpc.ontology.axiom where ontology_version = 'http://purl.obolibrary.org/obo/hp/releases/2019-04-15' and relation='SubClassOf'
 )
 select transitive_closure((SELECT node FROM hpo_subclassof), array ['HP:0410030']);

Returns:

[HP:0000153, HP:0000163, HP:0000152, HP:0000001, HP:0031816, HP:0410030, HP:0000271, HP:0000118, HP:0000234, HP:0000202]

Then, run the original query with the subtypes as an array literal:

WITH hpo_subclassof(node) as (
  select array_agg(array [REPLACE(from_term,'_',':'), REPLACE(to_term,'_',':')]) from search_postgres_pgpc.ontology.axiom where ontology_version = 'http://purl.obolibrary.org/obo/hp/releases/2019-04-15' and relation='SubClassOf'
)
SELECT * FROM (
    SELECT 
      json_extract_scalar(Observation.observation, '$.subject.reference') patient,
      json_extract_scalar(Observation.observation, '$.extension[0].valueAge.value') ageAtEvent,
      json_extract_scalar(Observation.observation, '$.code.text') phenotype,
      json_extract_scalar(Observation.observation, '$.interpretation[0].coding[0].code') observed,
      row_number() OVER (
        PARTITION BY    
          json_extract_scalar(Observation.observation, '$.subject.reference')
        ORDER BY
          json_extract_scalar(Observation.observation, '$.extension[0].valueAge.value') DESC
      ) as rank
    FROM
      kidsfirst.ga4gh_tables.observation,
      UNNEST (cast(json_extract(Observation.observation, '$.code.coding') as array(json))) AS coding(value)
      WHERE
        json_extract_scalar(coding.value, '$.system') = 'http://purl.obolibrary.org/obo/hp.owl'
        AND contains(array ['HP:0410030', 'HP:0000153', 'HP:0000163', 'HP:0000152', 'HP:0000001', 'HP:0031816', 'HP:0410030', 'HP:0000271', 'HP:0000118', 'HP:0000234', 'HP:0000202'], json_extract_scalar(coding.value, '$.code'))
  )
  WHERE rank = 1 AND observed = 'POS';

Returns the same 42 rows as before. Note that I've plugged the results of the first query into the second query as array ['HP:0410030', 'HP:0000153', 'HP:0000163', 'HP:0000152', 'HP:0000001', 'HP:0031816', 'HP:0410030', 'HP:0000271', 'HP:0000118', 'HP:0000234', 'HP:0000202']. Essentially, I "factored out" the transitive_closure subquery and ran it ahead of time. This makes me think the performance problem in the original, not-factored-out query was that the query planner decided to re-run the subquery for every row. This is a performance bug. Both queries should take the same amount of time.

@jfuerth
Copy link

jfuerth commented Jan 29, 2021

Here's a colab notebook that does that above: https://colab.research.google.com/drive/1Fr6l02-1clmouxVuhUgtsRVe7cVphjQA?usp=sharing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FASPHackathon2 January 2021 FASP Hackathon search Related to GA4GH Search
Projects
None yet
Development

No branches or pull requests

2 participants