The mapped concept set query is very slow on sqlite when the concept set is large, sometimes taking minutes to complete. The query lives in OhdsiShinyModules/R/cohort-diagnostics-shared.R (line 727 at commit 42f5bb4). It can be improved by adding indexes and performing the final join in R instead of in the database. @cebarboza and I were working on it.
Indexes drastically sped up this part of the query:
WITH resolved_concepts_mapped AS (
  SELECT concept_sets.concept_id AS resolved_concept_id,
    c1.concept_id,
    c1.concept_name,
    c1.domain_id,
    c1.vocabulary_id,
    c1.concept_class_id,
    c1.standard_concept,
    c1.concept_code
  FROM (
    SELECT DISTINCT concept_id
    FROM @schema.@resolved_concepts
    WHERE database_id IN (@databaseIds)
      AND cohort_id = @cohort_id
  ) concept_sets
  INNER JOIN @schema.@concept_relationship cr
    ON concept_sets.concept_id = cr.concept_id_2
  INNER JOIN @schema.@concept c1
    ON cr.concept_id_1 = c1.concept_id
  WHERE cr.relationship_id = 'Maps to'
    AND c1.standard_concept IS NULL
)
However, the final join was still slow. It is a many-to-many join that repeats each row of the mapped concept table (the result from above) for every combination of database, concept set, and cohort it appears in.
SELECT
c.database_id,
c.cohort_id,
c.concept_set_id,
mapped.*
FROM (SELECT DISTINCT concept_id, database_id, cohort_id, concept_set_id FROM @schema.@resolved_concepts) c
INNER JOIN resolved_concepts_mapped mapped ON c.concept_id = mapped.resolved_concept_id
{@cohort_id != ''} ? { WHERE c.cohort_id = @cohort_id};
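For context, the @-prefixed names and the {...} ? {...} block are SqlRender parameter and conditional syntax. A minimal sketch of rendering the full query (CTE plus final SELECT) and running it against sqlite; sql, con, and the parameter values here are illustrative assumptions, not the app's actual code:

# Render the SqlRender template, translate to the sqlite dialect, and execute
library(SqlRender)
rendered <- render(sql,
                   schema = "main",
                   resolved_concepts = "resolved_concepts",
                   concept_relationship = "concept_relationship",
                   concept = "concept",
                   databaseIds = "'database1'",
                   cohort_id = 1001)
translated <- translate(rendered, targetDialect = "sqlite")
mappedConcepts <- DBI::dbGetQuery(con, translated)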
However, when we downloaded the result of SELECT DISTINCT concept_id, database_id, cohort_id, concept_set_id FROM @schema.@resolved_concepts into R and then did the join there, it was very fast.
Perhaps there is a way to speed up the whole query in SQL. For us, adding indexes, splitting the query in two, and doing the final join in R fixed the slowness in the shiny app.
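A minimal sketch of the split approach, assuming con is a DBI connection to the sqlite file and mappedSql holds the rendered query for the CTE portion above (both names are illustrative):

# Run the two halves separately and do the many-to-many join in R
mapped <- DBI::dbGetQuery(con, mappedSql)
lookup <- DBI::dbGetQuery(con,
  "SELECT DISTINCT concept_id, database_id, cohort_id, concept_set_id
     FROM resolved_concepts;")
# merge() performs the inner join in memory, which was much faster for us
mappedConceptSets <- merge(lookup, mapped,
                           by.x = "concept_id", by.y = "resolved_concept_id")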
I was also thinking that the app might not really need to do this final join at all, since it duplicates a lot of data. In any case, we are recording our findings in this issue and can make a PR if the approach we came up with seems reasonable.
The proposal would be to optimize the mappedConceptSet function by adding indexes if they are not already there (for sqlite only, since I'm not sure adding indexes works across all dbms), splitting the query into two parts as shown above, and then doing the final join in R instead of in SQL.
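Since sqlite supports CREATE INDEX IF NOT EXISTS, the indexes can be created idempotently; a minimal sketch, guarded so it only touches sqlite connections (the index shown is one of the three listed below):

# Only create the index on sqlite; other dbms are left alone
if (inherits(con, "SQLiteConnection")) {
  DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_resolved_concepts
    ON resolved_concepts (concept_id, database_id, cohort_id, concept_set_id);")
}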
Adding these indexes to the sqlite database improves the speed and user experience of the cohort diagnostics shiny app:
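# con is an open DBI connection to the cohort diagnostics sqlite database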
DBI::dbExecute(con, "CREATE INDEX idx_resolved_concepts ON resolved_concepts (concept_id, database_id, cohort_id, concept_set_id);")
DBI::dbExecute(con, "CREATE INDEX idx_concept_relationship ON concept_relationship (concept_id_2, relationship_id);")
DBI::dbExecute(con, "CREATE INDEX idx_concept_main ON concept (concept_id, standard_concept);")
We could also consider pre-joining this data once when we build the sqlite database, rather than joining in the database while the user is interacting with the app. It would add some columns to the resolved_concepts table and so require more disk space, but it might be better for the app's user experience.
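A sketch of what that pre-join could look like at build time; mapped_concepts_flat is a hypothetical table name, and the per-cohort filters from the CTE are dropped because all combinations get materialized up front:

# Materialize the many-to-many join once when the sqlite file is built
DBI::dbExecute(con, "
  CREATE TABLE mapped_concepts_flat AS
  WITH resolved_concepts_mapped AS (
    SELECT concept_sets.concept_id AS resolved_concept_id,
           c1.concept_id, c1.concept_name, c1.domain_id, c1.vocabulary_id,
           c1.concept_class_id, c1.standard_concept, c1.concept_code
    FROM (SELECT DISTINCT concept_id FROM resolved_concepts) concept_sets
    INNER JOIN concept_relationship cr ON concept_sets.concept_id = cr.concept_id_2
    INNER JOIN concept c1 ON cr.concept_id_1 = c1.concept_id
    WHERE cr.relationship_id = 'Maps to' AND c1.standard_concept IS NULL
  )
  SELECT c.database_id, c.cohort_id, c.concept_set_id, mapped.*
  FROM (SELECT DISTINCT concept_id, database_id, cohort_id, concept_set_id
        FROM resolved_concepts) c
  INNER JOIN resolved_concepts_mapped mapped
    ON c.concept_id = mapped.resolved_concept_id;")

The app could then read from this table directly (optionally with an index on cohort_id) instead of running the join on every request.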