🎉 App to find similar insights #3518

Merged: 6 commits merged on Nov 11, 2024

pabloarosado
Contributor

@pabloarosado pabloarosado commented Nov 8, 2024

Create a script that launches a Streamlit app to do a semantic search over data insights.

The script loads and parses data insights (DIs) from the database, creates embeddings for them, and sorts the DIs by semantic similarity to a given input string. On my laptop, creating the embeddings takes less than 10 seconds, but ideally this would happen under the hood, with the embeddings stored in the database. For now, this is an experiment. If we decide it's useful, we can integrate it into our wizard.
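
As a rough illustration of the ranking step described above (not the code in this PR), here is a minimal sketch that embeds each text, embeds the query, and sorts by cosine similarity. The `embed` function is a toy bag-of-words stand-in, so it only captures word overlap. It is an assumption made to keep the example self-contained, whereas the actual app relies on a transformer model to get semantic embeddings. The function names and the example insight texts are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    Assumption for illustration only; a real sentence-embedding model
    would return a dense vector that also captures meaning.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_insights(query: str, insights: list[str]) -> list[tuple[float, str]]:
    """Return (score, text) pairs sorted from most to least similar to the query."""
    # Embed every insight once, then compare each one against the query.
    insight_vectors = [embed(text) for text in insights]
    query_vector = embed(query)
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for vector, text in zip(insight_vectors, insights)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical insight texts, used only to exercise the functions.
insights = [
    "Child mortality has fallen in every world region",
    "Solar power has become much cheaper over time",
    "Life expectancy has increased across the world",
]
ranked = rank_insights("price of solar power", insights)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

With a real model, only `embed` and the vector type would change; the sort by similarity stays the same, which is also why precomputing and storing the insight embeddings would make each query cheap.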

I think something like this would be useful in our wizard. Authors could use it to find what has already been written about a certain topic, and for data peeps it can open the door to other kinds of analytics and experiments with our content.

The downside is that it requires installing some big libraries (transformers and PyTorch). The first time it's built, it needs to download some models, which are ~100MB. But maybe these libraries can be useful for other similar applications.

@pabloarosado pabloarosado self-assigned this Nov 8, 2024
@owidbot
Contributor

owidbot commented Nov 8, 2024

Quick links (staging server):

Site Admin Wizard Docs

Login: ssh owid@staging-site-app-to-find-similar-insights

chart-diff: ✅ No charts for review.
data-diff: ❌ Found differences
= Dataset garden/un/2024-04-09/undp_hdr
  = Table undp_hdr
    ~ Column abr (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column co2_prod (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column coef_ineq (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column diff_hdi_phdi (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column eys (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column eys_f (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column eys_m (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column gdi (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column gdi_group (changed metadata, changed data)
+       + description_processing: |-
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.

        ~ Changed values: 11 / 7161 (0.15%)
                                country  year  gdi_group -  gdi_group +
                                 Europe  2022         <NA>     1.268419
                  High-income countries  2022         <NA>     1.392950
          Lower-middle-income countries  2022         <NA>     4.389009
                          South America  2022         <NA>     1.150919
          Upper-middle-income countries  2022         <NA>     2.057359
    ~ Column gii (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column gii_rank (changed metadata, changed data)
+       + description_processing: |-
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.

        ~ Changed values: 9 / 7161 (0.13%)
                                country  year  gii_rank -  gii_rank +
                                   Asia  2022        <NA>        3579
                                 Europe  2022        <NA>        1089
                  High-income countries  2022        <NA>        1832
                          South America  2022        <NA>        1092
          Upper-middle-income countries  2022        <NA>        3799
    ~ Column gni_pc_f (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column gni_pc_m (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column gnipc (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column hdi (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column hdi_f (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column hdi_m (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column hdi_rank (changed metadata, changed data)
+       + description_processing: |-
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.

        ~ Changed values: 11 / 7161 (0.15%)
                                country  year  hdi_rank -  hdi_rank +
                                 Europe  2022        <NA>        1537
                  High-income countries  2022        <NA>        2161
          Lower-middle-income countries  2022        <NA>        7099
                          South America  2022        <NA>        1054
          Upper-middle-income countries  2022        <NA>        4964
    ~ Column ihdi (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column ineq_edu (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column ineq_inc (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column ineq_le (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column le (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column le_f (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column le_m (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column lfpr_f (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column lfpr_m (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column loss (changed metadata, changed data)
+       + description_processing: |-
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.

        ~ Changed values: 82 / 7161 (1.15%)
                                country  year  loss -      loss +
                                 Africa  2015     NaN 1714.844482
                                 Africa  2021     NaN 1684.291626
                                 Europe  2020     NaN  351.625641
                  High-income countries  2019     NaN  535.104187
          Lower-middle-income countries  2018     NaN 1293.413940
    ~ Column mf (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column mmr (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column mys (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column mys_f (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column mys_m (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column phdi (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column pop_total (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column pr_f (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column pr_m (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column rankdiff_hdi_phdi (changed metadata, changed data)
+       + description_processing: |-
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.

        ~ Changed values: 6 / 7161 (0.08%)
                      country  year  rankdiff_hdi_phdi -  rankdiff_hdi_phdi +
                       Africa  2022                 <NA>                   98
                         Asia  2022                 <NA>                 -340
                       Europe  2022                 <NA>                  100
          European Union (27)  2022                 <NA>                   79
                South America  2022                 <NA>                  130
    ~ Column se_f (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
    ~ Column se_m (changed metadata)
-       -   We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
+       +   - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
        ?  ++
= Dataset garden/who/2024-09-09/flu_test
  = Table flu_test
    ~ Dim country
-       - Removed values: 63 / 71983 (0.09%)
                date      country
          2024-10-28        Malta
          2024-10-28        Qatar
          2024-10-28     Slovenia
          2024-10-28 South Africa
          2024-10-07       Zambia
    ~ Dim date
-       - Removed values: 63 / 71983 (0.09%)
               country       date
                 Malta 2024-10-28
                 Qatar 2024-10-28
              Slovenia 2024-10-28
          South Africa 2024-10-28
                Zambia 2024-10-07
    ~ Column denomcombined (changed data)
-       - Removed values: 63 / 71983 (0.09%)
               country       date  denomcombined
                 Malta 2024-10-28            301
                 Qatar 2024-10-28            765
              Slovenia 2024-10-28            983
          South Africa 2024-10-28             85
                Zambia 2024-10-07            110
        ~ Changed values: 106 / 71983 (0.15%)
            country       date  denomcombined -  denomcombined +
             Brazil 2024-10-21             5188             4218
           Honduras 2024-10-07               70               68
          Indonesia 2023-10-09               37               38
           Slovenia 2024-10-21             1224             1183
             Uganda 2024-09-23               58               51
    ~ Column pcnt_poscombined (changed data)
-       - Removed values: 63 / 71983 (0.09%)
               country       date  pcnt_poscombined
                 Malta 2024-10-28          2.325581
                 Qatar 2024-10-28         17.385620
              Slovenia 2024-10-28          0.305188
          South Africa 2024-10-28          5.882353
                Zambia 2024-10-07          3.636364
        ~ Changed values: 114 / 71983 (0.16%)
               country       date  pcnt_poscombined -  pcnt_poscombined +
            Costa Rica 2024-10-07            0.326442            0.326797
               Denmark 2024-10-21            1.134791            1.140251
             Indonesia 2023-08-28           43.478260           40.000000
             Indonesia 2024-04-22           23.809525           24.390244
          South Africa 2024-09-16            8.730159            8.800000


Legend: +New  ~Modified  -Removed  =Identical  Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet

Automatically updated datasets matching weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included

Edited: 2024-11-11 09:57:39 UTC
Execution time: 15.04 seconds

@pabloarosado pabloarosado requested a review from Marigold November 8, 2024 09:40
@pabloarosado pabloarosado marked this pull request as ready for review November 8, 2024 09:40
@Marigold
Collaborator

Marigold commented Nov 8, 2024

@lucasrodes could you review it please? I can't install torch on my laptop due to this issue. It's probably solvable, but I've already spent an hour on it and didn't make any progress.

@pabloarosado
Contributor Author

Thanks Mojmir, I'm sorry about that issue, it sounds annoying! If you want, I can temporarily add this app to wizard so you can play with it (in any case, I'm also happy if Lucas wants to have a look, or both of you).

@pabloarosado
Contributor Author

Hey @Marigold, I've moved it to wizard, so you can try it out. But of course, if this is going to break your ETL environment, we shouldn't push it. I find it very useful, and having that library in ETL could also let us experiment with other similar things, but we can also move it to its own repo if it's problematic (or discard it if others don't find it useful, it's just an experiment). Let me know what you think, thanks.

Collaborator

@Marigold Marigold left a comment

Looks useful! I made it work with torch<2.3.0 (and committed to your PR).

@pabloarosado pabloarosado merged commit d286c4a into master Nov 11, 2024
8 checks passed
@pabloarosado pabloarosado deleted the app-to-find-similar-insights branch November 11, 2024 11:06