New ETL channel (or namespace) to track dependencies of external repositories #2508

Closed
pabloarosado opened this issue Apr 10, 2024 · 3 comments · Fixed by #2587

pabloarosado commented Apr 10, 2024

Context & motivation

  • Currently, we track dependencies of anything used in Grapher charts, so we get an error if we delete one of those dependencies
  • But we don't track all downstream uses:
    • Explorers
    • External repositories

We would love to get an error early, rather than breaking something.

Proposal

Inspired by this thread, I thought we could have a new ETL channel (or, as an easier alternative, a new namespace), e.g. repository, which hosts the latest versions of steps that are loaded by external repositories, like poverty-data, covid-19-data, co2-data, energy-data, and also owid-grapher.

With this, if anyone accidentally deletes or archives a dependency of these repository steps, VersionTracker will raise an error. We could also show these dependencies in the ETL Dashboard (and treat them as always active, even if they don't have charts).

Specifically, Sophia said it would be useful for her to have the latest versions of grapher/worldbank_wdi/2023-05-29/wdi/wdi#ny_gdp_pcap_pp_kd and grapher/demography/2023-03-31/population/population#population_historical. These could easily be created in the new channel or namespace.

Technical notes

  • Creating a new channel would be ideal, but it requires much more work. A much cheaper alternative would be to create a new namespace.

Scope

  • Decide on a special reserved namespace
  • Make a new dataset for Sophia (for example), using a special namespace
  • Modify the version tracker to always consider that namespace used
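
To make the last scope item concrete, here is a minimal sketch of the check, not the actual VersionTracker code. It assumes the DAG is a plain dict from step path to its set of dependencies, that paths look like channel/namespace/version/name, and that the reserved namespace is called `repository`; the function names and the `latest` version are hypothetical.

```python
from typing import Dict, List, Set

# Assumption: the reserved namespace proposed in this issue.
RESERVED_NAMESPACE = "repository"


def namespace_of(step: str) -> str:
    """Return the namespace of a path like 'grapher/demography/2023-03-31/population'."""
    parts = step.split("/")
    if len(parts) < 2:
        raise ValueError(f"Unexpected step path: {step}")
    return parts[1]


def always_used_steps(dag: Dict[str, Set[str]]) -> Set[str]:
    """Steps in the reserved namespace count as used, even without charts."""
    return {step for step in dag if namespace_of(step) == RESERVED_NAMESPACE}


def find_missing_dependencies(dag: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each reserved step to the (transitive) dependencies absent from the active DAG."""
    missing: Dict[str, List[str]] = {}
    for step in sorted(always_used_steps(dag)):
        stack = list(dag[step])
        seen: Set[str] = set()
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep not in dag:
                # The dependency was deleted or archived: report it.
                missing.setdefault(step, []).append(dep)
            else:
                stack.extend(dag[dep])
        if step in missing:
            missing[step].sort()
    return missing
```

A tracker built along these lines would fail as soon as `find_missing_dependencies` returns a non-empty dict, which is the early error the proposal asks for.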

danyx23 commented Apr 10, 2024

Another option would be to add an alias system that keeps what are probably basically our MiMs in an easily accessible namespace.

I.e. the aliases would not be an ETL channel or anything like that, but a new kind of URI scheme that is mapped to full ETL paths. Maybe something to discuss at our offsite.
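
A rough sketch of what such an alias lookup could look like, purely as an illustration. The `alias://` scheme, the alias names, and the `resolve` function are all hypothetical; the two target paths are the ones mentioned in the issue description.

```python
from typing import Dict

# Hypothetical scheme prefix for aliases.
ALIAS_SCHEME = "alias://"

# Hypothetical alias table, mapping short names to full ETL paths.
ALIASES: Dict[str, str] = {
    "alias://population": "grapher/demography/2023-03-31/population/population#population_historical",
    "alias://gdp_per_capita": "grapher/worldbank_wdi/2023-05-29/wdi/wdi#ny_gdp_pcap_pp_kd",
}


def resolve(uri: str, aliases: Dict[str, str] = ALIASES) -> str:
    """Return the full ETL path for an alias; pass non-alias paths through unchanged."""
    if not uri.startswith(ALIAS_SCHEME):
        return uri
    if uri not in aliases:
        raise KeyError(f"Unknown alias: {uri}")
    return aliases[uri]
```

With a table like this, updating an indicator to a new version would only mean changing one entry, and consumers would keep using the same alias.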

@sophiamersmann
Member

Grapher has the following indicator IDs currently hardcoded:

  • Continents: 123 (probably ok)
  • Population: 597929 (used in the admin as default size dimension for scatter plots)
  • More population variables: 525709, 525711, 597929, 597930 (used to exclude the population indicator from being displayed in the sources modal in certain cases)

In the future, the entity selector will have these two variables hardcoded (not yet merged, but implemented in owid/owid-grapher#3466):

  • Population: 597930 (used to sort entity names)
  • GDP per capita: 735665 (used to sort entity names)

sophiamersmann added a commit to owid/owid-grapher that referenced this issue May 3, 2024
[Cycle 2024.2: Entity selector](#3349) | [Designs](https://www.figma.com/file/X5mOEX8zULS6qyHocUYdmh/Grapher-UI?type=design&node-id=2523%3A6266&mode=design&t=7edFp79OOjz6RENz-1)

## Summary

Offers to sort by "Population" and "GDP per capita", even if the chart doesn't include population or GDP per capita indicators.

## Details

- If the chart has a Population or GDP per capita indicator, then we re-use that data
- If we need to download additional indicators, then we do that on demand, i.e. selecting "Population" or "GDP per capita" triggers their download
- Indicator IDs for population and GDP per capita are hard-coded (but the data team might come up with a [better solution](owid/etl#2508)) 
- "Population" and "GDP per capita" are only offered for selection if entities are detected to include countries or regions
    - This is done by checking whether any of the available entities are listed in the [regions.json](https://github.com/owid/owid-grapher/blob/master/packages/%40ourworldindata/utils/src/regions.json) file
    - Testing the available entities against the `regions.json` file is not perfect since the default population and GDP per capita indicators that we are using have data for a few entities that are not listed in `regions.json` (see details below)
    - However, we only need a single matching entity to trigger sorting by population or GDP per capita, so in practice this works well
    - If we wanted to be more correct here, we could also download population and GDP per capita metadata when the entity selector is opened and then check the actual population/GDP per capita entities against the entities that are available for the chart
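
As an illustration only, the detection rule described above boils down to a single-match check. The sketch below is in Python rather than Grapher's own code, and it assumes the region names have already been extracted from `regions.json` into a set; the function name and sample data are hypothetical.

```python
from typing import Iterable, Set


def offers_population_and_gdp_sorting(
    available_entities: Iterable[str], region_names: Set[str]
) -> bool:
    """Offer the extra sort options if at least one entity is a known country or region."""
    return any(entity in region_names for entity in available_entities)


# Hypothetical sample of names taken from regions.json.
sample_region_names = {"France", "Germany", "Africa"}
chart_with_countries = ["Africa (UN)", "France"]
chart_without_countries = ["Africa (UN)", "High-income countries"]
```

Here `chart_with_countries` would trigger the sort options because of the single match on "France", while `chart_without_countries` would not.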

<details><summary>Entities of the population or GDP per capita indicator that are not included in the `regions.json` file</summary>

- For the population indicator:
    - Africa (UN)
    - Asia (UN)
    - Europe (UN)
    - High-income countries
    - Latin America and the Caribbean (UN)
    - Low-income countries
    - Lower-middle-income countries
    - Northern America (UN)
    - Oceania (UN)
    - Upper-middle-income countries
- For the GDP per capita indicator:
    - East Asia and Pacific (WB)
    - Europe and Central Asia (WB)
    - High-income countries
    - Latin America and Caribbean (WB)
    - Low-income countries
    - Lower-middle-income countries
    - Middle East and North Africa (WB)
    - Middle-income countries
    - North America (WB)
    - South Asia (WB)
    - Sub-Saharan Africa (WB)
    - Upper-middle-income countries

</details> 

## SVG tester

The SVG tester fails due to the changes in #3373
@pabloarosado
Contributor Author

Now that the external channel is created, I've created #2609 to add all existing dependencies (including the ones listed by Sophia above).
