API change proposal for `owid.catalog` #767

larsyencken · 2023-01-13T01:19:41Z

larsyencken
Jan 13, 2023
Maintainer

Background

We designed our data pipeline to produce garden, a form of our data that is well suited to data science. It gives us efficient file formats and the owid-catalog library to help us navigate them. As we have learned more, it makes sense to clean up and standardise the owid-catalog API.

This year we will:

Keep improving the ETL and how we import data
Build data pages and refactor grapher, for which we'll want a JSON data API for specific known variables (search not necessary)
Start to point people to our data catalog and ETL steps from our main site

It's enough to motivate a small cleanup of this API, though not a lot more.

Proposed new API, in examples

Loading a catalog

You begin by calling connect(), which creates a RemoteCatalog instance by default without loading any data.

>>> from owid.catalog import connect
>>> cat = connect()

Motivation

Previously, we implemented methods like find() and find_one() directly on the owid.catalog module. But this actually limited our API, since you can't write special methods like __getitem__ on a module. It meant we had two APIs, one at the module level, and one if you used a LocalCatalog or RemoteCatalog instance. By forcing you to connect(), we can ensure we only have one API to maintain.
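
The module limitation can be reproduced in a few lines. This is a minimal sketch, not the real library: `SketchCatalog` and the toy `connect()` below are hypothetical stand-ins. Python looks up special methods such as `__getitem__` on an object's type, so attaching one to a module has no effect on `module[...]`, whereas an instance returned by `connect()` can support it.

```python
import types


class SketchCatalog:
    """Hypothetical stand-in for the object returned by connect()."""

    def __init__(self, tables):
        self._tables = dict(tables)

    def __getitem__(self, path):
        # Special methods defined on a class are honoured by cat[path].
        return self._tables[path]


def connect():
    """Toy connect(): returns a catalog object without loading any data."""
    return SketchCatalog(
        {"garden/energy/2022-12-28/owid_energy/owid_energy": "<table handle>"}
    )


# A module carrying a __getitem__ attribute is still not subscriptable,
# because Python looks the special method up on the module type.
fake_module = types.ModuleType("fake_catalog")
fake_module.__getitem__ = lambda path: "<table handle>"

try:
    fake_module["garden/energy/2022-12-28/owid_energy/owid_energy"]
    module_subscriptable = True
except TypeError:
    module_subscriptable = False

cat = connect()
handle = cat["garden/energy/2022-12-28/owid_energy/owid_energy"]
```

Here `module_subscriptable` ends up `False`, while the instance lookup succeeds.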

Previous API

>>> from owid import catalog

You did not need to call connect(), since there were methods defined directly on owid.catalog.

Finding a dataset

Calling cat.find(query) works just like it does today, except that for datasets with multiple versions, only the latest is returned.

>>> cat.find('owid_energy')
            table      dataset     version namespace channel  is_public       dimensions                                              path             formats
1027  owid_energy  owid_energy  2022-12-28    energy  garden       True  [country, year]  garden/energy/2022-12-28/owid_energy/owid_energy  [feather, parquet]

We can also find all versions:

>>> cat.find('owid_energy', latest=False)
            table      dataset     version namespace channel  is_public       dimensions                                              path             formats
1026  owid_energy  owid_energy  2022-12-13    energy  garden       True  [country, year]  garden/energy/2022-12-13/owid_energy/owid_energy  [feather, parquet]
1027  owid_energy  owid_energy  2022-12-28    energy  garden       True  [country, year]  garden/energy/2022-12-28/owid_energy/owid_energy  [feather, parquet]

Because this behaviour is baked in, there is no find_latest() method in the new API.

Motivation

The previous API was designed when we were lucky to have one version of any dataset available. As we increasingly begin to do updates in the ETL, we're going to have a lot of versions of data available, and a consumer should not have to think about which one to load.

The current version of find_latest() also has a footgun in that it will return the table with the most recent version string, without checking that the options are all versions of the same table.
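
To make the footgun concrete, here is a rough sketch using plain dicts rather than the real catalog frame. The third row (`owid_energy_mix`) is a made-up table added for illustration, and the sketch assumes versions are ISO dates. Taking the newest version string across all matches can return an unrelated table, while grouping by table identity first gives one latest row per table.

```python
from collections import defaultdict

# Toy catalog rows. The first two mirror the find() output above; the third
# is a hypothetical, unrelated table that happens to have a newer version.
ROWS = [
    {"channel": "garden", "namespace": "energy", "dataset": "owid_energy",
     "table": "owid_energy", "version": "2022-12-13"},
    {"channel": "garden", "namespace": "energy", "dataset": "owid_energy",
     "table": "owid_energy", "version": "2022-12-28"},
    {"channel": "garden", "namespace": "energy", "dataset": "owid_energy_mix",
     "table": "owid_energy_mix", "version": "2023-01-05"},
]


def naive_latest(rows):
    """The footgun: pick the newest version string across all matches."""
    return max(rows, key=lambda r: r["version"])


def latest_per_table(rows):
    """Group rows by table identity, then keep the newest version per group."""
    groups = defaultdict(list)
    for row in rows:
        identity = (row["channel"], row["namespace"], row["dataset"], row["table"])
        groups[identity].append(row)
    # Assumption: versions are ISO dates (YYYY-MM-DD), which sort as strings.
    return [max(group, key=lambda r: r["version"]) for group in groups.values()]


matches = [r for r in ROWS if "owid_energy" in r["table"]]
naive = naive_latest(matches)       # picks the unrelated owid_energy_mix row
latest = latest_per_table(matches)  # one latest row per distinct table
```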

Alternate possibility

One alternative would be to have find() return table families, rather than tables, where a table family is a table available in a list of versions.

>>> cat.find('owid_energy')
            table      dataset                  versions  namespace channel  is_public       dimensions                                     path             formats
1026  owid_energy  owid_energy  [2022-12-13, 2022-12-28]     energy  garden       True  [country, year]  garden/energy/*/owid_energy/owid_energy  [feather, parquet]

This would have the flow-on effect of moving the version choice to the load() verb:

>>> owid_energy = cat.find_one('owid_energy')
>>> df_latest = owid_energy.load()  # version='latest' by default
>>> df_prev = owid_energy.load(version='2022-12-13')
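
A table family could be modelled roughly as below. `TableFamily` and its `path()` method are hypothetical names, not part of the library. The sketch only shows how a `version` argument might be resolved to a concrete catalog path before anything is loaded, assuming versions are ISO dates.

```python
from dataclasses import dataclass


@dataclass
class TableFamily:
    """Hypothetical: one table that is available in several versions."""
    channel: str
    namespace: str
    dataset: str
    table: str
    versions: list

    def path(self, version="latest"):
        """Resolve a version choice to a concrete catalog path."""
        if version == "latest":
            # Assumption: ISO dates, so the string maximum is the newest.
            version = max(self.versions)
        elif version not in self.versions:
            raise KeyError(f"unknown version: {version}")
        return f"{self.channel}/{self.namespace}/{version}/{self.dataset}/{self.table}"


family = TableFamily("garden", "energy", "owid_energy", "owid_energy",
                     ["2022-12-13", "2022-12-28"])
latest_path = family.path()                      # version='latest' by default
prev_path = family.path(version="2022-12-13")    # pin an older version
```

A `load(version=...)` method would then resolve the path this way and read the file behind it.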

Previous API

>>> cat.find('owid_energy')
            table      dataset     version namespace channel  is_public       dimensions                                              path             formats
1026  owid_energy  owid_energy  2022-12-13    energy  garden       True  [country, year]  garden/energy/2022-12-13/owid_energy/owid_energy  [feather, parquet]
1027  owid_energy  owid_energy  2022-12-28    energy  garden       True  [country, year]  garden/energy/2022-12-28/owid_energy/owid_energy  [feather, parquet]

Picking a channel

We support the same API as before at the module level, where you can search for multiple channels:

>>> cat.find('energy_coverage', channels=['backport'])
                                table                           dataset  ...                                               path             formats
740  dataset_5309_iea_energy_coverage  dataset_5309_iea_energy_coverage  ...  backport/owid/latest/dataset_5309_iea_energy_c...  [feather, parquet]

However, we move the implementation of LocalCatalog, RemoteCatalog and CatalogMixin to using one lazy-loaded frame of options per channel, instead of one overall frame.
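
As an illustration of the per-channel idea, the sketch below keeps one lazily loaded list of rows per channel. `LazyChannelIndex`, `fake_loader` and the default channel are all assumptions made for the example. A real implementation would hold data frames and fetch them from the catalog.

```python
class LazyChannelIndex:
    """Hypothetical: one lazily loaded list of rows per channel."""

    def __init__(self, loader):
        self._loader = loader   # callable: channel name -> list of row dicts
        self._frames = {}       # channel name -> rows, filled on first use

    def frame(self, channel):
        if channel not in self._frames:
            self._frames[channel] = self._loader(channel)
        return self._frames[channel]

    def find(self, query, channels=("garden",)):  # default is an assumption
        # Only the requested channels are loaded and searched.
        return [row
                for channel in channels
                for row in self.frame(channel)
                if query in row["table"]]


FAKE_REMOTE = {
    "garden": [{"table": "owid_energy", "channel": "garden"}],
    "backport": [{"table": "dataset_5309_iea_energy_coverage",
                  "channel": "backport"}],
}
loaded = []  # records which channels were actually fetched


def fake_loader(channel):
    loaded.append(channel)
    return FAKE_REMOTE[channel]


index = LazyChannelIndex(fake_loader)
backport_hits = index.find("energy_coverage", channels=["backport"])
garden_hits = index.find("energy", channels=["garden"])
garden_again = index.find("owid", channels=["garden"])  # served from cache
```

Because each channel has its own frame, switching channels never requires invalidating a shared cache, and a search restricted to garden cannot pick up backport rows.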

Motivation

There are inconsistencies in how find() is implemented and presented that we want to streamline and remove.

Previous API

The previous API let you search for multiple channels at the module level:

>>> catalog.find(..., channels=['garden', 'backport'])

but if you loaded a RemoteCatalog or LocalCatalog, the interface was different and only let you pick one channel:

>>> cat = catalog.RemoteCatalog()
>>> cat.find(..., channel='garden')

We also have some awkward logic that invalidates and reloads the global catalog cache whenever you change which channels you're looking at. This change would let us remove it.

Loading a pinned dataset

This is nearly unchanged from before, but it becomes more natural since you're already working with a catalog object.

>>> from owid import catalog
>>> cat = catalog.connect()
>>> df = cat['garden/energy/2022-12-28/owid_energy/owid_energy'].load()

We call .load() at the end, since you might want to interrogate the object for other metadata, or get a URI for a parquet file for example, or do some other operation than just getting a data frame.
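
One way to picture the object returned by `cat[path]` is a lightweight handle that knows its metadata and defers any I/O. The sketch below is an assumption-heavy illustration: `TableHandle`, `from_path()` and `uri()` are hypothetical names, the base URL is a placeholder, and the `<path>.<format>` file layout is assumed. The five path segments follow the channel, namespace, version, dataset and table columns shown in the find() output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableHandle:
    """Hypothetical lightweight handle returned by cat[path]."""
    channel: str
    namespace: str
    version: str
    dataset: str
    table: str

    @classmethod
    def from_path(cls, path):
        parts = path.split("/")
        if len(parts) != 5:
            raise ValueError(
                f"expected channel/namespace/version/dataset/table, got {path!r}"
            )
        return cls(*parts)

    def uri(self, base_uri, fmt="parquet"):
        """Build a file URI without fetching anything (layout is assumed)."""
        return (f"{base_uri}/{self.channel}/{self.namespace}/{self.version}/"
                f"{self.dataset}/{self.table}.{fmt}")


handle = TableHandle.from_path("garden/energy/2022-12-28/owid_energy/owid_energy")
uri = handle.uri("https://example.org/catalog")  # placeholder base URL
```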

Previous API

In the existing API, you might have been calling catalog.find() at the module level, but you need to switch gears and instantiate a RemoteCatalog() instance if you want to get a data frame at a specific path.

>>> from owid import catalog
>>> cat = catalog.RemoteCatalog()
>>> df = cat['garden/energy/2022-12-28/owid_energy/owid_energy']

Working with indexes

We rename RemoteCatalog to IndexedCatalog, since the most important thing is not the location; it's that it has an index that is the source of truth for what's there.

>>> from owid.catalog import LocalCatalog, IndexedCatalog
>>> lc = LocalCatalog('/path/to/data')
>>> lc.find('energy')  # scans garden folders for matches
...
>>> index = lc.reindex()  # expensive reindex step
>>> index.find('energy')  # scans the index of the garden channel for matches
...
>>> local = IndexedCatalog('/path/to/data')  # a local folder with an index
>>> public = IndexedCatalog('https://path/to/data')  # a public connection that does not include private data
>>> private = IndexedCatalog('s3://path/to/data')  # an authenticated connection that includes private data

Motivation

We make changes to the local catalog all the time without reindexing; it's the default behaviour in dev. That means that our indexes will be constantly out of date, but a LocalCatalog instance will use them regardless.

This change allows us to access the local catalog through two APIs, one which scans the disk as needed and one which operates off a cached index of what's there.
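
The difference between the two access paths can be sketched with the standard library alone. The function names and the `index.json` file name below are hypothetical, and the sketch assumes one `.feather` file per table. Scanning always reflects what is on disk, while the cached index goes stale as soon as data changes without a reindex.

```python
import json
import tempfile
from pathlib import Path

INDEX_NAME = "index.json"  # hypothetical index file name


def scan(root):
    """Walk the folder and list tables from the files actually on disk."""
    root = Path(root)
    return sorted(p.relative_to(root).with_suffix("").as_posix()
                  for p in root.rglob("*.feather"))


def reindex(root):
    """Expensive step: scan once and cache the result next to the data."""
    paths = scan(root)
    (Path(root) / INDEX_NAME).write_text(json.dumps(paths))
    return paths


def read_index(root):
    """Cheap step: trust the cached index, even if it is out of date."""
    return json.loads((Path(root) / INDEX_NAME).read_text())


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "garden/energy/2022-12-13/owid_energy"
    first.mkdir(parents=True)
    (first / "owid_energy.feather").touch()
    reindex(tmp)

    # A new version appears on disk, but nobody reindexes.
    second = Path(tmp) / "garden/energy/2022-12-28/owid_energy"
    second.mkdir(parents=True)
    (second / "owid_energy.feather").touch()

    scanned = scan(tmp)        # sees both versions
    indexed = read_index(tmp)  # still only knows about the first
```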

Previous API

The previous API has LocalCatalog which always operates off an index, and creates it if need be, and RemoteCatalog which must always use HTTP(S).

>>> from owid import catalog
>>> lc = catalog.LocalCatalog('/path/to/data')  # sometimes does an expensive reindex, sometimes instant

danyx23 · 2023-01-13T11:02:24Z

danyx23
Jan 13, 2023
Maintainer

These all sound good to me, although I haven't used the Python API much lately, so others are probably in a better position to evaluate the details. I'd be happy for this to go ahead as described.


Marigold · 2023-01-13T12:09:56Z

Marigold
Jan 13, 2023
Maintainer

I love it. I never really liked the find method: it uses global variables internally, does some implicit magic and is not very discoverable.

Loading a catalog

Yes!!!

Finding a dataset

Calling find('natural_disasters') now returns two datasets, natural_disasters and natural_disasters_yearly, because we're matching on substring.

I never remember what the first argument represents. Is it a table? Is it a dataset? When I search for a phrase, it's equally likely to be in a table or a dataset. How about we make the first argument a magical query= and then match it on table name, dataset name, path, table / dataset description ... we could even do string similarity. Once I find what I'm looking for, I'll call it with explicit keyword arguments. (This could be implemented as cat.query(...) of course)

Picking a channel

Yeah, this is a mess. I think there might be a bug where searching for backport channel first and then for garden channel later will return results for both channels (not filtering to garden). That's because of global variables and how we do lazy loading currently.

Working with indexes

This change allows us to access the local catalog through two APIs, one which scans the disk as needed and one which operates off a cached index of what's there.

Is there any use case for an indexed local catalog (i.e. having the index cached in a file)? It has only caused confusion in the past. We could reindex it in __init__ or have a method for it. I'd also remove the reindex command and make it part of publish.

We don't use LocalCatalog anywhere in steps. We load all datasets with Dataset('...path...') which works well. I'm not sure if we even need LocalCatalog at all. Perhaps it could be replaced by a few functions returning lists of Datasets?

1 reply

larsyencken Jan 14, 2023
Maintainer Author

On find(), I agree we should have a vague contract on what the query checks, but also let you specify an exact table if you like. That said, I think the pain found in the past was due to a misuse of find() in steps, in a situation where we actually wanted to specify a more exact dataset by path.

On LocalCatalog, it's good that we've made things simple enough that we can skip the abstraction. It's not quite clear to me what things will look like after a refactor, and whether it will be useful or not. I'm tempted to improve it but time box the improvement, knowing we might throw it away later.

larsyencken · 2023-01-14T06:54:42Z

larsyencken
Jan 14, 2023
Maintainer Author

I think we should write find(query) as if it might later be powered by Algolia.

