API change proposal for owid.catalog
#767
-
These all sound good to me, although I haven't used the Python API much lately, so others are probably in a better position to evaluate the details. I'd be happy for this to go ahead as described.
-
I love it. I never really liked
Loading a catalog
Yes!!!
Finding a dataset
Calling I never remember what the first argument represents. Is it a table? Is it a dataset? When I search for a phrase, it's equally likely it's either in a table or a phrase. How about we make the first argument magical
Picking a channel
Yeah, this is a mess. I think there might be a bug where searching for
Working with indexes
Is there any use case for an indexed local catalog (i.e. having the index cached in a file)? It has only caused confusion in the past. We could reindex it in We don't use
-
I think we should write
-
Background
We designed our data pipeline to create garden, a form of our data that's excellent for doing data science in. This gives us efficient file formats and the owid-catalog library to help us navigate them. As we have learned more, it makes sense to clean up and standardise the owid-catalog API more.
This year we will:
It's enough to motivate a small cleanup of this API, though not a lot more.
Proposed new API, in examples
Loading a catalog
You begin by calling connect(), which creates a RemoteCatalog instance by default without loading any data.
Motivation
Previously, we implemented methods like find() and find_one() directly on the owid.catalog module. But this actually limited our API, since you can't write special methods like __getitem__ on a module. It meant we had two APIs, one at the module level, and one if you used a LocalCatalog or RemoteCatalog instance. By forcing you to connect(), we can ensure we only have one API to maintain.
Previous API
You did not need to call .connect(), since there were methods defined directly on owid.catalog.
Finding a dataset
Calling cat.find(query) works just like it does today, except that for datasets with multiple versions, only the latest is returned.
>>> cat.find('owid_energy')
            table      dataset     version  namespace  channel  is_public       dimensions                                              path             formats
1027  owid_energy  owid_energy  2022-12-28     energy   garden       True  [country, year]  garden/energy/2022-12-28/owid_energy/owid_energy  [feather, parquet]
We can also find all versions:
Because this behaviour is baked in, there is no find_latest() method in the new API.
Motivation
The previous API was designed when we were lucky to have one version of any dataset available. As we increasingly begin to do updates in the ETL, we're going to have a lot of versions of data available, and a consumer should not have to think about which one to load.
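As a rough illustration of what "only the latest is returned" could mean, here is a minimal sketch in plain Python. It is not the owid-catalog implementation: the rows are made up (other_table is invented), latest_only() is a hypothetical helper, and it assumes versions are ISO-style dates. It groups rows by the table's full identity before comparing versions, so versions of different tables are never compared with each other.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]

# Made-up index rows using the column names from the find() output above.
# "other_table" is invented purely to show that tables are not mixed up.
ROWS: List[Row] = [
    {"channel": "garden", "namespace": "energy", "dataset": "owid_energy",
     "table": "owid_energy", "version": "2022-12-13"},
    {"channel": "garden", "namespace": "energy", "dataset": "owid_energy",
     "table": "owid_energy", "version": "2022-12-28"},
    {"channel": "garden", "namespace": "energy", "dataset": "owid_energy",
     "table": "other_table", "version": "2023-01-01"},
]


def latest_only(rows: List[Row]) -> List[Row]:
    """Keep the newest row for each (channel, namespace, dataset, table)."""
    newest: Dict[Tuple[str, str, str, str], Row] = {}
    for row in rows:
        key = (row["channel"], row["namespace"], row["dataset"], row["table"])
        # Assumes ISO-style versions (YYYY-MM-DD), which sort as plain strings.
        if key not in newest or row["version"] > newest[key]["version"]:
            newest[key] = row
    return list(newest.values())


for row in latest_only(ROWS):
    print(row["table"], row["version"])
```

If versions were ever not date-like strings, the comparison would need a proper version key instead of plain string ordering.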
The current version of find_latest() also has a footgun in that it will return the table with the most recent version string, without checking that the options are all versions of the same table.
Alternate possibility
One alternative would be to have find() return table families, rather than tables, where a table family is a table available in a list of versions.
>>> cat.find('owid_energy')
            table      dataset                  versions  namespace  channel  is_public       dimensions                                     path             formats
1026  owid_energy  owid_energy  [2022-12-13, 2022-12-28]     energy   garden       True  [country, year]  garden/energy/*/owid_energy/owid_energy  [feather, parquet]
This would have the flow-on effect of moving the version choice to the load() verb:
Previous API
>>> cat.find('owid_energy')
            table      dataset     version  namespace  channel  is_public       dimensions                                              path             formats
1026  owid_energy  owid_energy  2022-12-13     energy   garden       True  [country, year]  garden/energy/2022-12-13/owid_energy/owid_energy  [feather, parquet]
1027  owid_energy  owid_energy  2022-12-28     energy   garden       True  [country, year]  garden/energy/2022-12-28/owid_energy/owid_energy  [feather, parquet]
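For the table-family alternative, the two rows above could be folded into a single family row, with the version picked only at load time. The sketch below is an assumption about how that might look, not a proposed implementation: to_families() and resolve_path() are hypothetical names, and the wildcard path simply mirrors the garden/energy/*/... pattern from the family example.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# The two rows from the output above, reduced to the columns needed here.
ROWS = [
    {"table": "owid_energy", "dataset": "owid_energy", "namespace": "energy",
     "channel": "garden", "version": "2022-12-13"},
    {"table": "owid_energy", "dataset": "owid_energy", "namespace": "energy",
     "channel": "garden", "version": "2022-12-28"},
]


def to_families(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Fold per-version rows into one row per table, with a versions list."""
    versions = defaultdict(list)
    for row in rows:
        key = (row["channel"], row["namespace"], row["dataset"], row["table"])
        versions[key].append(row["version"])
    return [
        {"channel": c, "namespace": n, "dataset": d, "table": t,
         "versions": sorted(v), "path": f"{c}/{n}/*/{d}/{t}"}
        for (c, n, d, t), v in versions.items()
    ]


def resolve_path(family: Dict[str, object], version: Optional[str] = None) -> str:
    """Pick a version at load time; default to the newest one."""
    chosen = version or family["versions"][-1]
    if chosen not in family["versions"]:
        raise ValueError(f"unknown version: {chosen}")
    return family["path"].replace("*", chosen, 1)


family = to_families(ROWS)[0]
print(family["versions"])
print(resolve_path(family))
print(resolve_path(family, "2022-12-13"))
```

The trade-off is that find() results get shorter, but every consumer of a row then has to go through a resolution step before it has a concrete path.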
Picking a channel
We support the same API as before at the module level, where you can search for multiple channels:
However, we move the implementation of LocalCatalog, RemoteCatalog and CatalogMixin to using one lazy-loaded frame of options per channel, instead of one overall frame.
Motivation
There are inconsistencies in how find() is implemented and presented that we want to streamline and remove.
Previous API
The previous API let you search for multiple channels at the module level:
but if you loaded a RemoteCatalog or LocalCatalog, the interface was different and only let you pick one channel:
We also have a bunch of weird logic to invalidate and reload the global catalog cache whenever you change which channels you're looking at. This change would let us remove it.
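To show why per-channel frames would make the cache invalidation unnecessary, here is a small sketch under stated assumptions. ChannelFrames and fake_loader are hypothetical names, plain lists of dicts stand in for data frames, and the channel names and rows are only examples. Each channel's frame is loaded on first use and kept, so asking for a different set of channels never forces a reload of what's already there.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


class ChannelFrames:
    """Holds one lazily loaded list of rows ("frame") per channel."""

    def __init__(self, load_channel: Callable[[str], List[Row]]) -> None:
        self._load_channel = load_channel
        self._frames: Dict[str, List[Row]] = {}

    def frame(self, channel: str) -> List[Row]:
        # Load a channel's frame on first use, then reuse it.
        if channel not in self._frames:
            self._frames[channel] = self._load_channel(channel)
        return self._frames[channel]

    def find(self, query: str, channels: Iterable[str] = ("garden",)) -> List[Row]:
        hits: List[Row] = []
        for channel in channels:
            hits.extend(r for r in self.frame(channel) if query in r["table"])
        return hits


# Stand-in loader with made-up rows; it records which channels were fetched.
fetched: List[str] = []


def fake_loader(channel: str) -> List[Row]:
    fetched.append(channel)
    return [{"table": "owid_energy", "channel": channel}]


frames = ChannelFrames(fake_loader)
frames.find("energy")                                  # loads garden only
frames.find("energy", channels=["garden", "meadow"])   # adds meadow, reuses garden
print(fetched)
```

With this shape, both the module-level and the instance-level find() could accept a list of channels and share the same code path.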
Loading a pinned dataset
This is nearly unchanged from before, but it becomes more natural since you're already working with a catalog object.
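The sketch below illustrates the general shape with toy classes. It is not the real owid.catalog interface: ToyCatalog, TableHandle, uri() and the example.org base URI are all hypothetical, and the rows are made up. The idea is that subscripting a catalog object with a path hands back a lazy handle, and no data is read until load() is called.

```python
from typing import Callable, Dict, List

Rows = List[Dict[str, object]]


class TableHandle:
    """A lazy reference to one table; nothing is read until load()."""

    def __init__(self, base_uri: str, path: str, reader: Callable[[str], Rows]) -> None:
        self.base_uri = base_uri
        self.path = path
        self._reader = reader

    def uri(self, fmt: str = "parquet") -> str:
        return f"{self.base_uri}/{self.path}.{fmt}"

    def load(self) -> Rows:
        return self._reader(self.path)


class ToyCatalog:
    def __init__(self, base_uri: str, tables: Dict[str, Rows]) -> None:
        self.base_uri = base_uri
        self._tables = tables
        self.reads = 0  # counts actual data reads, to show laziness

    def _read(self, path: str) -> Rows:
        self.reads += 1
        return self._tables[path]

    def __getitem__(self, path: str) -> TableHandle:
        if path not in self._tables:
            raise KeyError(path)
        return TableHandle(self.base_uri, path, self._read)


PATH = "garden/energy/2022-12-28/owid_energy/owid_energy"
# Placeholder base URI and made-up rows, for illustration only.
cat = ToyCatalog("https://example.org/catalog", {PATH: [{"country": "Testland", "year": 2020}]})

handle = cat[PATH]      # no data read yet
print(handle.uri())     # a URI you could hand to another tool
rows = handle.load()    # the read happens here
```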
We call .load() at the end, since you might want to interrogate the object for other metadata, get a URI for a parquet file, or do some operation other than just getting a data frame.
Previous API
In the existing API, you might have been doing catalog.find() at the module level, but you need to switch gears and instantiate a RemoteCatalog() instance if you want to get a data frame at a specific path.
Working with indexes
We rename RemoteCatalog to IndexedCatalog, since the most important thing is not the location, it's that it has an index that is the source of truth for what's there.
Motivation
We make changes to the local catalog all the time without reindexing; it's the default behaviour in dev. That means that our indexes will be constantly out of date, but a LocalCatalog instance will use them regardless.
This change allows us to access the local catalog through two APIs, one which scans the disk as needed and one which operates off a cached index of what's there.
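The difference between the two access modes can be shown with a small self-contained sketch. Everything here is an assumption for illustration: the index.json file name, the one-.feather-file-per-table layout, and the helper names are hypothetical and not how owid-catalog stores its index. It writes an index, adds a new version without reindexing, and compares what each mode sees.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Tuple


def scan(root: Path) -> List[str]:
    """List table paths by walking the disk (assumes one .feather file per table)."""
    return sorted(p.relative_to(root).with_suffix("").as_posix() for p in root.rglob("*.feather"))


def write_index(root: Path) -> None:
    (root / "index.json").write_text(json.dumps(scan(root)))


def read_index(root: Path) -> List[str]:
    return json.loads((root / "index.json").read_text())


def add_table(root: Path, path: str) -> None:
    target = root / f"{path}.feather"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.touch()


def demo() -> Tuple[List[str], List[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        add_table(root, "garden/energy/2022-12-13/owid_energy/owid_energy")
        write_index(root)
        # A new version lands on disk without a reindex, as happens in dev.
        add_table(root, "garden/energy/2022-12-28/owid_energy/owid_energy")
        return read_index(root), scan(root)


indexed, scanned = demo()
print(len(indexed), len(scanned))
```

The cached index misses the newer version until someone reindexes, while the scan always reflects what's on disk at the cost of walking the directory tree.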
Previous API
The previous API has LocalCatalog, which always operates off an index and creates it if need be, and RemoteCatalog, which must always use HTTP(S).