Switch to a date-based catalog versioning system, and related updates #243

Merged: 45 commits into main on Nov 12, 2024

Conversation

marc-white
Collaborator

Closes #191.

This PR significantly alters the way the catalog generated by access-nri-intake-catalog is handled, stored, and versioned.

Catalog storage location

The live catalog will now be stored on Gadi under /g/data/xp65 (access_nri_intake.CATALOG_LOCATION), rather than being shipped with the package. All of the catalog data lives on Gadi anyway, so there is little point in being able to see the catalog definition YAML without being able to access any of the data.

However, to support power users and local developers, there is the option to place a catalog.yaml file at a defined location in the user's home directory (access_nri_intake.USER_CATALOG_LOCATION). This catalog will automatically take precedence over the 'live' catalog on Gadi, if it exists. I thought about adding utility functions that would allow users to create this directory, put a catalog.yaml in it, etc.; however, given that it's a power user move, I decided it's easier and safer to have the user do that themselves manually.
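The precedence rule described above can be sketched as follows. This is a minimal illustration, not the package's actual code: resolve_catalog_path is a hypothetical helper name, and the two path constants simply mirror the values quoted later in this conversation.

```python
from pathlib import Path

# Values as quoted later in this thread; treated here as illustrative defaults.
CATALOG_LOCATION = Path(
    "/g/data/xp65/public/apps/access-nri-intake-catalog/catalog.yaml"
)
USER_CATALOG_LOCATION = Path.home() / ".access_nri_intake_catalog/catalog.yaml"


def resolve_catalog_path(
    user_path: Path = USER_CATALOG_LOCATION,
    live_path: Path = CATALOG_LOCATION,
) -> Path:
    """Return the user's catalog.yaml if it exists, otherwise the live one.

    Hypothetical helper: the user copy wins whenever the file is present.
    """
    if user_path.is_file():
        return user_path
    return live_path
```

Because the user file wins purely by existing, renaming or removing it is the only way to fall back to the live catalog in this sketch.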

Versioning

Catalog versions will now be date-based, e.g., a catalog built today will be, by default, v2024-11-07. Attempts to set a version number that doesn't conform to this pattern will raise an exception.
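As a rough sketch of the date-based scheme, assuming a simple regular-expression check: the names VERSION_PATTERN, default_version and validate_version are hypothetical, and the exception type actually raised by the package may differ.

```python
import re
from datetime import date
from typing import Optional

# Assumed pattern: a literal "v" followed by a zero-padded ISO date.
VERSION_PATTERN = re.compile(r"v\d{4}-\d{2}-\d{2}")


def default_version(today: Optional[date] = None) -> str:
    """Build the default version string from a date (today if not given)."""
    today = today or date.today()
    return today.strftime("v%Y-%m-%d")


def validate_version(version: str) -> str:
    """Return the version unchanged, or raise if it is not vYYYY-MM-DD."""
    if not VERSION_PATTERN.fullmatch(version):
        raise ValueError(
            f"Version {version!r} does not match the pattern vYYYY-MM-DD"
        )
    return version
```

Under this sketch an old-style number such as v0.1.3 is rejected, as is a date without zero padding such as v2024-11-7.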

catalog.yaml will now contain a min and max version. This is to cover the possibility that the catalog structure may change, and a particular version of catalog.yaml may not be compatible with certain catalog versions. I've confirmed that doing the usual < and > operations on our version strings has the expected output.
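The reason plain string comparison works is that zero-padded vYYYY-MM-DD strings sort lexicographically in the same order as the dates they encode. A small sketch of a range check built on that property (is_supported is a hypothetical name, and inclusive bounds are an assumption):

```python
def is_supported(version: str, min_version: str, max_version: str) -> bool:
    """Check that a version lies within the inclusive [min, max] range.

    Relies on zero-padded vYYYY-MM-DD strings comparing in date order.
    """
    return min_version <= version <= max_version


# Ordinary string comparison agrees with chronological order.
print("v2024-06-19" < "v2024-11-07")
print(is_supported("v2024-11-07", "v2024-06-19", "v2024-11-12"))
```

This only holds while every version is zero-padded to the same width, which is one reason to reject non-conforming version strings up front.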

Under this versioning scheme, there isn't really a need for a symlinked latest version of the catalog, particularly since we can update the live catalog on xp65 at will, without doing a code release.

Building the catalog

Because catalog.yaml will be placed on xp65 now, there is no need to make a new code release for every catalog build.

As mentioned above, a catalog built today will take today's date as the default version, although the user can override it.

The build process is now aware of older versions of catalog.yaml, and of directories that look like older catalogs:

  • If no catalog.yaml exists, one will be created. If the data directory contains folders that look like catalog versions (i.e. vYYYY-MM-DD), then the code will use those to construct the min and max version boundaries (i.e., it will assume that the new catalog.yaml is good for describing those existing sources).
  • If a catalog.yaml exists, and there is no structural change to the catalog, then the existing catalog.yaml will be updated with:
    • A new default (and probably max) version; and,
    • Updated storage metadata that is the union of the old and new storage metadata (i.e., we deliberately don't remove any storage metadata that is redundant with the new catalog version, as it will still be required for the older versions under this catalog.yaml).
  • If the new catalog is structurally different to the old catalog, then a new catalog.yaml will be created, with min version = max version = current/new version. The old catalog.yaml will be moved aside to catalog-<old min version>-<old max version>.yaml. These catalogs are nominally not accessible, unless the user hacks the access_nri_intake.CATALOG_LOCATION variable. For now, given how infrequently we'll be making the sort of changes that will trigger this scenario, I'm fine with that.
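The three cases above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in this PR: existing_versions and plan_build are hypothetical names, storage metadata is modelled as a plain set of labels, and widening the min version with min() in the update case is a guess.

```python
from __future__ import annotations

import re
from pathlib import Path

VERSION_DIR = re.compile(r"v\d{4}-\d{2}-\d{2}")


def existing_versions(data_dir: Path) -> list[str]:
    """Sorted names of sub-directories that look like vYYYY-MM-DD."""
    return sorted(
        p.name
        for p in data_dir.iterdir()
        if p.is_dir() and VERSION_DIR.fullmatch(p.name)
    )


def plan_build(new_version, new_storage, old=None,
               structure_changed=False, version_dirs=()):
    """Decide what the next catalog.yaml should contain.

    `old` is None, or a dict with "min", "max" and "storage" keys
    describing the existing catalog.yaml (an assumed representation).
    """
    if old is None:
        # Case 1: no catalog.yaml yet; cover any existing version folders.
        versions = [*version_dirs, new_version]
        return {"min": min(versions), "max": max(versions),
                "default": new_version, "storage": set(new_storage),
                "archive_old_as": None}
    if not structure_changed:
        # Case 2: same structure; extend the range and union the storage.
        return {"min": min(old["min"], new_version),
                "max": max(old["max"], new_version),
                "default": new_version,
                "storage": set(old["storage"]) | set(new_storage),
                "archive_old_as": None}
    # Case 3: structural change; start afresh and move the old file aside.
    return {"min": new_version, "max": new_version,
            "default": new_version, "storage": set(new_storage),
            "archive_old_as": f"catalog-{old['min']}-{old['max']}.yaml"}
```

Returning a plan rather than writing files keeps the decision logic easy to unit test; the caller would then perform the rename and the YAML write.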

Documentation

To be updated once we're happy with the core structure of this update. The updates required will be:

  • Alter the catalog build process instructions;
  • Document the ability to have a local 'override' catalog.yaml.

Testing

All of the above should have at least one unit test.

@charles-turner-1
Collaborator

Running a build test right now - will comment if I have any issues

@charles-turner-1
Collaborator

@rbeucher It looks like you might have overwritten a default catalog location or similar on Friday? I'm currently getting the following error - are you able to take a look & let me know if the file in the error message is one you recognise?

>>> import intake
>>> intake.cat.access_nri
/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.07/lib/python3.10/importlib/__init__.py:126: RuntimeWarning: Unable to access a default catalog location. Calling intake.cat.access_nri will not work.
  return _bootstrap._gcd_import(name[level:], package, level)
access_nri:
  args: {}
  description: ''
  driver: intake.catalog.base.Catalog
  metadata: {}

>>> from access_nri_intake.utils import get_catalog_fp
>>> get_catalog_fp()
'/g/data/xp65/public/apps/access-nri-intake-catalog/catalog.yaml'
>>> intake.open_catalog('/g/data/xp65/public/apps/access-nri-intake-catalog/catalog.yaml').access_nri
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[9], line 1
----> 1 intake.open_catalog('/g/data/xp65/public/apps/access-nri-intake-catalog/catalog.yaml').access_nri
...
FileNotFoundError: [Errno 2] No such file or directory: '/g/data/tm70/rb5533/intake_tests/v0.1.4+13.g869f576.dirty/metacatalog.csv'

I've run a catalog build with just CMIP5 enabled which built without any issues - I'm just running into this slightly weird error trying to import it.

@rbeucher
Member

Yes. Sorry, my bad. It looks like I messed up the file.

@charles-turner-1
Collaborator

No worries. I'm looking into it more closely now - probably something we're going to want to have some guard rails against.

@charles-turner-1
Collaborator

I've restored the file to default to v0.1.3 for now.

@marc-white
Collaborator Author

@charles-turner-1 I think you might have to hack the code to open your existing catalog (or, better yet, put it into ~/.access_nri_intake_catalog/) - the code only knows to look for a catalog to open at either CATALOG_LOCATION = "/g/data/xp65/public/apps/access-nri-intake-catalog/catalog.yaml" or USER_CATALOG_LOCATION = Path.home() / ".access_nri_intake_catalog/catalog.yaml".

We should probably look at adding a path option to intake.cat.access_nri to see if we can make that more flexible.

Collaborator

charles-turner-1 left a comment

I've managed to build a catalog subset & I'm happy that this behaves as expected.

I think it might be nice to add an option that lets the user toggle discovery of $HOME/.access_nri_intake_catalog/catalog.yaml from the python interpreter - rather than requiring the user to rename the file/folder to stop it overriding the default, but perhaps that's for a separate PR?
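One possible shape for such a switch, sketched under loud assumptions: the environment variable name ACCESS_NRI_IGNORE_USER_CATALOG and the function pick_catalog are both hypothetical and do not exist in the package. The idea is only that an explicit flag, checked before the file-existence test, would avoid having to rename the user's file.

```python
import os
from pathlib import Path
from typing import Mapping


def pick_catalog(
    user_path: Path,
    live_path: Path,
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Prefer the user's catalog unless discovery has been switched off.

    ACCESS_NRI_IGNORE_USER_CATALOG is a made-up variable name used
    purely for illustration.
    """
    disabled = env.get("ACCESS_NRI_IGNORE_USER_CATALOG", "") == "1"
    if not disabled and user_path.is_file():
        return user_path
    return live_path
```

Whether a flag like this could take effect after the package has already been imported depends on when the entry point resolves the catalog path, which this sketch does not address.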

@charles-turner-1
Collaborator

Sorry @marc-white, just saw your comment - the default catalog was pointing at a nonexistent file. My catalog in .access_nri_intake_catalog/catalog.yaml worked just fine.

I've approved but I think before merging we want to update docs?

@marc-white
Collaborator Author

Yes, a docs update is definitely required. I can get started on that today.

@marc-white
Collaborator Author

I've added the necessary documentation, take a peek...

Collaborator

charles-turner-1 left a comment

Other than the two minor typos, documentation changes look correct to me.

I updated cli.py::build if update: ... clause on my local machine to make it a bit more readable to check the docs - are you happy for me to push the changes to the branch?


:code:`access_nri_intake_catalog` only links a singular :code:`catalog.yaml` to the entry point :code:`intake.cat.access_nri`; either the
user's local version, or if that does not exist, the live version on Gadi (see :ref:`faq`). To load outdated catalogs from Gadi, we recommend
copying the :code:`catalog-<old min version>-<old max version>.yaml` to :code:`~/access_nri_intake_catalog/catalog.yaml`.
Collaborator

I think this should be ~/.access_nri_intake_catalog/catalog.yaml, not ~/access_nri_intake_catalog/catalog.yaml?

Collaborator Author

Correct

:code:`SCHEMA_HASH`). The easiest way to update this is to first set :code:`SCHEMA_HASH` to :code:`None`. The
updated hash will then be printed to screen when the sub-package is imported and this can be copied and pasted
across.
As of version 0.14, the catalog schema is now a part of the :code:`access_nri_intake_catalog` package, rather
Collaborator

I think this should be version 0.1.4, not 0.14

Collaborator Author

Correct, I think there might be a couple of those...

@marc-white
Collaborator Author

> Other than the two minor typos, documentation changes look correct to me.
>
> I updated cli.py::build if update: ... clause on my local machine to make it a bit more readable to check the docs - are you happy for me to push the changes to the branch?

Yep push the changes

@charles-turner-1
Collaborator

charles-turner-1 commented Nov 12, 2024

@marc-white Can you confirm the shorthand variable names I used in 5c80be3 aren't misleading? Specifically driver_new, etc.

Other than that I think this is good to go.

@marc-white
Collaborator Author

> @marc-white Can you confirm the shorthand variable names I used in 5c80be3 aren't misleading? Specifically driver_new, etc.

Looks reasonable to me!

charles-turner-1 merged commit 6d9bf87 into main on Nov 12, 2024
19 checks passed
@marc-white
Collaborator Author

I'm going to get onto Gadi now and create symlinks for the existing catalog versions, so they get picked up the first time we generate a 'new-style' catalog.

Development

Successfully merging this pull request may close these issues.

Default to a "latest" version?