Defining data product version types #984

damonmcc · 2024-07-08T01:06:41Z

damonmcc
Jul 8, 2024
Maintainer

Current state

DE code

In dcpy/utils/versions.py we declare the version types that a data product can have. I'd think about these as "release" or "publish" version types to make it clear that we're talking about how we enumerate the data we share.

The current version types are: MajorMinor, Quarter, Date, FirstOfMonth.

We also use a field in our product recipes called version_strategy to allow automated generation of the release version of a build.

The current version strategies are: first_of_month, bump_latest_release, bump_latest_release(int)

DE data products

The release schedule for our data products is in the DE Data Catalog excel file in SharePoint. Some notable version schemes used by our data products:

ZAP: every month, 20240617
ZTL: every month, 20240501
DevDB: twice a year, 22Q2
PLUTO: every quarter and every month, 24v2.1
CPDB: three times a year, 24prelim

Problems

PR #973 was a temporary fix to a problem related to the FirstOfMonth version and surfaced a few concerns we have about the current state:

Having both FirstOfMonth and Date is unsustainable. If we publish a dataset on 5/1, does that mean we always publish on the first of a month?
During lifecycle.builds.plan, we determine the latest published version by parsing and sorting folder names.
Sorting versions of different types isn't always straightforward (e.g. FacDB with values like 2023-09-01, 24v1, 2024-01-01, 2024-02-01)
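To illustrate the sorting concern, here is a minimal sketch using the FacDB values above. It assumes the order they are listed in is the order in which they were published, which is an assumption made only for this example.

```python
# Folder names for FacDB, in the order listed above
# (assumed here to be the order in which they were published).
published_order = ["2023-09-01", "24v1", "2024-01-01", "2024-02-01"]

# A plain string sort compares character by character, so every
# "20..." date sorts before "24v1" regardless of when it was published.
string_sorted = sorted(published_order)

print(string_sorted)
print(string_sorted == published_order)
```

Under that assumption, the string sort places 24v1 last, so "latest published version" would be wrong for any build that relies on folder-name order alone.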

Proposed changes

Our version types should describe how frequently we publish a data product, be fully distinguishable, and be sortable to reflect the order in which releases were published.

we use `Calendar Versioning` to construct version schemes

All dataset versions use a calendar versioning scheme inspired by the CalVer conventions. We use a combination of date segments and incremental segments to construct a version scheme for every data product.

These are the segments we use:

0Y - Zero-padded year - 06, 16, 24
FY0Y - Zero-padded fiscal year - FY06, FY24, FY25
Q - Quarter - Q1, Q2, Q3, Q4
0M - Zero-padded month - 01, 02 ... 11, 12
0D - Zero-padded day - 01, 02 ... 30, 31
0W - Zero-padded week - 01, 02, 33, 52
MAJOR - An increment
MINOR - An increment
PATCH - An increment
MODIFIER - An optional text tag, such as "dev", "alpha", "prelim", "exec"

We use these segments to construct version schemes. Each data product uses a version scheme. Multiple products may use the same scheme. For example:

Quarterly

scheme format: 0YQ

Monthly

scheme format: 0Y0M

Semantic

scheme format: 0Y.MAJOR.MINOR.PATCH

PLUTO

scheme format: 0YvMAJOR.MINOR.PATCH
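One way the scheme formats above could be turned into something machine-checkable is to translate each segment into a regex fragment. This is only a sketch: `SEGMENT_PATTERNS` and `compile_scheme` are hypothetical names, not existing dcpy code, and the exact patterns per segment are assumptions.

```python
import re

# Regex fragment for each segment named in the list above.
# These patterns are assumptions for illustration, not dcpy code.
SEGMENT_PATTERNS = {
    "FY0Y": r"FY(?P<fiscal_year>\d{2})",
    "MAJOR": r"(?P<major>\d+)",
    "MINOR": r"(?P<minor>\d+)",
    "PATCH": r"(?P<patch>\d+)",
    "0Y": r"(?P<year>\d{2})",
    "0M": r"(?P<month>0[1-9]|1[0-2])",
    "0D": r"(?P<day>0[1-9]|[12]\d|3[01])",
    "0W": r"(?P<week>0[1-9]|[1-4]\d|5[0-3])",
    "Q": r"Q(?P<quarter>[1-4])",
}


def compile_scheme(scheme_format: str) -> re.Pattern:
    """Translate a scheme format such as "0YQ" into a compiled regex."""
    # Try longer tokens first so "FY0Y" is not read as "F", "Y", "0Y".
    tokens = sorted(SEGMENT_PATTERNS, key=len, reverse=True)
    parts = []
    i = 0
    while i < len(scheme_format):
        for token in tokens:
            if scheme_format.startswith(token, i):
                parts.append(SEGMENT_PATTERNS[token])
                i += len(token)
                break
        else:
            # Anything that is not a segment (e.g. "v" or ".") is a literal.
            parts.append(re.escape(scheme_format[i]))
            i += 1
    return re.compile("".join(parts))


quarterly = compile_scheme("0YQ")
monthly = compile_scheme("0Y0M")
pluto = compile_scheme("0YvMAJOR.MINOR.PATCH")

print(quarterly.fullmatch("24Q3").groupdict())
print(monthly.fullmatch("2408").groupdict())
print(quarterly.fullmatch("2408"))
```

Note that this translation is strict: the PLUTO pattern requires all three increments, so it matches 24v2.1.0 but not 24v2.1. Treating trailing increments as optional would need an extra rule.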

Each data product recipe file has a default number of increments to bump a new version. This is a way to declare the release schedule of a data product. During a build, the release version is either automatically determined using the latest published version + default bump or manually declared via github action input. For example:

DevDB and FacDB

version scheme: Quarterly
default version bump: 2
24Q1, 24Q3

PLUTO

version scheme: PLUTO
default version bump: 1, major in recipe.yml and 1, minor in recipe-minor.yml
24v1, 24v1.1, 24v1.2, 24v2
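As a rough sketch of what "latest published version + default bump" could look like for the two examples above. `bump_quarterly` and `bump_pluto` are hypothetical helpers, not functions that exist in dcpy, and the PLUTO helper ignores patches and year rollover.

```python
import re


def bump_quarterly(version: str, bump: int = 1) -> str:
    """Bump a Quarterly (0YQ) version such as "24Q1" by `bump` quarters."""
    match = re.fullmatch(r"(\d{2})Q([1-4])", version)
    if match is None:
        raise ValueError(f"not a Quarterly version: {version!r}")
    year, quarter = int(match.group(1)), int(match.group(2))
    # Count quarters from zero so that rolling past Q4 carries into the year.
    total = year * 4 + (quarter - 1) + bump
    new_year, new_quarter = divmod(total, 4)
    return f"{new_year % 100:02d}Q{new_quarter + 1}"


def bump_pluto(version: str, part: str) -> str:
    """Bump the major or minor increment of a PLUTO version such as "24v1.1"."""
    match = re.fullmatch(r"(\d{2})v(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"not a PLUTO version: {version!r}")
    year = match.group(1)
    major = int(match.group(2))
    minor = int(match.group(3) or 0)  # a bare "24v1" means minor 0
    if part == "major":
        return f"{year}v{major + 1}"
    if part == "minor":
        return f"{year}v{major}.{minor + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump_quarterly("24Q1", 2))
print(bump_quarterly("24Q3", 2))
print(bump_pluto("24v1", "minor"))
print(bump_pluto("24v1.2", "major"))
```

With a default bump of 2, 24Q1 leads to 24Q3 and 24Q3 rolls over to 25Q1. For PLUTO, the minor and major bumps reproduce the 24v1, 24v1.1, 24v1.2, 24v2 sequence.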

we don't change the versions of data products

We don't need to change the versions we currently use for data products to improve how we represent and handle them internally. e.g. PLUTO can still have a v in it, CPDB doesn't have to change to use the fiscal year in its version, etc.

BUT these are changes we may decide are worth making later and the changes proposed here have that in mind.

TylerMatteo · 2024-07-08T15:57:37Z

TylerMatteo
Jul 8, 2024
Maintainer

Very cool that you all are thinking about versioning! AE has a similar thread which also mentions semver. Given that you all seem to be interested in including dates, you might be interested in CalVer? I came across it when I was researching versioning strategies. It seems to be a newer "standard" than semver but has some interesting advantages for folks that work on set release schedules.

0 replies

damonmcc · 2024-07-09T02:41:49Z

damonmcc
Jul 9, 2024
Maintainer Author

there may be a useful python library for parsing/comparing versions (packaging, which is a third-party package from PyPA rather than part of the standard library):

>>> from packaging import version
>>> version.parse('2021.01.31') >= version.parse('2021.01.30.dev1')
True
>>> version.parse('2021.01.31.0012') >= version.parse('2021.01.31.1012')
False

1 reply

sf-dcp Jul 10, 2024
Maintainer

I would really like this, since it would reduce the amount of custom code. For example, as part of the new publishing workflow, we need to compare versions in order to:

determine whether we're publishing a version that is "brand new" or a patch to an existing version
update latest only if the version in question is newer.
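A minimal sketch of those two checks, assuming versions have already been parsed into comparable tuples. The `(year, major, minor, patch)` layout and both function names are hypothetical.

```python
# Hypothetical parsed form of a version: (year, major, minor, patch)
Version = tuple[int, int, int, int]


def is_patch_of_existing(new: Version, published: list[Version]) -> bool:
    """True if `new` only differs in its patch from something already published."""
    return new[3] > 0 and any(new[:3] == old[:3] for old in published)


def should_update_latest(new: Version, published: list[Version]) -> bool:
    """True if `new` is newer than everything already published."""
    return not published or new > max(published)


published = [(24, 1, 0, 0), (24, 2, 0, 0)]

patch_to_old = (24, 1, 0, 1)   # a patch to the older 24v1 release
brand_new = (24, 3, 0, 0)      # a new major release

print(is_patch_of_existing(patch_to_old, published), should_update_latest(patch_to_old, published))
print(is_patch_of_existing(brand_new, published), should_update_latest(brand_new, published))
```

The first case is the interesting one: a patch to an older release is a valid publish, but it should not move latest.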

damonmcc · 2024-07-09T13:51:04Z

damonmcc
Jul 9, 2024
Maintainer Author

in the summary, I noted that PLUTO could use a version scheme called Semantic. but that'd be a significant change from the current format (e.g. 24v1 becoming 24.1)

we probably don't wanna make such a disruptive change to PLUTO, so we could have a more specific version scheme called PLUTO and still construct it using the suggested segments:

PLUTO

scheme format: 0YvMAJOR.MINOR.PATCH

PLUTO

version scheme: PLUTO
default version bump: 1, major in recipe.yml and 1, minor in recipe-minor.yml
24v1, 24v1.1, 24v1.2, 24v2

5 replies

fvankrieken Jul 9, 2024
Maintainer

That would be changing majors from 24v2 to 24v2.0 but that seems like a fine change to me if it gives us more flexibility. And it sort of makes a little more sense - 24v2.0 and 24v2.1 really are both 24v2... just different minors. So it's a bit clearer in a way to not have a version that is explicitly "24v2"

damonmcc Jul 9, 2024
Maintainer Author

@sf-dcp asked about .0 yesterday and I felt like 24v2.0 wouldn't be necessary. they seem to sort correctly without it:

but it looks like most python packages (not all) use trailing zeroes. you think it'd be good to do the same?

fvankrieken Jul 9, 2024
Maintainer

But if you patch 24v2? Then it's 24v2.0.1?

sf-dcp Jul 10, 2024
Maintainer

In the dcpy.utils.versions.sort() fn, we actually sort Version objects and the major version is implicitly defined as minor = 0:

That's to say sorting doesn't seem to be an issue.
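For reference, the idea of a missing minor or patch defaulting to 0 could be sketched like this. `PlutoVersion` is a hypothetical class written for this thread, not the actual Version objects in dcpy.utils.versions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class PlutoVersion:
    # Field order defines the sort order: year, then major, minor, patch.
    year: int
    major: int
    minor: int = 0
    patch: int = 0

    @classmethod
    def parse(cls, label: str) -> "PlutoVersion":
        match = re.fullmatch(r"(\d{2})v(\d+)(?:\.(\d+))?(?:\.(\d+))?", label)
        if match is None:
            raise ValueError(f"not a PLUTO version: {label!r}")
        # Missing minor/patch groups fall back to "0".
        year, major, minor, patch = match.groups(default="0")
        return cls(int(year), int(major), int(minor), int(patch))


labels = ["24v2", "24v1.2", "24v2.0.1", "24v1", "24v1.1"]
ordered = sorted(labels, key=PlutoVersion.parse)

print(ordered)
print(PlutoVersion.parse("24v2") == PlutoVersion.parse("24v2.0"))
```

In this sketch 24v2 and 24v2.0 parse to the same value, and a patch like 24v2.0.1 sorts directly after 24v2, so the trailing zero is a display choice rather than a sorting requirement.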

damonmcc Jul 10, 2024
Maintainer Author

I updated the summary to describe and use a PLUTO scheme

sf-dcp · 2024-07-10T01:35:11Z

sf-dcp
Jul 10, 2024
Maintainer

Nice write-up. Excited about standardizing versions. Though I'm a bit unclear which products would change their version schemas as a result of this discussion...

1 reply

damonmcc Jul 10, 2024
Maintainer Author

oops yea that is unclear. I didn't mean to suggest we change data product version schemas asap, just that we change our code to declare and handle the versions that we already use

I'll change the discussion title and maybe add a note to the summary at the top

damonmcc · 2024-07-15T19:36:44Z

damonmcc
Jul 15, 2024
Maintainer Author

just tried to diagram what the version logic during plan could be

flowchart LR
    plan_build[Plan Build]
    get_latest_vers["get_latest_vers()"]
    bump_latest_vers["bump_latest_vers()"]
    bump_patch["bump_patch()"]
    load_data[Data Loading]

    q_prev_declared{prev. version\ndeclared?}
    q_vers_declared{version\ndeclared?}
    q_is_patch{is a\npatch?}

    plan_build --> q_prev_declared

    q_prev_declared -->|Yes| q_vers_declared
    q_prev_declared -->|No| get_latest_vers
    get_latest_vers --> q_vers_declared
    
    q_vers_declared -->|Yes| q_is_patch
    q_vers_declared -->|No| bump_latest_vers
    bump_latest_vers --> load_data

    q_is_patch -->|Yes| bump_patch
    bump_patch --> load_data
    q_is_patch -->|No| load_data

this logic might lead to the following recipe fields:

`models/lifecycle/builds.py`

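from pydantic import BaseModel

from dcpy.utils import versions


class Recipe(BaseModel, extra="forbid", arbitrary_types_allowed=True):
    name: str
    product: str
    # named to match the `version_type` key used in the recipe files
    version_type: versions.VersionType
    # optional: computed during plan when not declared
    version: str | None = None
    previous_version: str | None = None
    ...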

product recipe files + github action inputs

minimal (compute previous and current versions)

# recipe.yml
name: Template
product: db-template
version_type: Monthly
inputs:
  ...

declare current version (compute previous)

# recipe.yml
name: Template
product: db-template
version_type: Monthly
version: 2024-08
inputs:
  ...

patch a version (compute previous)

# recipe.yml
name: FacDB
product: db-facilities
version_type: Quarterly
version: 2024-08
inputs:
  ...

# build.yml
is_a_patch: true
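The flow in the diagram above, combined with these recipe fields, could be sketched roughly as follows. The three callables reuse the names from the diagram, but their signatures are assumptions: here `bump_latest_vers` is given the previous version and `bump_patch` is given the declared version, and neither choice is settled in this thread.

```python
from typing import Callable, Optional


def plan_versions(
    version: Optional[str],
    previous_version: Optional[str],
    is_a_patch: bool,
    get_latest_vers: Callable[[], str],
    bump_latest_vers: Callable[[str], str],
    bump_patch: Callable[[str], str],
) -> tuple[str, str]:
    """Return (previous_version, version) by following the flowchart."""
    # prev. version declared? If not, look up the latest published version.
    if previous_version is None:
        previous_version = get_latest_vers()
    # version declared? If not, bump the latest release.
    if version is None:
        return previous_version, bump_latest_vers(previous_version)
    # is a patch? If so, bump the patch segment of the declared version.
    if is_a_patch:
        return previous_version, bump_patch(version)
    return previous_version, version


# Stub helpers with made-up return values, only to exercise each branch.
stubs = dict(
    get_latest_vers=lambda: "24Q1",
    bump_latest_vers=lambda prev: "24Q3",
    bump_patch=lambda declared: declared + ".1",
)

minimal = plan_versions(None, None, False, **stubs)
declared = plan_versions("24Q4", None, False, **stubs)
patched = plan_versions("24Q3", None, True, **stubs)

print(minimal, declared, patched)
```

Passing the helpers in as arguments keeps the sketch independent of how each one ends up being implemented.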

3 replies

sf-dcp Jul 18, 2024
Maintainer

@damonmcc, something happened to the visual. Also, it makes sense for patch to be defined early on. Do we want to keep it in your visual as the desired outcome or to reflect the current state?

damonmcc Jul 19, 2024
Maintainer Author

@sf-dcp is the diagram working now? I think when u commented I saw it was messed up too, but I do see the diagram now

I'd love to keep this one as the desired outcome. but of course it wouldn't hurt to have another diagram somewhere to describe our current state (either here or in the wiki)

sf-dcp Jul 19, 2024
Maintainer

Yep, it's working now! The desired flow looks good to me.

damonmcc · 2024-08-20T23:29:34Z

damonmcc
Aug 20, 2024
Maintainer Author

while looking at the DCAT-US metadata schema, I found standard values for release frequency called "ISO 8601 Repeating Duration"
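As a small illustration, the release schedules listed in the summary could be written as ISO 8601 repeating durations, which is the form DCAT-US uses for its accrualPeriodicity field. The mapping and the `releases_per_year` helper below are hypothetical, and the helper only handles whole-month durations. PLUTO is left out because it has two cadences.

```python
import re

# Release frequencies from the "DE data products" list in the summary,
# expressed as ISO 8601 repeating durations.
RELEASE_FREQUENCY = {
    "ZAP": "R/P1M",    # every month
    "ZTL": "R/P1M",    # every month
    "DevDB": "R/P6M",  # twice a year
    "CPDB": "R/P4M",   # three times a year
}


def releases_per_year(repeating_duration: str) -> int:
    """Number of releases per year for a month-based repeating duration."""
    match = re.fullmatch(r"R/P(\d+)M", repeating_duration)
    if match is None:
        raise ValueError(f"unsupported duration: {repeating_duration!r}")
    return 12 // int(match.group(1))


for product, duration in RELEASE_FREQUENCY.items():
    print(product, duration, releases_per_year(duration))
```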

0 replies
