Revamp GitHub tag versions, debugging individual projects, and more! #237

Draft: MattTheCuber wants to merge 12 commits into base: master

MattTheCuber (Contributor) commented Dec 31, 2024

Closes #236

TODO

  • Decide if pre-releases, release candidates, post fix releases, dev releases, etc. count as releases.
  • Decide what to do with Cataclysm: Dark Days Ahead's releases, like 0.H, which are not PEP 440 compliant.
  • Update README.md.
  • Generate and manually validate projects.json.

Summary

This PR includes the following changes and features:

  • Created classes for handling data and grouping functionality in tools\gen_projects_json.py.
  • Rewrote how GitHub tags are parsed in tools\gen_projects_json.py using the PEP 440 compliant standard and custom regex on a per-project basis.
  • Added subcommands to tools\gen_projects_json.py for generate, info, and tags.
  • Refactored and simplified projects.yaml to not require nearly as much manual input.

Tag parsing

Tag parsing is now much more generic, with project-specific handling moved into entries in the projects.yaml file. The base implementation simply tries to parse the version string as a PEP 440 compliant Version object (from packaging.version). If that fails, the tag is ignored for all aspects of data generation (first, latest, count, etc.). Many repositories (roughly a third) use non-compliant tag names. To solve this, each project can define custom regexes to apply to its tags. For example, the Vala project uses tags that look like VALA_0_0_0. The updated entry adds a custom regex to convert these to a compliant version:

  - name: Vala
    gh_url: https://github.com/GNOME/vala
    tag_regex_subs:
      - search: ^VALA_(\d+)_(\d+)_(\d+)$
        replace: \1.\2.\3

Additionally, many projects use tag name prefixes. For example, the StreamEx project uses versions that look like streamex-0.8.3. To fix this, simply remove the prefix with this regex:

  - name: StreamEx
    gh_url: https://github.com/amaembo/streamex
    tag_regex_subs:
      - remove: ^streamex-
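
A minimal sketch of how entries like these could be applied, assuming each `tag_regex_subs` item is either a `search`/`replace` pair or a single `remove` pattern, as in the two snippets above. The function name `apply_regex_subs` and the sample tag `VALA_0_56_17` are hypothetical and not taken from tools\gen_projects_json.py. The search pattern here uses `(\d+)` groups so that multi-digit components are captured whole.

```python
import re


def apply_regex_subs(tag: str, subs: list[dict[str, str]]) -> str:
    """Apply each substitution entry to a tag name, in order."""
    for sub in subs:
        if "remove" in sub:
            # A "remove" entry deletes whatever the pattern matches.
            tag = re.sub(sub["remove"], "", tag)
        else:
            # A "search"/"replace" entry rewrites the match using group references.
            tag = re.sub(sub["search"], sub["replace"], tag)
    return tag


vala_subs = [{"search": r"^VALA_(\d+)_(\d+)_(\d+)$", "replace": r"\1.\2.\3"}]
streamex_subs = [{"remove": r"^streamex-"}]

print(apply_regex_subs("VALA_0_56_17", vala_subs))        # 0.56.17
print(apply_regex_subs("streamex-0.8.3", streamex_subs))  # 0.8.3
```

The resulting strings would then be handed to the PEP 440 parser described earlier.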

This system feels miles better (no offense intended). It also simplifies the code quite a bit and enables automatic parsing of many projects whose tags could not be parsed automatically before (React, FreeCAD, Haskell bytestring, OpenSSL, MAME, Window Maker, ReactOS, three.js, google-api-client, rand, distlib, etc.).

gen_projects_json.py CLI

gen_projects_json.py --help
> python .\tools\gen_projects_json.py --help
usage: gen_projects_json.py [-h] [-u USER] [-k TOKEN] [--disable-caching] {generate,info,tags} ...

Generate or update project.json using projects.yaml.

positional arguments:
  {generate,info,tags}  Available commands
    generate            Generate an updated projects.json file.
    info                Print automatically pulled info for a GitHub project for debugging.
    tags                Print all sorted tags for a GitHub project for debugging.

options:
  -h, --help            show this help message and exit
  -u, --user USER       GitHub Username for API authentication. Falls back to the "GH_USER" environment variable.
  -k, --token TOKEN     A path to a file containing a GitHub personal access token for API authentication. Falls back to the "GH_TOKEN" environment variable.
  --disable-caching     Flag to disable caching. Falls back to the "ZV_DISABLE_CACHING" environment variable.
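
For readers unfamiliar with argparse subcommands, the layout in the help text above could be wired up roughly as follows. This is only a sketch under assumptions: `build_parser` is a hypothetical name, the environment is passed in as a mapping to keep the example deterministic, and the shared options are attached only to the subcommands (the real script also accepts them before the subcommand).

```python
import argparse
from collections.abc import Mapping


def build_parser(env: Mapping[str, str]) -> argparse.ArgumentParser:
    """Build a parser with generate/info/tags subcommands (illustrative only)."""
    # Options shared by every subcommand, with environment-variable fallbacks.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("-u", "--user", default=env.get("GH_USER"))
    common.add_argument("-k", "--token", default=env.get("GH_TOKEN"))

    parser = argparse.ArgumentParser(prog="gen_projects_json.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    generate = subparsers.add_parser("generate", parents=[common])
    generate.add_argument(
        "--disable-caching",
        action="store_true",
        # Assumption: any non-empty value of the variable disables caching.
        default=bool(env.get("ZV_DISABLE_CACHING")),
    )
    for name in ("info", "tags"):
        sub = subparsers.add_parser(name, parents=[common])
        sub.add_argument("name_or_link")
    return parser


args = build_parser({"GH_USER": "octocat"}).parse_args(["tags", "3proxy"])
print(args.command, args.name_or_link, args.user)  # tags 3proxy octocat
```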

Generate

gen_projects_json.py generate --help
> python .\tools\gen_projects_json.py generate --help
usage: gen_projects_json.py generate [-h] [-u USER] [-k TOKEN] [--disable-caching]

options:
  -h, --help         show this help message and exit
  -u, --user USER    GitHub Username for API authentication. Falls back to the "GH_USER" environment variable.
  -k, --token TOKEN  A path to a file containing a GitHub personal access token for API authentication. Falls back to the "GH_TOKEN" environment variable.
  --disable-caching  Flag to disable caching. Falls back to the "ZV_DISABLE_CACHING" environment variable.

This command did not change.

Info

gen_projects_json.py info --help
> python .\tools\gen_projects_json.py info --help    
usage: gen_projects_json.py info [-h] [-u USER] [-k TOKEN] name_or_link

positional arguments:
  name_or_link       The project.yaml exact entry name or GitHub link.

options:
  -h, --help         show this help message and exit
  -u, --user USER    GitHub Username for API authentication. Falls back to the "GH_USER" environment variable.
  -k, --token TOKEN  A path to a file containing a GitHub personal access token for API authentication. Falls back to the "GH_TOKEN" environment variable.

The info command shows what would be written to projects.json for the specified project, which makes debugging easier. You can pass either a GitHub link or the exact name of an entry in projects.yaml.

Tags

gen_projects_json.py tags --help
> python .\tools\gen_projects_json.py tags --help
usage: gen_projects_json.py tags [-h] [-u USER] [-k TOKEN] name_or_link

positional arguments:
  name_or_link       The project.yaml exact entry name or GitHub link.

options:
  -h, --help         show this help message and exit
  -u, --user USER    GitHub Username for API authentication. Falls back to the "GH_USER" environment variable.
  -k, --token TOKEN  A path to a file containing a GitHub personal access token for API authentication. Falls back to the "GH_TOKEN" environment variable.

This command is very helpful with the new tagging system for building and testing regexes. When adding a new library, simply pass the GitHub address to see whether any tags are non-compliant (and therefore require a regex). From there you can see every parsed version, every duplicate version (for example, due to improper regex patterns), and every failed version. Here is a demonstration output:

> python .\tools\gen_projects_json.py tags https://github.com/test/repo
Processing https://github.com/test/repo

Parsed tags:
v0.5.0 (parsed as 0.5.0)
v0.4.0-RC1 (parsed as 0.4.0rc1)
v0.4.0 (parsed as 0.4.0)
v0.3.0 (parsed as 0.3.0)
v0.2.0 (parsed as 0.2.0)
v0.1.0-beta.1 (parsed as 0.1.0b1)
v0.1.0 (parsed as 0.1.0)

Failed tags:
latest (tried latest)
test-ci-1 (tried test-ci-1)

Here is the output for a more complicated project:

  - name: 3proxy
    gh_url: https://github.com/z3APA3A/3proxy
    tag_regex_subs:
      - remove: ^3proxy-

> python .\tools\gen_projects_json.py tags 3proxy
Processing 3proxy

Parsed tags:
0.9.4 (parsed as 0.9.4)
0.9.3 (parsed as 0.9.3)
0.9.2 (parsed as 0.9.2)
0.9.1 (parsed as 0.9.1)
0.9.0 (parsed as 0.9.0)
0.9.0-rc (parsed as 0.9.0rc0)
0.8.13 (parsed as 0.8.13)
0.8.12 (parsed as 0.8.12)
0.8.11 (parsed as 0.8.11)
0.8.10 (parsed as 0.8.10)
0.8.9 (parsed as 0.8.9)
0.8.8 (parsed as 0.8.8)
3proxy-0.8.7 (parsed as 0.8.7)
3proxy-0.8.6 (parsed as 0.8.6)
3proxy-0.8.5 (parsed as 0.8.5)
3proxy-0.8.4 (parsed as 0.8.4)
3proxy-0.8.3 (parsed as 0.8.3)
3proxy-0.8.2 (parsed as 0.8.2)
3proxy-0.7.1.4 (parsed as 0.7.1.4)
3proxy-0.8.1 (parsed as 0.8.1)
3proxy-0.8.0 (parsed as 0.8.0)
3proxy-0.8-pre (parsed as 0.8rc0)
3proxy-0.7.1.3 (parsed as 0.7.1.3)
3proxy-0.7.1.2 (parsed as 0.7.1.2)
v0.7.1.2 (parsed as 0.7.1.2)
v0.7.1.1 (parsed as 0.7.1.1)
v0.7.1 (parsed as 0.7.1)
v0.7 (parsed as 0.7)

Duplicate tags:
3proxy-0.8.8 (parsed as 0.8.8)

In this second example we can see a duplicate tag, which is fine in this case since there are actually two tags with the same version.
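
The three groups in the output above (parsed, duplicate, failed) can be illustrated with a small sketch. It is an assumption-laden stand-in: `classify_tags` is a hypothetical name, and a simple regex replaces the packaging.version.Version parsing that the PR actually relies on, so pre-release tags are not handled here.

```python
import re

# Stand-in validity check: optional "v", then dot-separated numbers.
# The PR itself uses packaging.version.Version, which accepts far more.
SIMPLE_VERSION = re.compile(r"^v?(\d+(?:\.\d+)*)$")


def classify_tags(tags: list[str], prefix_pattern: str = ""):
    """Split tag names into parsed, duplicate, and failed groups."""
    parsed: dict[str, str] = {}  # normalized version -> first tag seen
    duplicates: list[tuple[str, str]] = []
    failed: list[str] = []
    for tag in tags:
        candidate = re.sub(prefix_pattern, "", tag) if prefix_pattern else tag
        match = SIMPLE_VERSION.match(candidate)
        if match is None:
            failed.append(tag)
        elif match.group(1) in parsed:
            duplicates.append((tag, match.group(1)))
        else:
            parsed[match.group(1)] = tag
    return parsed, duplicates, failed


parsed, duplicates, failed = classify_tags(
    ["0.8.8", "3proxy-0.8.8", "3proxy-0.8.7", "latest"], prefix_pattern=r"^3proxy-"
)
print(parsed)      # {'0.8.8': '0.8.8', '0.8.7': '3proxy-0.8.7'}
print(duplicates)  # [('3proxy-0.8.8', '0.8.8')]
print(failed)      # ['latest']
```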

mahmoud (Owner) commented Jan 3, 2025

Just catching up on this now. Very cool! I think the optional regex transform per project makes a lot of sense. Definitely miles better, no offense taken.

To your TODO questions:

We're looking far beyond the Python ecosystem, and I'd expect PEP440 is probably too strict. The schema you have with match/replace/remove is fine, but instead of passing it to PEP440, we can say, if that string starts with 0, then the project is zerover. Versions that don't match the initial regex are ignored. We only log a failure if no releases match (the regex or URL is probably wrong).

This should definitely help with the huge increase in monorepos, which are architecturally good imo but complicate tagging. There are lots of server/0.1.0 / client/6.1.2-type situations.

In terms of release count, I'm fine merging/ignoring suffixed releases (dev/pre/post) with their equivalent numeric release.
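
One way to read "merging suffixed releases with their equivalent numeric release" is to count distinct numeric prefixes. The sketch below assumes exactly that; `count_releases` and the pattern are hypothetical, and the tags are borrowed from the demonstration output earlier in this PR.

```python
import re

# Leading numeric release such as "0.4.0"; anything after it is a suffix.
NUMERIC_RELEASE = re.compile(r"^v?(\d+(?:\.\d+)*)")


def count_releases(tags: list[str]) -> int:
    """Count distinct numeric releases, folding dev/pre/post tags into them."""
    releases = set()
    for tag in tags:
        match = NUMERIC_RELEASE.match(tag)
        if match:
            releases.add(match.group(1))
    return len(releases)


tags = ["v0.5.0", "v0.4.0-RC1", "v0.4.0", "v0.1.0-beta.1", "v0.1.0", "latest"]
print(count_releases(tags))  # 3
```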

MattTheCuber (Contributor, Author) commented

Thanks for the input!

MattTheCuber (Contributor, Author) commented

> We're looking far beyond the Python ecosystem, and I'd expect PEP440 is probably too strict. The schema you have with match/replace/remove is fine, but instead of passing it to PEP440, we can say, if that string starts with 0, then the project is zerover. Versions that don't match the initial regex are ignored. We only log a failure if no releases match (the regex or URL is probably wrong).

I understand and agree. The trouble is that 99% of releases are PEP 440 compliant after regex parsing. The only one that is not is Cataclysm: Dark Days Ahead. However, there could be many more in the future, so it makes sense to not add restrictions. The part that makes this difficult is counting the number of releases. I'll see what I can do.

MattTheCuber (Contributor, Author) commented Jan 4, 2025

Hmm, this is proving challenging since I based most of the logic around the use of the Version object and custom regex to filter out undesired/duplicate tags like 0.8.0-beta1-candidate1, 0.10.2.0-KAFKA-5526, v1.4.4-changelog, v0.0.0-20230206210201-441728b4c075, sshuttle-0.60-macos-bin, v2.1.3plusPR822, clamav-0.98-dmgxar, tor-0.0.6incompat-merged, etc. All of these releases were duplicates of other releases and aren't directly parsable with PEP 440. Adding a more generic versioning checker that searched for tags beginning with X. would accept all of these and not be able to tell that they are duplicates...
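
The concern can be made concrete with a tiny sketch. The `LOOSE` pattern is a hypothetical rendering of "tags beginning with X.", and the unsuffixed tags `0.8.0` and `v1.4.4` are assumed to exist alongside the suffixed ones quoted above.

```python
import re

# Hypothetical loose rule: a tag counts if, after an optional non-digit
# prefix, it begins with "<number>.".
LOOSE = re.compile(r"^\D*\d+\.")

tags = ["0.8.0", "0.8.0-beta1-candidate1", "v1.4.4", "v1.4.4-changelog"]
accepted = [tag for tag in tags if LOOSE.match(tag)]

# Every tag passes, and comparing the raw strings sees four distinct
# "releases" even though only two numeric versions are involved.
print(len(accepted))       # 4
print(len(set(accepted)))  # 4
```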

mahmoud (Owner) left a comment

Idea for the regex: especially given how many projects we're already tracking, the chances that one of them adds a new suffix are quite high, and maintaining the list would get pretty involved.

from packaging.version import InvalidVersion, Version


class RegexSubstituionDict(TypedDict):
mahmoud (Owner):

RegexSubstituionDict -> RegexSubstitutionDict

        return True

    def process_name(self, regex_subs: list[RegexSubstituionDict] | None = None):
        for sub in regex_subs or []:
mahmoud (Owner):

To handle the issue of suffixes, the first thing that comes to mind is to treat the search regex as a prefix that must match, and to append our own suffix portion to the configured regex, something to the effect of [^\d].*. Then we just snip off everything after the last matched part. Do you think that would work?

MattTheCuber (Contributor, Author):

Can you elaborate on this?

mahmoud (Owner), Jan 4, 2025:

I'll try! Regex is one of those things that can be easier to do than say. :)

If I recall correctly, the version handling currently has an ignore (or skip?) step and a strip step. One matched tags we don't want, and the other cleaned up the versions we did want.

Ideally we could just have one kind of regex that matched and extracted. We can start with a really general default, something like: ^\D*(\d+(?:\D\d+)+)\D*

This already works for simpler cases like Julia that require a tag_regex_match to remove suffixes now.
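
To see what that default pattern does, here is a short sketch that runs it through `re.match` and returns the first group. The helper name `extract_version` is hypothetical; the tags are taken from earlier in this thread.

```python
import re
from typing import Optional

# The general default proposed above: skip a non-digit prefix, capture
# digit runs joined by single non-digit separators, ignore what follows.
DEFAULT_TAG_PATTERN = re.compile(r"^\D*(\d+(?:\D\d+)+)\D*")


def extract_version(tag: str) -> Optional[str]:
    """Return the captured version portion of a tag, or None if no match."""
    match = DEFAULT_TAG_PATTERN.match(tag)
    return match.group(1) if match else None


print(extract_version("v0.4.0-RC1"))               # 0.4.0
print(extract_version("sshuttle-0.60-macos-bin"))  # 0.60
print(extract_version("latest"))                   # None
print(extract_version("3proxy-0.8.7"))             # None
```

The last case fails because the prefix itself starts with a digit, which is one reason a per-project prefix or regex is still needed.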

Of course, this runs into issues with, e.g., hashicorp vault, which has tagging for subcomponents. We could make every entry with a static prefix revert to regex, but I'd suggest:

  - name: HashiCorp Vault
    gh_url: https://github.com/hashicorp/vault
    emeritus: true
    tag_match:
      - prefix: v

So we build the regex for the project: v(\d+(?:\D\d+)+)\D*

For cases like stellarium which have two version formats, I'm thinking something like:

  - name: Stellarium
    gh_url: https://github.com/Stellarium/stellarium
    tag_match:
      - prefix: v
      - regex: stellarium-(\d+-\d+-\d+)

And then for the second pattern, we stick the \D* on at the end. So the second regex would be ^stellarium-(\d+-\d+-\d+)\D*. We always tack on the \D* and pull the first group from re.match, which has the effect of dropping suffixes. And if the first character of any match is 0, that's 0ver. If it doesn't match, we try other regexes. For the purposes of release counting and assessing whether the project is currently 0ver, we only look at releases that match a regex.
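
A rough sketch of the mechanism described above, assuming `tag_match` entries are either a `prefix` or a `regex` as in the two YAML snippets. The helper names (`build_patterns`, `match_version`, `is_zerover`) and the sample tags are hypothetical, not part of the PR.

```python
import re
from typing import Optional

# Version portion shared by all prefix-style entries.
VERSION_CORE = r"(\d+(?:\D\d+)+)"


def build_patterns(tag_match: list[dict[str, str]]) -> list[re.Pattern]:
    """Compile one regex per tag_match entry."""
    patterns = []
    for entry in tag_match:
        if "prefix" in entry:
            source = re.escape(entry["prefix"]) + VERSION_CORE
        else:
            source = entry["regex"]
        # Always tack on \D* so that trailing suffixes are dropped.
        patterns.append(re.compile(source + r"\D*"))
    return patterns


def match_version(tag: str, patterns: list[re.Pattern]) -> Optional[str]:
    """Try each pattern in turn and return the first captured group."""
    for pattern in patterns:
        match = pattern.match(tag)
        if match:
            return match.group(1)
    return None  # tag is ignored for counting and 0ver checks


def is_zerover(version: str) -> bool:
    """A matched version whose first character is 0 counts as 0ver."""
    return version.startswith("0")


stellarium = build_patterns(
    [{"prefix": "v"}, {"regex": r"stellarium-(\d+-\d+-\d+)"}]
)
print(match_version("v0.22.1", stellarium))            # 0.22.1
print(match_version("stellarium-0-12-4", stellarium))  # 0-12-4
print(match_version("server/0.1.0", stellarium))       # None
print(is_zerover("0-12-4"))                            # True
```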

I think this gives us a pretty robust mechanism. Ideally one where we won't be in regexland every other day because some project decided to get cute with their tags :) lmk what you think!

mahmoud (Owner) commented Jan 4, 2025

Also for Cataclysm in particular, I say we just kick it over to being manual. :P

@MattTheCuber MattTheCuber mentioned this pull request Jan 6, 2025
Successfully merging this pull request may close these issues.

Make a script to test a GitHub repo's metadata
2 participants