Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workflow: Publish Data #980

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
Open

Conversation

lianghai
Copy link
Member

@lianghai lianghai commented Dec 4, 2024

See the fork repo for test runs: https://github.com/lianghai/unicodetools/actions/workflows/publish-data.yml.

See commit messages for rationale behind the changes. The branch name “publish-ucd” is a misnomer, as this workflow covers more than UCD.

To do:

  • Run from a tag
    • The default workflow_dispatch interface’s “Use workflow from” dropdown already lets one to run from a selected ref (branch or tag). It only defaults to the main branch.
    • Tentatively decided to always tag a commit so this workflow can be triggered for a determined commit.
  • Differentiate run names per publication mode
  • Beta mode
  • Is there a directory (any from https://github.com/unicode-org/unicodetools/blob/main/.gitignore?) more conventional to UCD maintainers that should be used in place of “dist”?
    • Now moved to pub/tmp. Any “tmp” directory is already ignored.
  • Can COPY_YEAR simply be the current year?
    • Now obtained from date together with PUB_DATE.
  • Revise documentation: https://github.com/unicode-org/unicodetools/blob/main/docs/data-workflow.md#publication
  • Make usage of date compatible with macOS, for convenience of development

Future considerations:

  • SSH to server for actual deployment
  • Include the final release mode
  • Parametrize release components (UCD, emoji, IDNA, UCA, …) instead of the release phase (dev, alpha, beta, …)
  • Make usage of sed compatible with macOS, for convenience of development

Considered:

  • Can UNI_VER and EMOJI_VER be stored in a central location in the repo?
  • GitHub releases?
    • The artifact uploaded to a workflow run is not permanent, and can only be used for temporary exchange.
    • Need to decide how to tag releases.
    • Not very necessary. The commit will already be tagged. A GitHub release can be confusing for the public as it’s a duplicate of what’s deployed to https://unicode.org/Public/draft/.

Basic separation of input/ouput:

- $1 and $2 are now static because the workflow is run in an isolated environment and the cwd is always the repo root.
- Set dynamic inputs in the workflow as env variables.
- Upload built files as a workflow artifact.

Git doesn’t preserve complete file permissions, and the zipping/unzipping process doesn’t reliably preserve file permissions. Therefore file permissions need to be normalized at deployment site.

The “Copy files from elsewhere” note needs to be documented in data-workflow.md#publication.
Corrected how the dist directory is referenced when moving UCD.zip.

UNITOOLS_DATA and DRAFT are no longer necessary. “dist” as a conventional target direction is easier to read than a variable.

No need to remove old ZIP file because a workflow runs in an isolated environment.
Use modern syntax $(…) in place of `…`.

Quote $MODE in comparison to avoid the confusing “unary operator expected” error when value is empty in testing.

Copy directories directly (rather copy content of a directory to a ready-made directory) whenever possible, for simpler code.
@markusicu
Copy link
Member

Looks promising!

Some thoughts:

Include the final release mode?

That could be messy, because the /Public/draft/ folder structure (all files for a version under one root) differs from that of the final structure (different roots by type eg UCD vs. idna, each with version subfolders).

Is there a directory [...] more conventional to UCD maintainers that should be used in place of “dist”?

I have an aversion against "polluting the source tree". Could we copy into ../pub or ../dist or similar?

Can COPY_YEAR simply be the current year?

Probably.

Can UNI_VER and EMOJI_VER be stored in a central location in the repo?

Sure. Maybe a shared .sh script that gets included/imported?

$1 and $2 are now static because the workflow is run in an isolated environment and the cwd is always the repo root

I think it would be nice/useful if it was still possible to run the publication script(s) from a local command line.

I haven't looked into the nektos/act thing, but since we are mostly running a shell script, it should be easy to still invoke that directly. It would be ok if that meant having to define/export env vars rather than using command line args.

... the zipping/unzipping process doesn’t reliably preserve file permissions. Therefore file permissions need to be normalized at deployment site.

Ever since I added the chmod's to my scripts, I believe that @rickmcgowan has been able to unzip my file drops on the server and didn't need to fix permissions again. Unless he chimes in and says otherwise. @glechner9147 FYI

The “Copy files from elsewhere” note needs to be documented in data-workflow.md#publication.

It is. I just thought it was a useful reminder in the output of the manually-run script.


workflow_dispatch options: s/Snapshot/UCD/
(they are all snapshots...)

We could also consider a list of things to include: emoji, idna, uca, ...
(at a minimum, always publish ucd)

@markusicu
Copy link
Member

Please un-delete the old scripts for now, so that we can try out the new workflow while we still have the old scripts for backup.

@lianghai
Copy link
Member Author

@markusicu,

Include the final release mode?

That could be messy, because the /Public/draft/ folder structure (all files for a version under one root) differs from that of the final structure (different roots by type eg UCD vs. idna, each with version subfolders).

Just some conditional paths. Shouldn’t make the .sh file much more complicated. But anyway, we have more than half an year to experiment so I’m not in a hurry to implement that now.

Is there a directory [...] more conventional to UCD maintainers that should be used in place of “dist”?

I have an aversion against "polluting the source tree". Could we copy into ../pub or ../dist or similar?

Well you guys already have a massive https://github.com/unicode-org/unicodetools/blob/main/.gitignore file that offers numerous ignored directories. Writing to the filesystem outside the current repo is a worse pollution because it basically breaches a sandbox. But again, I’m not familiar with how you guys use this repo so I’ll leave it to you to decide.

Can COPY_YEAR simply be the current year?

Probably.

Can UNI_VER and EMOJI_VER be stored in a central location in the repo?

Sure. Maybe a shared .sh script that gets included/imported?

One convention is to have a source controlled .env file.

$1 and $2 are now static because the workflow is run in an isolated environment and the cwd is always the repo root

I think it would be nice/useful if it was still possible to run the publication script(s) from a local command line.

It is already possible. The external dependencies are those env vars under env: in the .yml file. When you execute the .sh directly, you can supply them like how you usually supply env vars in command line.

On the other hand, I split those env vars to the .yml file only because I consider them to be the dynamic parameters of the workflow, while the .sh file is the stable implementation. You can move them back to the .sh file too, which will be more self-contained.

I haven't looked into the nektos/act thing, but since we are mostly running a shell script, it should be easy to still invoke that directly. It would be ok if that meant having to define/export env vars rather than using command line args.

That usage of nektos/act is only for high-fidelity simulation of the GitHub workflow environment (it connects to a Docker instance to actually run the workflow in an isolated environment). For example, without nektos/act, the .sh file’s way of using the date and sed commands isn’t compatible with my platform (macOS).

... the zipping/unzipping process doesn’t reliably preserve file permissions. Therefore file permissions need to be normalized at deployment site.

Ever since I added the chmod's to my scripts, I believe that @rickmcgowan has been able to unzip my file drops on the server and didn't need to fix permissions again. Unless he chimes in and says otherwise. @glechner9147 FYI

Likely that’s because what we need on the server is simply 755/644, and that’s the default you get when unzipping files. My point is, ZIP files can’t be used to reliably preserve unusual permission settings. If you actually need to control permissions before zipping a bunch of files, you need to either carefully unzip them with some specific command (which is a fragile process when it’s done by a human), or you need to first make a .tar.

workflow_dispatch options: s/Snapshot/UCD/ (they are all snapshots...)

“UCD” is also confusing here. Does it make sense to actually call it “Pre-Alpha” or something? It’s really the release phase that matters.

Also now I realized that the whole workflow (including the filenames) shouldn’t be called “Publish UCD”, but should be something like “Publish Data”.

We could also consider a list of things to include: emoji, idna, uca, ... (at a minimum, always publish ucd)

I can easily refactor to that direction as long as you guys give me an exact list. Otherwise, feel free to make the changes directly. There’s some examples of the supported input types: https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#providing-inputs.

This is gonna make the change history funny, but Markus requested this: unicode-org#980 (comment)
@eggrobin
Copy link
Member

Writing to the filesystem outside the current repo is a worse pollution because it basically breaches a sandbox.

It doesn’t breach anything so long as you control the subtree you work in; it is just a matter of cloning into a subdirectory, as we do in other workflows:

# Out-of-source build.
ucd-and-smoke-tests:
name: Check UCD consistency, invariants, smoke-test generators
runs-on: ubuntu-latest
steps:
- name: Checkout Unicode Tools
uses: actions/checkout@v3
with:
path: unicodetools/mine/src

This is a pretty typical « out-of-source » build setup.
The unicode tools also work with an in-source setup (which is what I use locally), and we have some CI tests to check that this works, but the default for most of our workflows is out-of-source.

@eggrobin
Copy link
Member

eggrobin commented Dec 12, 2024

Does it make sense to actually call it “Pre-Alpha” or something? It’s really the release phase that matters.

We use dev where we have a similar classification in the code:

DEV("dev"), // Before α.
ALPHA("α"), // α review.
BETA("β"); // β review.

(I have an issue to add a GAMMA, #936.)

This also shows up in the JSPs, with draft properties being accessible with the Udev:/Uα:/Uβ: qualifier at various stages in the release.

@lianghai
Copy link
Member Author

@eggrobin,

It doesn’t breach anything so long as you control the subtree you work in; it is just a matter of cloning into a subdirectory, as we do in other workflows:

My goal is to propose simple, straightforward code to get the work done. I write minimal code without “good to have” features because it’s easier to review and easier to maintain. But after merging, it’ll be in PAG’s hands for maintenance, and by all means you guys can complicate it to your taste.

We use dev where we have a similar classification in the code:

I’ll change “Snapshot” to “Dev”.

Also I’ve realized that until multiple files start to share some state like the latestVersionPhase, there’s no need to abstract an env var file for stuff like UNI_VER and EMOJI_VER.

@lianghai lianghai changed the title Workflow for publishing UCD Workflow: Publish Data Dec 12, 2024
@lianghai lianghai marked this pull request as ready for review December 12, 2024 17:24
@lianghai lianghai requested a review from markusicu December 12, 2024 17:25
@lianghai
Copy link
Member Author

Among other changes in the recent commits, I changed the output directory from “dist” to “pub/tmp”, so now it’s in a dedicated and .gitignore-d directory: 03710ab.

@eggrobin
Copy link
Member

eggrobin commented Dec 12, 2024

Can UNI_VER and EMOJI_VER be stored in a central location in the repo?

Sure. Maybe a shared .sh script that gets included/imported?

Also I’ve realized that until multiple files start to share some state like the latestVersionPhase, there’s no need to abstract an env var file for stuff like UNI_VER and EMOJI_VER.

There is a broader maintenance issue here, which is that the yearly exercise of #931 is getting steadily messier with a dozen files in various languages independently carrying the current version, but I agree that this is far out of scope for this PR. I should probably do something about it, since this also ties into the γ phase thing.


I write minimal code without “good to have” features because it’s easier to review and easier to maintain. […] by all means you guys can complicate it to your taste.

The reviewers here, who are maintainers of this codebase, are also striving for ease of review and maintenance, rather than tasteful complication.
FWIW I’m happy with the .gitignored tmp, since we are not completely consistent about workflows being out-of-source (e.g., the JSPs build is in-source).

mv $TMP/UCD/ucd/version-ReadMe.txt $TMP/UCD/ReadMe.txt
rm -r $TMP/UCD/ucd/Unihan

if [ "$RELEASE_PHASE" = "Dev" ]; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would normally go with bash conditional expressions (with [[ ]]) rather than the test command (aka [).

@markusicu, I am unsure about our bash style here, which one do we prefer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn’t familiar with the difference between [ and [[, but now after some reading I do prefer [[ as it offers an experience similar to modern languages. Made the change.

The pub directory is not included in the sparse checkout, and the nektos/act tool for testing locally doesn’t do an actual sparse checkout (it does `docker cp` instead) so I didn’t catch this issue locally.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants