[Docs] Document CI System #14656
Conversation
Thank you for taking the time to do this. These documents are so, so helpful. I'm very impressed by how much you've understood in such a short time. I have a few comments, some that add context and one or two puzzles.
Your diagrams are brilliant. Would it be possible to use SVG instead of PNG please? It gets very grainy when I zoom in!
> When a new PR is opened, or an existing PR changes, hooks are triggered which alert the CI service.
> The CI service will then:
>
> - Check for validity
Reworded
dev-docs/services/ci/README.md (Outdated)

> - Check for validity
>   - The author must be a known developer (or else a commit SHA can be individually approved via the CI service UI)
>   - The PR must be against the `main` branch
I think this extra context is super helpful. IMO it's worth including both "how we have it set up" and "what can be configured"
This also makes it sound like CI is more poll-based than trigger-based? I don't have enough permissions in the repo to check what hooks we have set up, so some of this is speculative (I should probably do a better job of highlighting guesswork).
In fact, the advice from Daniel that "deleting CI's most recent deploy run will make it run again" seems to imply it must be polling-based. I'll look into this today.
It's a bit of both. Updated the doc with that info
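(Illustrative aside: a minimal sketch of what such a "bit of both" hybrid trigger model can look like. Class and method names here are hypothetical, not the actual CI service code: webhooks mark a dirty flag for low latency, while a periodic poll also marks it so state is re-derived even if a hook delivery is lost.)

```python
import threading

# Hypothetical sketch of a webhook-plus-polling hybrid (illustrative names,
# not the real ci/ci code): webhooks set a dirty flag for low latency, and a
# periodic poll re-marks it so nothing is missed if a hook delivery is lost.

class WatchedBranch:
    def __init__(self):
        self.github_changed = threading.Event()

    def on_webhook(self, event_type: str):
        # Called by the HTTP handler when github delivers a hook.
        self.github_changed.set()

    def poll(self):
        # Called on a timer; marks dirty unconditionally so state is
        # re-derived even if a webhook delivery was dropped.
        self.github_changed.set()

    def update_loop_iteration(self):
        if self.github_changed.is_set():
            self.github_changed.clear()
            return "refreshed state from github"
        return "nothing to do"

branch = WatchedBranch()
print(branch.update_loop_iteration())  # nothing to do
branch.on_webhook("pull_request")
print(branch.update_loop_iteration())  # refreshed state from github
```

Under this model, deleting the most recent deploy run would indeed cause a re-run on the next poll, even with no new webhook event.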
dev-docs/services/ci/README.md (Outdated)

> - The author must be a known developer (or else a commit SHA can be individually approved via the CI service UI)
> - The PR must be against the `main` branch
> - Run tests
>   - The CI service will run tests against the PR
👍 I'll add this under the details of what the test batch does below.
> The process of running tests goes like:
>
> - CI will generate a new batch and submit it against the production Hail Batch service using its own service credentials
👍 I've mentioned this now
dev-docs/services/ci/README.md (Outdated)

> - The batch contains jobs which will:
>   - Clone the appropriate branch and build a new set of docker images for the updated services.
>   - Deploy the batch suite of k8s services into one of many CI-owned namespaces in the Hail Batch cluster
>   - These namespaces are called
Gotta leave a couple of cliffhangers in here for the next season
> - Clone the appropriate branch and build a new set of docker images for the updated services.
> - Deploy the batch suite of k8s services into one of many CI-owned namespaces in the Hail Batch cluster
> - These namespaces are called
> - Run a series of tests against the services
It also stands up a private instance of the batch services and tests against those!
Oh I see you mentioned that below. Nice.
dev-docs/services/ci/README.md (Outdated)

> Examples of CI test runs can be seen by searching through the production batch log, as long as you have developer
> permissions. For example: `/batches?q=user+%3D+ci%0D%0Atest+%3D+1`
You need to be a member of the `ci` billing project, I believe.
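(Aside: the query string in that example URL is just two percent-encoded filter lines. A quick way to see this, using only the standard library:)

```python
from urllib.parse import unquote_plus

# Decoding the example batch-search query above: `+` encodes a space,
# `%3D` is `=`, and `%0D%0A` is the CRLF separating the two filters.
query = unquote_plus("user+%3D+ci%0D%0Atest+%3D+1")
print(query.splitlines())  # ['user = ci', 'test = 1']
```

So the search finds batches submitted by the `ci` user with the `test = 1` attribute.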
dev-docs/services/ci/README.md (Outdated)

> Readiness is determined by github status checks. The following conditions must be met:
>
> - The PR must be against the `main` branch
> - The PR must have passed all tests (and there must be none outstanding)
In GCP. Note that the CI tests on Azure are not required to merge the PR.
Is this true? I could have sworn I'd seen PRs with GCP CI green and Azure CI yellow still not merged. As with the comment above... this makes me think I should check whether CI is trigger-based or polling-based
It's true as far as github is concerned. I haven't looked too closely at what CI does with respect to its counterpart in foreign clouds.
dev-docs/services/ci/README.md (Outdated)

> - Check for validity
>   - The author must be a known developer
>   - The PR must be against the `main` branch
Is this a copypasta, or does CI actually do this? I thought it deployed new commits on main irrespective of the commit author?
It needs to be against a watched branch, which in hail-is is only `main`. It doesn't matter who the PR originator was.
dev-docs/services/ci/README.md (Outdated)

> Note: It's not actually quite as waterfall-y as this. In fact the jobs are all running in a hail
> batch, and each service being deployed has its own path through the DAG. So it's quite possible that the services are
> test/deploy-ing in parallel, and that the deploy for one service might happen before the test for another has even begun.
>
> This should all be fine, because it was previously tested as part of the PR approval process.
Do we really deploy anything before all test steps have succeeded? My mental model was that we ignore the tests run on the pr pre-merging, and treat it as untested post-merge, because the merge could have introduced semantic conflicts.
This is a good question. This is an assumption based on:

- The batch lists the tasks in service-by-service order
- The "deploy batch" job is listed before "test batch" jobs
- There only seems to be one "deploy batch" job and it seems to target the `default` namespace

But I could easily have missed something. Worth double checking.
(Side-note: oh wow, some job groups would be a really nice addition when trying to work this kind of thing out!)
Some selected examples from a recent deploy:

- Job 27: `batch_image` ran from 21:58 to 22:00
- Job 39: `test_hail_java_0` ran from 22:12 to 22:17
- Job 44: `test_hail_python_0` ran from 22:12 to 22:18
- Job 78: `test_hail_python_local_backend_0` ran from 22:15 to 22:21
- Job 153: `deploy_batch` ran from 22:04 to 22:05
- Job 176: `test_batch_0` ran from 22:15 to 22:23
- Job 182: `deploy_ci` ran from 22:05 to 22:05
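(The timings above can be checked mechanically. This is just a sanity check over the observed data, not an inspection of the actual DAG: both deploy jobs finished before any test job started, which supports the "deploy first, test second" reading.)

```python
# Sanity-check over the job timings listed above (HH:MM from the batch UI):
# did every deploy_* job finish before any test_* job started?
jobs = {
    "batch_image": ("21:58", "22:00"),
    "test_hail_java_0": ("22:12", "22:17"),
    "test_hail_python_0": ("22:12", "22:18"),
    "test_hail_python_local_backend_0": ("22:15", "22:21"),
    "deploy_batch": ("22:04", "22:05"),
    "test_batch_0": ("22:15", "22:23"),
    "deploy_ci": ("22:05", "22:05"),
}

def minutes(hhmm: str) -> int:
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

deploy_end = max(minutes(end) for name, (_, end) in jobs.items() if name.startswith("deploy_"))
test_start = min(minutes(start) for name, (start, _) in jobs.items() if name.startswith("test_"))
print(deploy_end <= test_start)  # True: deploys finished before any test began
```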
Note that you can see the dependencies directly in `build.yaml`. I believe `deploy_batch` is what deploys the new batch service in the test namespace, as `test_batch` depends on it. I'm not sure which job actually deploys to production.
To close the loop here: `deploy_batch` does deploy to production (in the default namespace), which happens before tests are run. But CI will only merge a PR if it has a successful CI run on top of the current head of main. That means the commit that will be merged and deployed is exactly the same (with the same git SHA) as the commit that passed CI. There are still a few small holes for issues to slip past the tests, in particular because CI runs the batch tests in the test namespace in a fresh deployment, while the tests after deploying are run in the default namespace against the production system.
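(The invariant described here can be stated compactly. The sketch below is hypothetical, with illustrative names rather than the real CI code: a PR may merge only if a passing run tested the exact commit that will land, rebased onto the current head of main.)

```python
# Hypothetical sketch of the merge invariant: the tested commit must be
# exactly the commit that will land on main (same git SHA), and it must
# have been built on the *current* head of main. Names are illustrative.
from dataclasses import dataclass

@dataclass
class CiRun:
    tested_sha: str  # SHA of the squashed/rebased candidate commit
    base_sha: str    # head of main the candidate was rebased onto
    passed: bool

def may_merge(run: CiRun, current_main_head: str, candidate_sha: str) -> bool:
    return (run.passed
            and run.tested_sha == candidate_sha
            and run.base_sha == current_main_head)

run = CiRun(tested_sha="abc123", base_sha="main9", passed=True)
print(may_merge(run, current_main_head="main9", candidate_sha="abc123"))  # True
print(may_merge(run, current_main_head="mainA", candidate_sha="abc123"))  # False: main moved
```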
Diagrams are great, thanks for your effort. A couple of small things.
dev-docs/services/ci/README.md (Outdated)

> # The CI Service in Hail Batch
>
> Hail Batch includes a CI service which has three functional purposes:
This is borderline pedantry, but I don't think of "Hail Batch" as being anything other than the batch service and the python client library. The batch service + CI etc. are generally referred to as the "Services".
Suggested change:

> # The CI Service in Hail Batch
> Hail Batch includes a CI service which has three functional purposes:

becomes:

> # The CI Service
> The CI service has three functional purposes:
dev-docs/services/ci/README.md (Outdated)

> Hail Batch includes a CI service which has three functional purposes:
>
> - Runs tests against pull requests
> - Merges PRs to the `main` branch
Suggested change: "Merges PRs to the `main` branch" → "Merges PRs to the `main` branch when all checks have passed"
dev-docs/services/ci/README.md (Outdated)

> As part of its configuration, CI must be configured with an access token allowing it to operate on
> behalf of a github account called hail-ci-robot.
Repetition of "configure".

Suggested change:

> As part of its configuration, CI must be configured with an access token allowing it to operate on
> behalf of a github account called hail-ci-robot.

becomes:

> CI must be configured with an access token allowing it to operate on
> behalf of the 'hail-ci-robot' github account.
dev-docs/services/ci/README.md (Outdated)

> - The state of the deployed system ("`batch_changed`")
> - The state of the watched branch itself (for example - its current commit) ("`state_changed`")
>
> In hail-is, there is exactly one watched branch, which is `hail-is/hail:main`.
I think it would make more sense to describe the deployment model before mentioning the organisation. Reading this, I was like "what does hail-is have to do with this". Alternatively, say something along the lines of "the default-namespace deployment of ci currently tracks `hail-is/hail:main`".
FWIW, I don't particularly like documenting the current values of config variables as these go stale quickly. I'd link to the file where interested readers can find current values.
Hmm, I can move it and reword it. I want to strike a balance between "documenting something very general purpose" and "explaining how it is being used by us". I'll try again
I agree. This document will be useful for understanding how our CI deployment works on `hail-is/hail:main`. I think whether some behavior is hard-coded or controlled by configuration is of secondary importance. If some config variable changes frequently, then agreed we should avoid talking about its current setting, but I don't think that's the case here.
dev-docs/services/ci/README.md (Outdated)

> ### Github Configuration
>
> To make CI more responsive it can be configured to receive webhook event triggers from the github repository.
Suggested change: "To make CI more responsive it can be configured to receive webhook event triggers from the github repository." → "To make CI more responsive it is configured to receive webhook event triggers from the github repository."
Again, this is an accident of trying to be both a general purpose doc ("can be") vs how we use it ("is")
dev-docs/services/ci/README.md (Outdated)

> - Tasks in the batch are defined in `build.yaml` in the top repo directory. CI generates jobs in the batch from the steps defined in there.
> - The batch contains jobs which will:
>   - Clone the appropriate branch
>   - Squash and rebase against `main`
More pedantry.

Suggested change: "Squash and rebase against `main`" → "Squash and rebase onto `main`"
dev-docs/services/ci/README.md (Outdated)

> - The batch contains jobs which will:
>   - Clone the appropriate branch
>   - Squash and rebase against `main`
>   - Build a new set of docker images for the updated services.
It also builds and tests images for Hail Query.
dev-docs/services/ci/README.md (Outdated)

> Readiness is determined by github status checks. The following conditions must be met:
>
> - The PR must be against the `main` branch
> - The PR must have passed all tests in GCP
Suggested change: "The PR must have passed all tests in GCP" → "The PR must have passed all required checks"
dev-docs/services/ci/README.md (Outdated)

> 2. Do we really deploy first / test second? And do we really deploy Batch and CI using jobs that are already running in
>    Batch and CI? Do the services get shut down and reconnect to existing instances of the deploy jobs started by the
>    previous version?
Don't quite understand the question. There's one deploy job at a time that deploys already tested artefacts (albeit, tested in a different namespace). k8s handles updating the various pods with their new images. Sometimes, the services are brought down depending on database migrations. I think that's defined in `build.yaml`.
This is looking great, Chris! I just have a few questions/clarifications.
dev-docs/services/ci/README.md (Outdated)

> When a PR state changes, it will cause the `github_changed` flag to become dirty.
>
> During its update loop, the CI service will determine whether any PRs targeting its watched branches are elibible to
Suggested change: "During its update loop, the CI service will determine whether any PRs targeting its watched branches are elibible to" → "During its update loop, the CI service will determine whether any PRs targeting its watched branches are eligible to"
dev-docs/services/ci/README.md (Outdated)

> ## Merging PRs to the `main` branch
>
> When a PR state changes, it will cause the `github_changed` flag to become dirty.
Is this the `github_changed` flag of the target branch (if that is a watched branch)?
dev-docs/services/ci/README.md (Outdated)

> Readiness is determined by github status checks. The following conditions must be met:
>
> - The PR must be against the `main` branch
Is it `main`, or any watched branch? Just since you started out making the distinction carefully, we should maintain it throughout.
Any watched branch, but this is automatically true or else we wouldn't be considering the PR for mergeability in the first place... so I'll remove this line.
dev-docs/services/ci/README.md (Outdated)

> The control flow from final approval to CI merging a PR looks like:
>
> - The PR's state will change in github (either a check changes to SUCCESS, or a review is approved)
> - The github webhook callback will cause the `github_changed` flag to be marked as dirty for the `WatchedBranch`
The watched branch which is the target of the PR?
dev-docs/services/ci/README.md (Outdated)

> - The PR's state will change in github (either a check changes to SUCCESS, or a review is approved)
> - The github webhook callback will cause the `github_changed` flag to be marked as dirty for the `WatchedBranch`
> - The `WatchedBranch`'s `_update` method in [`github.py`](../../../ci/ci/github.py) scans all PRs against the branch and updates state that helps determine mergeability.
> - The `WatchedBranch`'s `_update` method in [`github.py`](../../../ci/ci/github.py) iterates again to merge all mergeable PRs, in priority order
There's a subtlety here which is being glossed over. As I understand (without having inspected the code, so correct me if I'm wrong), we keep a PR's check set to SUCCESS even if the successful CI run was on top of an outdated main (i.e. another PR has merged since this PR was last tested). Rerunning the tests of all PRs every time any PR merges would be wasteful, but we don't want to merge a PR unless it has a successful CI run on top of the current main. My mental model is that CI chooses one PR to be the next merge candidate, and reruns the tests if the last run is out of date, then merges only if that succeeds. But I don't know for sure if that's right.
Hmm, interesting. I didn't see evidence for this in the code but it seems logical.
I also wonder whether this is why we sometimes see PRs get "stuck" with pending checks. Perhaps CI is just choosing to delay re-running until the PR is back on top of its mergeable list 🤔
I did a deep dive into this today, and you are indeed correct. I've added a flowchart showing the full set of paths through the update loop, which might be overkill in the end, but was a useful exercise for me in the process of making it.
What we seem to do is:

- During `heal`:
  - Nominate a current merge candidate
  - If its tests are out of date, re-run them
- During `try_to_merge`:
  - Iterate over all PRs
  - Test `is_mergeable`
  - Merge if mergeable (and return after the first success)
So... I think it's possible that we might kick off a "merge candidate" test and then decide to merge some other PR while those tests are running (if its tests were already up to date)
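(This behaviour can be sketched in a few lines. The sketch below is hypothetical, with illustrative names rather than the real `ci/ci/github.py` API: `heal` nominates one candidate and re-runs stale tests, while `try_to_merge` merges the first PR whose existing run is already up to date, which is how the race described above can arise.)

```python
# Hypothetical sketch (illustrative names, not the real ci/ci/github.py API)
# of the update-loop behaviour described above.

def heal(prs, main_head):
    # Nominate a single merge candidate and refresh its tests if stale.
    candidate = next((pr for pr in prs if pr["approved"]), None)
    if candidate and candidate["tested_base"] != main_head:
        candidate["tests_running"] = True  # re-run against current main

def try_to_merge(prs, main_head):
    # Merge the first fully mergeable PR; return after one merge.
    for pr in prs:
        if pr["approved"] and pr["passed"] and pr["tested_base"] == main_head:
            pr["merged"] = True
            return pr
    return None

prs = [
    {"approved": True, "passed": True, "tested_base": "old_main", "tests_running": False, "merged": False},
    {"approved": True, "passed": True, "tested_base": "main9", "tests_running": False, "merged": False},
]
heal(prs, main_head="main9")         # PR 0 is the candidate; stale, so its tests re-run
merged = try_to_merge(prs, "main9")  # but PR 1 merges meanwhile: its run was already current
print(prs[0]["tests_running"], merged is prs[1])  # True True
```

This is exactly the window where a "merge candidate" test can be kicked off and some other PR merges while those tests are running.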
> During its update loop, the CI service will determine that the SHA of the `WatchedBranch` has changed and trigger
> a deployment.
>
> - Create a new batch job to:
I believe this is also `build.yaml`, just with a configuration variable set to make it deploy?
dev-docs/services/ci/README.md (Outdated)

> that services are deploy/test-ing in parallel, and that the deploy for one service might happen before the test for
> another has completed.
>
> This should all be fine, because everything was previously tested as part of the PR approval process.
Should emphasize here that a PR is only merged (really, squashed and rebased) if the exact commit that is added to main, with the exact same SHA, has passed CI.
dev-docs/services/ci/README.md (Outdated)

> 1. Sometimes in the logs we see messages like `"update github br-hail-ci-test-ci-test-<RANDOM>-main"` for various random branch names.
>    What does this mean?
Could these branches be the results of squashing and rebasing a PR before running CI?
@ehigham @patrick-schultz The PR history is getting a little long and messy now, but I think I've now answered/addressed all your comments 🤞
I like to resolve threads to keep track of what I've addressed. Graphite especially makes it easy to see what threads are still unresolved.
This is brilliant work, thank you very much indeed.
Changes made, ready for re-review
Thanks again for all your work on this, I think it's super valuable!
Fixes #14657
An initial set of documentation for how our CI system works.
Documentation only, no production change.
Here's a link to a rendered version of the markdown.
Note to reviewers - please (please!) check the details to make sure I got this right!