These are the container and deployment scripts for the Cockpit integration tests and automated maintenance tasks.
The container has optional mounts:

- A directory for image files. Defined by the `$COCKPIT_IMAGES_DATA_DIR` environment variable, conventionally `/cache/images`. On production hosts, this is mounted from `/var/cache/cockpit-tasks/images`.

- S3 access tokens for image and log buckets. Defined by the `$COCKPIT_S3_KEY_DIR` environment variable, conventionally `/run/secrets/s3-keys`. On production hosts, this is mounted from `/var/lib/cockpit-secrets/s3-keys`.

- A directory for GitHub and AMQP secrets. Used by both the tasks and the webhook container. Must be in `/run/secrets/webhook` (bots currently assumes that). It contains:

  - `.config--github-token`: GitHub token to create and update issues and PRs.
  - `amqp-{client,server}.{pem,key}`: TLS certificates for RabbitMQ.
  - `ca.pem`: the general Cockpit CI Certificate Authority which signed the above AMQP certificates.

  On production hosts, this is mounted from `/var/lib/cockpit-secrets/webhook`.

To generate the certificates needed for cross-cluster AMQP authentication, run the `credentials/webhook/generate.sh` script. This requires a generic "Cockpit CI" certificate authority first, so if you don't have that yet, run `credentials/generate-ca.sh` first. Run either script in the target directory (e.g. `/var/lib/cockpit-secrets/webhook/`).
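
A minimal sketch of that flow, assuming the conventional target directory from above and a checkout of this repository in `~/cockpituous`:

```sh
# run both scripts from inside the target secrets directory (assumed path)
mkdir -p /var/lib/cockpit-secrets/webhook
cd /var/lib/cockpit-secrets/webhook

# create the "Cockpit CI" certificate authority first, if you don't have one yet
~/cockpituous/credentials/generate-ca.sh

# then generate the AMQP/webhook certificates signed by that CA
~/cockpituous/credentials/webhook/generate.sh
```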

Deployment happens through Ansible, depending on the target cloud. The tasks containers are controlled by the `cockpit-tasks@*` systemd units.
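
On a production host you can inspect those units directly; for example (the instance name is just an illustration):

```sh
# list the tasks container units and their state
systemctl list-units 'cockpit-tasks@*'

# follow the log of one instance
journalctl -fu cockpit-tasks@1
```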

OpenShift primarily runs the GitHub webhook responder and AMQP server. As `/dev/kvm` support on OpenShift is hard to come by, the bots `job-runner` and the deployment resources currently only support a tasks container which processes the `statistics` and `webhook` queues.

You need a persistent shared volume for `test-results.db` and the Prometheus database. Create it with

```sh
oc create -f tasks/images-claim-centosci.yaml
```

Now create all the remaining Kubernetes objects. The secrets are created from the `/var/lib/cockpit-secrets/*` directories as described above:

```sh
make tasks-secrets | oc create -f -
oc create -f tasks/cockpit-tasks-webhook.json
oc create -f tasks/cockpit-tasks-centosci.json
```

Some helpful commands:

```sh
oc describe rc
oc describe pods
oc log -f cockpit-tasks-xxxx
```

For hacking on the webhook, tasks container, bots infrastructure, or validating new container images, you can also run a podman pod locally with RabbitMQ, webhook, minio S3, and tasks containers. Without arguments this will run some purely local integration tests:

```sh
pytest
```

This will also generate the secrets in a temporary directory. By default this will use the `ghcr.io/cockpit-project/tasks:latest` container, but you can run a different image by setting `$TASKS_IMAGE`.
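
For example, to run the tests against a locally built image (the tag is only an illustration):

```sh
TASKS_IMAGE=localhost/tasks:test pytest
```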

You can also test the whole GitHub → webhook → tasks → GitHub status workflow on some cockpituous PR by specifying the PR number, your GitHub token, and optionally a non-default repository for testing against a fork:

```sh
pytest -vvsk test_real_pr --pr 123 --pr-repository yourfork/cockpituous --github-token=/home/user/.config/cockpit-dev/github-token
```

This will run tests-scan/tests-trigger on the given PR and trigger a `unit-tests` test which simply does `make check`.

You can get an interactive shell with

```sh
pytest -sm shell
```

to run things manually. For example, use `publish-queue` to inject a job into AMQP, or run `job-runner` or some bots command.

This container can also be used for local development with toolbx, to get an "official" Cockpit development environment that's independent from the host:

```sh
toolbox create --image ghcr.io/cockpit-project/tasks cockpit
toolbox enter cockpit
```

When you want to debug a problem with a test which may be sensitive to its particular resource configuration (such as calibrating RAM, /dev/shm sizes, or the behaviour of libvirt in a container), you can run the tasks container directly with podman. The production parameters are set in the `job-runner.toml` file in the `tasks-systemd` Ansible role. You don't need secrets, custom networks, or most environment settings; the crucial parts are the memory, device, and image cache configurations.

If you want to share your host's image cache (which is a really good idea), temporarily make it writable to the unprivileged user in the container:

```sh
chmod o+w ~/.cache/cockpit-images
```
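
When you are done debugging, you can make the cache private again:

```sh
chmod o-w ~/.cache/cockpit-images
```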

Scaled down to a quarter of the size (to fit on a laptop), you can then run the container like this:

```sh
podman run -it --rm --device=/dev/kvm --memory=6g --pids-limit=4096 --shm-size=256m \
    --security-opt label=disable -v ~/.cache/cockpit-images:/cache/images \
    -e COCKPIT_IMAGES_DATA_DIR=/cache/images -e TEST_JOBS=2 ghcr.io/cockpit-project/tasks
```

Inside, you can then run a test, for example:

```sh
git clone https://github.com/cockpit-project/cockpit-podman
cd cockpit-podman/
test/run
```
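
The standard Cockpit test environment variables work here as well; as a sketch (the OS image name is just an example, and assumes the corresponding image is available):

```sh
# pick a particular OS image and reduce parallelism further
TEST_OS=fedora-42 TEST_JOBS=1 test/run
```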

Add a webhook to your GitHub project on the Settings → Webhooks page of your project:

- Set the Payload URL to the DNS host name of the deployed route:
  `http://webhook-cockpit.apps.ocp.cloud.ci.centos.org`
- Use the same secret as in `~/.config/github-webhook-token` above.
- Change the Content Type to `application/json`.
- Select "Let me select individual events" and let the hook run on "Pull requests" and "Statuses".

See the GitHub development documentation for more information.

- On your project's "Settings → Manage Access" page, invite the `cockpituous` user as a collaborator. This should result in a pending invite.

- Join our Matrix channel #cockpit:fedoraproject.org and ask Martin Pitt (`pitti`), Allison Karlitskaya (`lis`), or Matej Marusak (`mmarusak`) to log into GitHub as the `cockpituous` user and accept the invite on the GitHub org or project's `/invitations` page.

- In the Cockpit bots project, add your project to the test map. Start with `_manual` tests.

- Send a first PR to your project and use bots/tests-trigger to trigger the contexts that you want. They should be picked up and run.

- Once you have fixed your tests to succeed, adjust the test map again to move them to the "main" branch. From now on, every PR against main will automatically trigger these tests.

When a pull request event or a status event is received, the webhook will trigger tests for the tasks bots (see "Event flow" below for details).
A pull request event is queued when the pull request is opened or synchronized. A status event is only queued when the description ends with "(direct trigger)".
We don't directly connect webhook events to tasks bots, as workers come and go, and fail quite often; also, we need something to schedule the incoming requests across all available workers.
So we put the webhook events into AMQP queues. AMQP provides a distributed, transactional, and fail-safe work queue, provides the scheduling for free, and is really easy to set up.
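
If you want to peek at those queues, you can ask RabbitMQ directly; a sketch against the local development pod described above (the container name is an assumption, check `podman ps` for the real one):

```sh
# show each queue and the number of messages waiting in it
podman exec -it cockpituous-rabbitmq rabbitmqctl list_queues name messages
```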

```mermaid
sequenceDiagram
    GitHub ->>+ webhook pod: webhook event<br>(JSON payload)
    webhook pod ->>+ RabbitMQ pod: AMQP with raw<br>JSON payload<br>→ webhook queue
    RabbitMQ pod ->>+ webhook pod: queue put success
    webhook pod ->>+ GitHub: Finish HTTP webhook request
    RabbitMQ pod ->>+ tasks pod: run-queue<br>pick from webhook queue
    note right of tasks pod: checks event type field,<br>calls {tests,issue}-scan
    tasks pod ->>+ RabbitMQ pod: shell command into<br>internal or public<br>tasks queue
    tasks pod ->>+ RabbitMQ pod: ACK webhook queue item
    RabbitMQ pod ->>+ tasks pod: run-queue<br>pick from tasks queues
    tasks pod ->>+ GitHub: update issue/PR to "in progress"
    note right of tasks pod: execute shell command<br>(test, image refresh)
    tasks pod ->>+ GitHub: update issue/PR to "pass" or "fail"
    tasks pod ->>+ RabbitMQ pod: ACK tasks queue item
```

- Project configures a webhook for the interesting bits; most importantly "pull request opened or pushed" and "issue changed".

- A PR is opened/changed in a project, or an issue gets a bot-related task (e.g. "fetch new translations" or "check for NPM updates"). GitHub sends a webhook event of the corresponding type.

- The webhook calls an OpenShift route, e.g. `http://webhook-cockpit.apps.ocp.cloud.ci.centos.org/`.
  This is a route/service that gets that HTTP request to a pod that has (1) an off-the-shelf RabbitMQ container, and (2) a cockpit/tasks container that runs the actual webhook.
  See the Kubernetes resources for details about the route, service, and pod.

  That webhook is a fairly straightforward piece of Python that routes the various event types to `handle_{pull_request,issues,...}()` handlers. It essentially just connects to the AMQP pod next to it (`amqp.frontdoor.svc:5671`) and puts the payload into the "webhook" queue. This initial step involves no complicated logic or interpretation, and just puts the payload into a safe place. The reason for this indirection with one interpret queue (the webhook queue) and several task queues is to reduce load on the webhook: interpreting a status event, for instance, can take up to 10 seconds and can fail, whereas HTTP webhook requests ought to be answered quickly and reliably.

- Then all worker bots also connect to the same AMQP container (locally through the service or remotely through the route). `run-queue` consumes a queue entry, does its thing (see below), and once everything is done it acks the entry back to the AMQP server. If anything goes wrong in between and the worker crashes, AMQP automatically puts the item back into the queue.
  Authentication to AMQP happens through client-side SSL certificates; we have a `distributed_queue.py` convenience wrapper for this.

- Some cockpit/tasks bot picks up the event payload from the "webhook" queue, and interprets it with tests-scan or issue-scan depending on the event type. This results in a `job-runner` JSON task, a shell command like `prometheus-stats`, or similar. If this involves any Red Hat internal resources, like RHEL images, that command gets put into the "internal" queue, otherwise into the "public" queue.

- Some cockpit/tasks bot picks up the task from the internal or public queue (depending on whether it has access to Red Hat internal infrastructure), executes it, publishes the log, updates the GitHub status, and finally acks the queue item.