DevelopmentPrinciples

Martin Pitt edited this page Jan 27, 2021 · 10 revisions

Development

  • keep master working on as many (current) OSes as possible; use run-time feature/API detection
  • test for every change; some OS conditionals in tests
  • code is easy (or possible+documented) to run straight out of git tree; no permanent system modifications
  • test VMs double as devel environment for testing intrusive changes; faster iteration with scp instead of image-prepare
  • tests are easy to run and debug locally
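
As an illustration of the run-time feature detection mentioned above, here is a minimal Python sketch. It probes for what the running system actually provides instead of branching on the OS name. The helper `first_available` and the candidate list are hypothetical and are not taken from the project's code.

```python
import shutil


def first_available(candidates, which=shutil.which):
    """Return the first command in `candidates` found on this system, or None.

    `which` is injectable so the probe can be faked in unit tests.
    """
    for name in candidates:
        if which(name) is not None:
            return name
    return None


# Hypothetical candidate list: probe for a package manager rather than
# hard-coding "if Fedora ... elif Debian ...".
PACKAGE_MANAGERS = ("dnf", "apt-get", "zypper")

if __name__ == "__main__":
    print(first_available(PACKAGE_MANAGERS))
```

Probing at run time means a new OS release that ships the same tool needs no code change, which is what keeps one branch working across many systems.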

Upstream CI

  • test on every supported OS
  • offline build and tests
  • provide our own versions of third-party services: FreeIPA, Samba AD, candlepin, OpenShift, ovirt, selenium containers
  • provide mechanics for creating rpms, debs, and entire repositories from scratch locally
  • separate OS image refreshes
  • test robustness: tests touched by a change must succeed 3 times in a row, untouched tests must succeed at least once in 3 attempts; flaky tests are tracked in a database
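
The robustness rule in the last bullet can be sketched as follows. This is one reading of the rule, not the project's actual test runner: `evaluate`, its parameters, and the `flaky` flag (a candidate entry for the flake database) are hypothetical names.

```python
def evaluate(run_test, touched, attempts=3):
    """Apply the robustness policy to one test.

    run_test: callable returning True on success (hypothetical interface).
    touched:  True if the change under test modified this test.
    Returns (passed, flaky).
    """
    results = []
    for _ in range(attempts):
        ok = run_test()
        results.append(ok)
        if touched and not ok:
            break  # a touched test must pass every single time
        if not touched and ok:
            break  # an untouched test only needs one pass
    passed = all(results) if touched else any(results)
    # An untouched test that failed before passing is a flake worth recording.
    flaky = (not touched) and passed and len(results) > 1
    return passed, flaky
```

The asymmetry is deliberate: a change is held fully responsible for the tests it modifies, while unrelated instability is tolerated once or twice but still recorded.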

Fedora/RHEL

  • run upstream integration tests in downstream gating
  • the above approach allows us to upload current master until the latest freeze

Releases

  • automate everything: github, fedora, copr, PPA, dockerhub, home page (docs)
  • process in principle: create tag, write blog post

Our tests/CI Error Budget

As of January 2021, this is brand new and describes a goal that we do not currently meet.

High-level goal: What keeps our velocity and motivation?

  • PRs get validated in a reasonable time (queue + test run time)
  • We don’t waste time on interpreting unstable test results
  • We are not afraid of touching code
  • Test failures are relevant and meaningful. This relieves us from having to decide whether a failure is “unrelated or not” every. single. time.

Service Level Objectives

When the following objectives are fulfilled, we operate normally and happily. Once these drop below the mark (“exceeding error budget”), a part of the team (discussed in daily standups) stops feature development and non-urgent changes, and fixes our infrastructure and tests to get back into the agreed service level.

Objectives that support the high-level goal, in descending importance:

  1. A “no-change” PR becomes fully green with a 75% chance at the first attempt, and with a 95% chance after one retry
  2. Every individual test succeeds at least 90% of the time
  3. 95% of all PRs are merged without failed tests
  4. Test results are available within 60 minutes in 95% of runs
  5. Some queue runners are available 98% of the time (~ 1 h downtime/week)
  6. 95% of scheduled tests run through to completion (all tests ran and status got reported to PR)

These are not very ambitious and need to be improved, but let’s start somewhere.
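
To get a feel for how the objectives above relate to each other, the following Python sketch works through the arithmetic. It assumes that attempts and individual tests fail independently, which real CI runs only approximate. The function names and the 40-hour working week are assumptions for illustration.

```python
def green_after_retries(p_first, retries=1):
    """Chance of at least one fully green run in 1 + retries independent attempts."""
    return 1 - (1 - p_first) ** (retries + 1)


def all_tests_pass(p_test, n_tests):
    """Chance that n_tests independent tests all pass in a single run."""
    return p_test ** n_tests


def downtime_hours(availability, hours_per_week):
    """Allowed downtime per week for a given availability target."""
    return (1 - availability) * hours_per_week


if __name__ == "__main__":
    print(green_after_retries(0.75))   # 0.9375
    print(all_tests_pass(0.90, 10))    # roughly 0.35
    print(downtime_hours(0.98, 168))   # calendar week, roughly 3.4 h
    print(downtime_hours(0.98, 40))    # assumed 40 h working week, roughly 0.8 h
```

Under the independence assumption, a 75% first-attempt rate gives 93.75% after one retry, slightly short of the 95% in objective 1; reaching 95% would need about 78% on the first attempt. Likewise, ten tests that each sit exactly at the 90% floor of objective 2 would all pass together only about 35% of the time, so in practice most tests have to be far more reliable than that floor.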
