Syncing from upstream odoo/runbot (17.0-upgrade-from-xdo) #735
Open
bt-admin wants to merge 255 commits into brain-tec:17.0-upgrade-from-xdo from odoo:17.0-upgrade-from-xdo
Conversation
I'd been convinced this was an ORM error because the field is not recursive... in runbot_merge; in forwardbot it is, and thus does indeed need to be flagged to avoid the warning.
The current system makes it hard to iterate on feedback messages and make them clearer; this should improve things a touch. Use a bespoke model to avoid concerns with qweb rendering complexity (we just want GFM output and should not need logic).

Also update the fwbot test setup to always configure an fwbot name, in order to avoid ping messages closing the PRs they're talking about. That took a while to debug, and given the old message I assume I'd already hit it and just been too lazy to fix. This requires updating a bunch of tests, as fwbot pings are sent *to* `fp_github_name` but sent *from* the reference user (because that's the key we set).

Note: noupdate on CSV files doesn't seem to work anymore, which isn't great. Instead, set tracking on the template's templates; it's not quite as good but should be sufficient.

Fixes #769
Largely informed by sentry:

- Fix an inconsistency in staging ref verification: `set_ref` internally waits for the observed head to match the requested head, but then `try_staging` would re-check that and immediately fail if it didn't. Because github is *eventually* consistent (hopefully), this second check can fail (and is also an extra API call), breaking staging unnecessarily, especially as we're still going to wait for the update to be visible to git. Remove this redundant check entirely: as github provides no way to ensure we have a consistent view of anything, it doesn't have much value and can do much harm.
- Add the github request id to one of the sanity check warnings, as that could be a useful thing to send upstream; missing github request ids should be noted and added in the future.
- Rework the GH object's calls to be clearer and more coherent: consistently log the same thing on all GH errors (if `check`), rather than just on the one without a `check` entry. Also remove `raise_for_status` and raise `HTTPError` by hand every time we hit a status >= 400, so we always forward the response body no matter what its type is (a sketch of this follows below).
- Try again to log the request body (in full, as it should be pretty small); also remove stripping, since we specifically wanted to add a newline at the start, I've no idea what I was thinking.

Fixes #735, #764, #544
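A minimal sketch of the error handling described in the third point, assuming a thin wrapper around `requests` (the function and logger names are made up, not the actual GH object's API):

```python
import logging
import requests

_logger = logging.getLogger(__name__)

def gh_call(session, method, url, check=True, **kw):
    # log the same information on every GH error, and raise HTTPError by
    # hand (instead of raise_for_status) so the response body is always
    # forwarded, whatever its content type
    response = session.request(method, url, **kw)
    if check and response.status_code >= 400:
        _logger.error(
            "%s %s -> %s (request id: %s)\n%s",
            method, url, response.status_code,
            response.headers.get('X-GitHub-Request-Id'),
            response.text,
        )
        raise requests.HTTPError(response.text, response=response)
    return response
```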
- move sentry configuration and add exception-based filtering
- clarify and reclassify (e.g. from warning to info) a few messages
- convert assertions in rebase to MergeError so they can be correctly logged & reported, and ignored by sentry; also clarify them (especially the consistency one)

Related to #544
Currently sentry is only hooked from the outside, which doesn't necessarily provide sufficiently actionable information. Add a few hooks to (try and) report odoo / mergebot metadata:

- add the user to WSGI transactions
- add a transaction (with users) around crons
- add the webhook event info to webhook requests
- add a few spans to the long-running crons, when they cover multiple units per iteration (e.g. a span per branch being staged)

Closes #544
During the 16.3 freeze an issue was noticed with the concurrency safety of the freeze wizard (because it blew up, which caused a few issues): it is possible for the cancelling of an active staging to the master branch to fail, which causes the mergebot side of the freeze to fail, while the github state is completed, which puts the entire thing in a less than ideal state. Especially with the additional issue that the branch inserter has its own concurrency issue (which maybe I should fix): if there are branches *being* forward-ported across the new branch, it's unable to see them, and thus cannot create the now-missing PRs.

Try to make the freeze wizard more resilient:

1. Take a lock on the master staging (if any) early on: if we can acquire it we should be able to cancel it, and it won't suffer a concurrency error (a sketch of this is given below).
2. Add the `process_updated_commits` cron to the set of locked crons: from the log timeline, it looks like the issue was commits being updated on that staging while the action had already started. REPEATABLE READ meant the freeze's transaction was unable to see the update from the commit statuses, therefore creating a diverging update when it cancelled the staging, which postgres then reported as a serialization error.

I'd like to relax the locking of the cron (to just FOR SHARE), and I think it would work, per postgres:

> SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the transaction start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the repeatable read transaction can proceed with updating the originally found row. But if the first updater commits (and actually updated or deleted the row, not just locked it) then the repeatable read transaction will be rolled back with the message `ERROR: could not serialize access due to concurrent update`.

This means it would be possible to lock the cron and then still get a transaction error, because the cron modified one of the records we're going to hit while it was running: as far as the above is concerned, the cron's worker had "just locked" the row so it's fine to continue. However this makes it more and more likely an error will be hit when trying to freeze (to no issue, but still). We'll have to see how that ends up.

Fixes #766 maybe
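A rough sketch of the early lock in point 1, under the assumption that the staging is a regular `runbot_merge.stagings` record locked via a raw query (the helper itself is hypothetical, only the model name comes from context):

```python
def _lock_active_master_staging(env, master_branch):
    # find the active staging on master, if any
    staging = env['runbot_merge.stagings'].search([
        ('target', '=', master_branch.id),
        ('active', '=', True),
    ])
    if staging:
        # acquire the row lock up front: if this succeeds, the later
        # cancellation should not run into a concurrent update; NOWAIT
        # turns contention into an immediate, explicit error instead of
        # blocking the freeze
        env.cr.execute(
            "SELECT id FROM runbot_merge_stagings"
            " WHERE id = ANY(%s) FOR UPDATE NOWAIT",
            [staging.ids],
        )
    return staging
```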
A few cases of conflict were missing from the provisioning handler. They can't really be auto-fixed, so just output a warning and ignore the entry; that way the rest of the provisioning succeeds.
Before this, when testing in parallel (using xdist) each worker would create its own template database (per module, so 2), then would copy the database for each test. This is pretty inefficient as the init of a db is quite expensive in CPU, and when increasing the number of workers (as the test suite is rather IO bound) this would trigger a stampede as every worker would try to create a template at the start of the test suite, leading to extremely high loads and degraded host performance (e.g. 16 workers would cause a load of 20 on a 4 core / 8 thread machine, which makes its use difficult).

Instead we can have a lockfile at a known location of the filesystem: the first worker to need a template for a module install locks it, creates the template, then writes the template's name to the lockfile. Every other worker can then lock the lockfile and read the name out, using that db for duplication.

Note: this needs to use `os.open` because the modes of `open` apparently can't express "open at offset 0 for reading or create for writing": `r+` refuses to create the file, `w+` still truncates, and `a+` is undocumented and might not allow seeking back to the start on all systems, so better avoid it.

The implementation would be simplified by using `lockfile`, but that's an additional dependency, plus it's deprecated. It recommends `fasteners`, but that seems to suck (not clear if storing stuff in the lockfile is supported, it opens the lockfile in append mode). Here the lockfiles are sufficient to do the entire thing.

Conveniently, this turns out to improve *both* walltime *and* CPU time compared to the original version, likely because while workers now have to wait on whoever is creating the template, they're not competing for resources with it.
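A sketch of the lockfile scheme described above (the path, function names, and `fcntl` locking are assumptions for illustration, not the actual conftest code):

```python
import fcntl
import os

def get_template_db(module, create_template):
    """First worker to grab the lock creates the template database and
    writes its name to the lockfile; later workers just read it back."""
    path = f"/tmp/runbot-tests-{module}.lock"
    # os.open: open at offset 0 for reading, or create for writing --
    # neither 'r+' (won't create) nor 'w+' (truncates) can express this
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)   # serialize template creation
        try:
            name = f.read().strip()
            if not name:
                name = create_template(module)  # expensive: init the db once
                f.write(name)
                f.flush()
            return name
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```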
Allow sorting by a callback. Sort remains client-side.
- add support for authorship (not just approval)
- make display counts directly
- fix `state` filter: postgres can't do negative index lookups
- add indexes for author and reviewed_by as we look them up
- ensure we handle the entire source filtering via a single subquery

Closes #778
Fix the outstanding query to make `state` filtering positive instead of negative, matching 3b52b1aace8674259812a76b1566260937dbcacb.

Also manually create a map of stagings (grouped by branch) sharing a single prefetch set. For odoo, the mergebot home page has 12 branches in the odoo project and 8 in spreadsheet, with 6 stagings each. This means 120 queries to retrieve all the heads (Odoo stagings have 5 heads and spreadsheet has 1, but that seems immaterial). By fixing `_compute_statuses` and creating a single prefetch set for all stagings of all branches, we can fetch all the commits in a single query instead of 120.
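A sketch of the shared prefetch set, assuming the Odoo ORM's `with_prefetch` helper (the surrounding variable names are invented for illustration, not the actual controller code):

```python
# collect the stagings to display, per branch
stagings_per_branch = {
    branch: env['runbot_merge.stagings'].search(
        [('target', '=', branch.id)], order='id desc', limit=6)
    for branch in project.branch_ids
}
# share a single prefetch set across all of them, so reading the head
# commits of any staging fetches the commits of every displayed staging
# in one query instead of one query per staging
all_ids = [s.id for stagings in stagings_per_branch.values() for s in stagings]
stagings_per_branch = {
    branch: stagings.with_prefetch(all_ids)
    for branch, stagings in stagings_per_branch.items()
}
```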
Allow filtering stagings by state (success or failure), and provide a control to explicitly update the staging date limit. Should make it easier to drill through stagings when looking for specific information. Related to #751
In the backend, the intermediate jump through batches is really not convenient (even if we kinda have to jump through batches *anyway*). Fixes #751
Remove unused imports, unnecessary f-strings, dead code, fix less-than-ideal operators.
Currently the heads of a staging (both staging heads and merged heads) are just JSON data on the staging itself. Historically this was convenient as the heads were mostly of use to the staging process, and thus accessed essentially exclusively through the staging.

However this makes finding stagings from merged commits, e.g. for forensic research, almost impossible, because querying based on the *values* of a JSON map is expensive, and indexing it is difficult.

To make this use case more feasible, split the `heads` field into two join tables, one for the staging heads and one for the merged heads; this makes looking for stagings by commits much more efficient (although the queries may not be trivial). Also add two utility RPC methods, so that it's possible to query stagings reasonably easily and efficiently based on a set of commits (branch heads).

Related to #768
`auto_session_tracking` causes issues when not specified on the super old version of the client which is available on ubuntu. Also disable tracing as it seems less useful than hoped for, and I've not been using what's been collected so far.
`/runbot_merge/stagings`
========================

This endpoint is a reverse lookup from any number of commits to a (number of) staging(s):

- it takes a list of commit hashes as either the `commits` or the `heads` keyword parameter
- it then returns the stagings which have *all* these commits as respectively commits or heads; if providing all commits for a project, the result should always be unique (if any)
- `commits` are the merged commits, aka the stuff which ends up in the actual branches
- `heads` are the staging heads, aka the commits at the tip of the `staging.$name` branches; those may be the same as the corresponding commit, or might be deduplicator commits which get discarded on success

`/runbot_merge/stagings/:id`
============================

Returns a list of all PRs in the staging, grouped by batch (aka PRs which have the same label and must be merged together).

For each PR, the `repository` name, `number`, and `name` in the form `$repository#$number` get returned.

`/runbot_merge/stagings/:id1/:id2`
==================================

Returns a list of all the *successfully merged* stagings between `id1` and `id2`, from oldest to most recent. Individual records have the form:

- `staging` is the id of the staging
- `prs` is the contents of the previous endpoint (a list of PRs grouped by batch)

`id1` *must* be lower than `id2`.

By default, this endpoint is inclusive on both ends; the `include_from` and / or `include_to` parameters can be passed with the `False` value to exclude the corresponding bound from the result.

Related to #768
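A hypothetical usage example for these endpoints (the host, commit hashes, and staging ids are placeholders, and the exact response shapes are assumptions rather than documented above):

```python
import requests

BASE = "https://mergebot.example.com"  # assumption: wherever runbot_merge is served

# reverse lookup: stagings containing *all* of these merged commits
staging_ids = requests.get(
    f"{BASE}/runbot_merge/stagings",
    params={"commits": ["0123abc", "4567def"]},
).json()

if staging_ids:
    # PRs of the first matching staging, grouped by batch
    batches = requests.get(
        f"{BASE}/runbot_merge/stagings/{staging_ids[0]}"
    ).json()

    # successfully merged stagings between two ids, excluding the lower bound
    history = requests.get(
        f"{BASE}/runbot_merge/stagings/1000/{staging_ids[0]}",
        params={"include_from": False},
    ).json()
```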
Mostly a temporary safety feature after the events of 07-31: it's still not clear whether that was a one-off issue or a change in policy (I was not able to reproduce locally, even doing several set_refs a second) and the gh support is not super talkative, but it probably doesn't hurt to commit the workaround until #247 gets implemented.

On 2023-07-31, around 08:30 UTC, `set_ref` started failing, a lot (although oddly enough not continuously), with the unhelpful message that

> 422: Reference cannot be updated

This basically broke all stagings, until a workaround was implemented by adding a 1s sleep before `set_ref` to ensure no more than 1 `set_ref` per second, which kinda sorta has been the github recommendation forever but had never been an issue before. Contributing to this suspicion is that in late 2022, the documentation of error 422 on `PATCH git/refs/{ref}` was updated to:

> Validation failed, or the endpoint has been spammed.

Still, it would be nice if GH was clear about it and sent a 429 instead. Technically the recommendation is:

> If you're making a large number of POST, PATCH, PUT, or DELETE requests for a single user or client ID, wait at least one second between each request.

So... actually implement that, on a per-worker basis, as for the most part these are serial processes (e.g. crons). We can still get above the rate limit due to concurrent crons, but it should be less likely.

Also take `Retry-After` into account; can't hurt, though we're supposed to retry just the request rather than abort the entire thing. Maybe a future update can improve this handling. Would also be nice to take `X-RateLimit` into account, although that's supposed to apply to *all* requests, so we'd need a second separate timestamp to track it. Technically that's probably also the case for `Retry-After`.

And fixing #247 should cut down drastically on the API call traffic, as staging is a very API-intensive process, especially with the sanity checks we had to add: these days we might be at 4 calls per commit per PR, and up to 80 PRs/staging (5 repositories and 16 batches per staging), with 13 live branches (though realistically only 6-7 have significant traffic, and only 1~2 get close to filling their staging slots).
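A bare-bones sketch of the per-worker throttle described above (class and method names are illustrative, not the actual mergebot code):

```python
import time

class GithubThrottle:
    """Ensure at least one second between mutating GitHub calls, and honour
    Retry-After when the endpoint reports it has been spammed."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def before_request(self):
        # per-worker: sleep just enough to keep one second between calls
        delay = self._last + self.min_interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()

    def after_response(self, response):
        retry_after = response.headers.get('Retry-After')
        if retry_after:
            # we only sleep it off here; ideally the request itself would be
            # retried rather than aborting the whole operation
            time.sleep(int(retry_after))
```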
In 81ce4ea the delta for PRs being listed on the `/forwardport/outstanding` page was increased from 3 days to 7 (1 week), however the warning box on the home page still used the old cutoff, leading to an inconsistency between the two and severe overcounting. The reason the cutoff was increased to a week is that forward ports can take a while, or the author / reviewer can be a touch busy at the end of the week, so 3~4 days are routine when a PR is merged on Thursday or Friday (and even worse if there are bank holidays in the mix).
Can't hurt to *have* them.
When I updated the status storage (including `previous_failure`) for some reason I didn't just migrate from the old to the new format, and added bridge functions instead. This is not really necessary (or useful), so convert all the legacy data and remove the conversion helpers. Relates to #802
- add formatting for a bunch of backend objects
- add cross-links in order to use toplevel navigation between objects, e.g. project -> branch -> staging -> PR, with breadcrumbs instead of shitty dialog boxes

Relates to #802
`_rec_name = 'sha'` means name_search and cross-model searches will work much better. Relates to #802
If the stagings are going to be created locally (via a git working copy rather than the github API), the mergebot part needs to have access to the cache, so move the cache over. Also move the maintenance cron. In an extremely minor way, this prefigures the (hopeful) eventual merging of the ~~planes~~ modules.
The current runbot infrastructure has 100+ machines that will each build all docker images. This is impractical for multiple reasons:

- important load on every machine when the build could be done once
- possible differences for the same image depending on the moment the base was pulled on the host. An extreme example was the version of python (3.11 or 3.12) when building the noble docker image before it was stabilized.
- increased chance of a (random) docker build failure in case of network problems.

A centralized registry will help with that, and allow future development to generate more docker images. All docker images will be exactly the same, the pull time is faster than the build time, only one build per docker image, ...
The underscore was previously used in database names, but this is not a valid character for certificates (some investigation is still needed about the exact limitations). Even where access worked, it was causing issues for IoT and others; _ is actually forbidden on saas. The design_theme database was changed to design-theme, but this was not accessible through the nginx config. This should solve the issue.
Use 2 jobs to fetch multiple origins at the same time.
The stats are only kept when the build finishes. This makes sense to avoid collecting stats of a manually killed build (or one killed because a newer build was found), but the stats may be interesting when the build times out. Add a manual collection of stats in this case.
A common error on runbot is to generate a link containing an __init__.py file:

    [/some/path/to/__init__.py](/some/path/to/__init__.py)

This would be rendered as

    <a href="/some/path/to/<ins>init<ins>.py">/some/path/to/<ins>init<ins>.py</a>

breaking the link and the display of the name. By default markdown will not render links, avoiding this issue, but it remains for the content of the anchor, which needs some kind of escaping. The way to escape markdown is to add a \ before any special character. This must be done upfront, before formatting; hence the new method markdown_escape.

Our implementation of markdown is not meant to meet the exact specification of markdown, but to better suit our needs. One of the requirements is to be able to use it to format messages easily while adding dynamic content coming from the outside. Another error that can occur is that 'Some code `%s`' % code can cause problems if code contains `. This issue could be solved using an indented code block, but this would require complexifying the generated string, having a dedicated method to escape code blocks, ... Since we have control over the input, we can easily sanitize all dynamic content to avoid such issues. For code blocks we introduce a way to escape backticks (\`). It is non-standard but will be easier to use.

Combined with that, the build._log method now allows adding args with the values to format the string (similar to logging), but will escape params by default (cr.execute spirit):

    name = '__init__.py'
    url = 'path/to/__init__.py'
    code = '# comment `for` something'
    build._log('f', 'Some message [%s](%s) \n `%s`', name, url, code)

name, url and code will be escaped (a hedged sketch of such an escaping helper follows below).
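A possible shape for the escaping helper, assuming a simple backslash-prefix approach (this is a sketch, not the actual runbot implementation, whose escaping rules and character set may differ):

```python
import re

# characters our markdown subset treats as significant (assumed list)
MD_SPECIAL = r'\`*_{}[]()#+-.!'

def markdown_escape(text):
    # prefix every special character with a backslash so dynamic content
    # (paths, code snippets, ...) cannot break the generated markdown
    return re.sub(r'([%s])' % re.escape(MD_SPECIAL), r'\\\1', str(text))

# mirroring the example above: escape the dynamic parts before interpolation
name = markdown_escape('__init__.py')
url = markdown_escape('path/to/__init__.py')
code = markdown_escape('# comment `for` something')
message = 'Some message [%s](%s)\n`%s`' % (name, url, code)
```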
Sort triggers and group them by category; better selection of categories to display on the stats page.
When disabling tests on runbot by using the test_tags on a build error, the tests are disabled on every tested version. This can be annoying when a test only fails from one version up to another. With this commit, a min and max version can be specified on a build_error when the test_tags field is used. If the min and max are not used, the test will be disabled across all versions.
Now returns a dict instead of a tuple, for easier extension.
The docker operations are called often and cannot be logged each time. If a docker operation is slow, we log it at the end, but while waiting for this output we have no idea what the process is doing. This PR proposes to go a level lower and get the stream from the docker operation, to be able to log earlier when a potentially slow operation has started.
When a lot of Docker images are updated at the same time, all runbot hosts will try to pull them at the same moment. With this commit, only the images marked as `always_pull` will be pulled, and if one of them takes too much time to pull, we let the host make another loop turn before continuing to pull images. Finally, if a host uses the Docker registry, it will pull the remote image when running a step; that way the image will be pulled automatically if needed.
The idea is to help the runbot/QA team monitor other people's tasks more easily. This way we will have a responsible person for each task that someone else has assigned. The goal of this PR is two-fold: 1) we need to be able to better monitor the progress of tasks that are not our own; 2) it introduces another metric with which we can measure individual progress, and this can become the main metric for future non-technical employees whose main goal will be making sure that none of the tasks become rotten (i.e. have no visible progress for months).
A recent task introduced a record loop using self inside it. It doesn't follow the classic api.multi loop pattern, but it isn't an issue since onchanges are always called on a single record. Remove the loop from all onchanges to clean them up.
This commit prevents people acting like cowboys from breaking the whole system by duplicating a Dockerfile without disabling the always_pull field.
When an image is built for the first time and the build fails, an ImageNotFound exception is raised when trying to push the image. With this commit, a warning is emitted instead of crashing in such a case.
As we try to assign ourselves as the customer on build errors, it is useful to add a new filter to more easily find the errors for which we are the customer.
When a type error occurs while trying to format a message in a build log, the suspicious args are joined with the message. But as the args may be a tuple, an error occurs when concatenating the message with the args during the join. With this commit, we ensure that the args are cast into a list.
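A small sketch of the fix, under the assumption that the fallback path simply appends the args to the logged message (the function name is illustrative):

```python
def safe_format(message, args):
    try:
        return message % args
    except TypeError:
        # args may be a tuple; cast it to a list so it can be concatenated
        # with the message parts without raising a second error
        values = list(args) if isinstance(args, (list, tuple)) else [args]
        return ' '.join([message] + [repr(a) for a in values])
```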
This commit adds the utf-8 Content-Type to nginx response headers for txt files. That way, the log files can be properly viewed in browsers.