fix: use rabbit quorum queues in lieu of ha #294

Merged
merged 1 commit into from
Jun 4, 2024

LukeRepko
Contributor

@LukeRepko LukeRepko commented Jun 4, 2024

We define quorum queues via kustomize as the default queue type for the named vhosts, but the oslo_messaging_rabbit config option rabbit_ha_queues: true was also set and took precedence. We do not want HA (mirrored) queues: they are deprecated and will be removed in newer versions of RMQ (4.x, due for release at the end of 2024). Genestack used HA queues until now because it carried forward openstack-helm defaults that were sane but are no longer ideal.

This change explicitly disables rabbit_ha_queues and enables rabbit_quorum_queue. Because the type of an existing queue cannot be changed in place, the related RabbitMQ vhost must be removed before re-deploying a given OpenStack service.
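As a rough sketch of what the option change could look like for nova, the snippet below writes a Helm values override file. The file name and the `conf.nova.oslo_messaging_rabbit` nesting are assumptions based on the usual openstack-helm layout, not something taken from this PR; only the two option names come from the description above.

```shell
#!/bin/sh
# Sketch only: write a values override that turns off mirrored (HA) queues
# and turns on quorum queues for nova's oslo.messaging RabbitMQ driver.
# ASSUMPTIONS: the file name and the conf.nova nesting are illustrative.
set -eu

OVERRIDES="${OVERRIDES:-nova-rabbit-overrides.yaml}"

cat > "$OVERRIDES" <<'EOF'
conf:
  nova:
    oslo_messaging_rabbit:
      rabbit_ha_queues: false    # stop declaring mirrored (HA) queues
      rabbit_quorum_queue: true  # declare quorum queues instead
EOF

echo "wrote $OVERRIDES"
```

A file like this could then be passed to the upgrade with `-f nova-rabbit-overrides.yaml`, assuming the chart reads these keys.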

Example of re-deploying nova when making this change; note how we remove the queue, vhost, and user:

kubectl -n openstack delete queues.rabbitmq.com nova-queue
kubectl -n openstack delete vhosts.rabbitmq.com nova-vhost
kubectl -n openstack delete users.rabbitmq.com nova
helm upgrade --install nova ./nova

NOTE: Several helm upgrades may be required due to a race condition with the operator removing the vhost. Uninstalling first may be easier, but do so carefully.
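Since more than one upgrade may be needed, one option is to wrap the upgrade in a bounded retry loop. The sketch below assumes that an upgrade hitting the race exits non-zero, which may not always hold. `retry_until_ok` is a hypothetical helper written for this example, not part of helm or kubectl, and the attempt count and delay are arbitrary.

```shell
# Sketch only: retry a command a bounded number of times with a fixed pause.
# retry_until_ok is a hypothetical helper; it takes the maximum number of
# attempts, the pause in seconds, and then the command to run.
retry_until_ok() {
  max_attempts=$1
  delay_seconds=$2
  shift 2
  attempt=1
  # Re-run the command until it succeeds or the attempt budget is spent.
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay_seconds}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}

# Example (not run here): retry the nova upgrade up to 5 times, 30s apart.
# retry_until_ok 5 30 helm upgrade --install nova ./nova
```

The loop only reacts to the exit status, so it is still worth confirming afterwards that the vhost and queues were recreated.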

Other changes:

  • add: rabbit_transient_quorum_queue, which is newly available in 2024.1. We will want to begin using this to make transient queues reliable
  • add: use_queue_manager, which is newly available in 2024.1. We will want to begin using this when available to de-obfuscate named queues in rabbit
  • add: rabbit_interval_max to reconnect faster after a node outage
  • fix: send heartbeats more frequently; clients should mark a given node as down about 30s more quickly (default was 60s)
  • fix: set kombu_reconnect_delay lower to help avoid multiple code paths not being traversed when an RMQ node goes down
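For illustration, the options listed above might sit together in a values override like the one written below. Every value here is a placeholder chosen for the example, not the value set in this PR. The file name and the `conf.nova` nesting are assumptions, and `heartbeat_timeout_threshold` is assumed to be the oslo.messaging option behind the heartbeat change (its default of 60s matches the default mentioned above).

```shell
#!/bin/sh
# Sketch only: collect the tuning options from the list above in one
# values override. ALL VALUES ARE PLACEHOLDERS, not the ones from this PR.
set -eu

TUNING="${TUNING:-rabbit-tuning-overrides.yaml}"

cat > "$TUNING" <<'EOF'
conf:
  nova:
    oslo_messaging_rabbit:
      rabbit_transient_quorum_queue: true  # placeholder; new in 2024.1
      use_queue_manager: true              # placeholder; new in 2024.1
      rabbit_interval_max: 10              # placeholder; caps the reconnect backoff
      heartbeat_timeout_threshold: 30      # placeholder; assumed option name, default 60
      kombu_reconnect_delay: 0.5           # placeholder; lower than the default
EOF

echo "wrote $TUNING"
```

Whether the two 2024.1 options should be enabled yet depends on the deployed OpenStack release, so they are best treated as opt-in.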

@LukeRepko LukeRepko marked this pull request as draft June 4, 2024 20:53
@LukeRepko LukeRepko marked this pull request as ready for review June 4, 2024 21:16
@cloudnull cloudnull merged commit ec0ff9e into rackerlabs:main Jun 4, 2024
13 checks passed
@LukeRepko LukeRepko deleted the sjcdev branch September 5, 2024 14:48