
Audit QoS settings in Nav2; Create consistent internal profiles #4888

Open · SteveMacenski opened this issue Jan 30, 2025 · 2 comments

Labels: enhancement, help wanted

Comments

@SteveMacenski (Member)

It would be good to do some auditing on QoS settings in Nav2 and make sure that they are sensible. In particular:

  • Subscriber depths should be as small as possible to keep the data fresh and only process what is live and needed. Older data probably doesn't need to be kept if the system isn't keeping up. However:
  • Publisher depths should be larger, so that they can keep a queue (when using async publication) and make sure messages aren't lost; it's up to the subscriptions to decide whether they're interested
  • Remove or reduce use of best effort QoS, which has the unfortunate side effect of blasting data at full rate without any ACKs, even if the network is having trouble keeping up. Really, it should be used sparingly and only when on the same CPU: even then, best effort's CPU performance is not great, and it can make network issues worse since it ignores both what else is going on and its own communication status
  • Transient local durability for any statically published data, so late joiners still receive it (see the sketch after this list)
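For illustration, a minimal sketch of how those points could map onto rclcpp::QoS when creating publishers and subscriptions; the depths, topic names, and message type here are placeholders rather than audited values:

```cpp
#include "rclcpp/rclcpp.hpp"
#include "nav_msgs/msg/occupancy_grid.hpp"

// Illustration only: depths, topic names, and message type are placeholders.
class QosExample : public rclcpp::Node
{
public:
  QosExample()
  : Node("qos_example")
  {
    // Shallow subscription queue: if this node falls behind, stale samples are dropped.
    sub_ = create_subscription<nav_msgs::msg::OccupancyGrid>(
      "costmap_in", rclcpp::QoS(1).reliable(),
      [](nav_msgs::msg::OccupancyGrid::ConstSharedPtr /*msg*/) { /* process latest data */ });

    // Deeper, reliable publisher queue: async publication can buffer without losing samples.
    pub_ = create_publisher<nav_msgs::msg::OccupancyGrid>(
      "costmap_out", rclcpp::QoS(10).reliable());

    // Transient local for static data: late-joining subscribers still receive the last sample.
    map_pub_ = create_publisher<nav_msgs::msg::OccupancyGrid>(
      "map", rclcpp::QoS(1).reliable().transient_local());
  }

private:
  rclcpp::Subscription<nav_msgs::msg::OccupancyGrid>::SharedPtr sub_;
  rclcpp::Publisher<nav_msgs::msg::OccupancyGrid>::SharedPtr pub_;
  rclcpp::Publisher<nav_msgs::msg::OccupancyGrid>::SharedPtr map_pub_;
};
```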

Consider also adding in support for the newer features like:

  • Deadline to get notified when issues arise and handle them (deadline and lifespan are sketched after this list)
  • Lifespan to disregard sufficiently old data
  • Liveliness to possibly replace bond in the lifecycle manager (but probably not?)
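For the first two of these, a hedged sketch of what deadline and lifespan could look like on a subscription with rclcpp; the durations, topic, and message type are placeholders:

```cpp
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("qos_event_demo");
  auto logger = node->get_logger();

  // Placeholder durations; real values would come out of the audit.
  rclcpp::QoS qos(1);
  qos.deadline(rclcpp::Duration::from_seconds(0.1));  // expect a sample at least every 100 ms
  qos.lifespan(rclcpp::Duration::from_seconds(0.5));  // samples older than 500 ms are discarded

  // Get a callback when the deadline is missed so the issue can be handled,
  // e.g. warn, clear state, or switch to a degraded mode.
  rclcpp::SubscriptionOptions options;
  options.event_callbacks.deadline_callback =
    [logger](rclcpp::QOSDeadlineRequestedInfo & info) {
      RCLCPP_WARN(logger, "Deadline missed %d times", info.total_count);
    };

  auto sub = node->create_subscription<std_msgs::msg::String>(
    "chatter", qos,
    [](std_msgs::msg::String::ConstSharedPtr /*msg*/) { /* normal data path */ },
    options);

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```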

It might also be good to have launch files that launch and run a discovery server to reduce network traffic.

DDS configs like transport layer, sync vs. async, local host only, buffer / fragment sizes, unicast, etc. should be left to the user's setup.


Considerations

I think we could have a set of Nav2 specific QoS policies for "publishers", "subscribers", and "latched" so that these are portable and consistent across the code base. During the audit, rather than fix each one, we can move each to use our default profiles, unless there is a compelling reason for some to be differentiated.

We could even wrap the create_XYZ() factories in a Nav2 version that also does things under the hood with respect to QoS override acceptance, deadline/liveliness callbacks (for nav2_utils::LifecycleNode to handle), perhaps even lifecycle handling for subscriptions, etc.
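As a rough sketch (nothing like this exists in Nav2 today; the nav2_util names below are hypothetical), the shared profiles and a wrapped factory could look something like:

```cpp
// Hypothetical header, e.g. nav2_util/qos_profiles.hpp -- not an existing Nav2 API.
#pragma once

#include <string>
#include <utility>

#include "rclcpp/rclcpp.hpp"

namespace nav2_util
{

// Shared internal profiles; depths here are placeholders pending the audit.
inline rclcpp::QoS qos_subscription() {return rclcpp::QoS(1).reliable();}
inline rclcpp::QoS qos_publisher() {return rclcpp::QoS(10).reliable();}
inline rclcpp::QoS qos_latched() {return rclcpp::QoS(1).reliable().transient_local();}

// Factory wrapper that applies the default profile and accepts QoS overrides,
// so every subscription in the stack is created the same way. A real version
// could also hook deadline/liveliness event callbacks here.
template<typename MsgT, typename NodeT, typename CallbackT>
auto create_subscription(
  NodeT & node, const std::string & topic, CallbackT && callback,
  const rclcpp::QoS & qos = qos_subscription())
{
  rclcpp::SubscriptionOptions options;
  // Let users override depth / reliability / durability via parameters.
  options.qos_overriding_options = rclcpp::QosOverridingOptions::with_default_policies();
  return node.template create_subscription<MsgT>(
    topic, qos, std::forward<CallbackT>(callback), options);
}

}  // namespace nav2_util
```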

@SteveMacenski added the enhancement and help wanted labels on Jan 30, 2025
@SteveMacenski changed the title from "Audit QoS settings in Nav2" to "Audit QoS settings in Nav2; Create consistent internal profiles" on Jan 30, 2025
@ewak (Contributor) commented Jan 30, 2025

I commend the idea.

When not using Best Effort, I think I've run into the problem of slow subscribers effectively live-locking the publishers and other subscribers. I try to mitigate that somewhat, especially when networking is involved (say, for example, costmap visualization over WiFi), using some features of zenoh-ros2dds-bridge (on Humble). https://github.com/eclipse-zenoh/zenoh-plugin-ros2dds/blob/b662a95730f8daeaa71669406d0e00bef7898bd5/DEFAULT_CONFIG.json5#L95

Is it just me and my misguided superstition?

Do you have go-to tools/techniques to discover/measure/pinpoint slow subscribers?

@SteveMacenski (Member, Author) commented Jan 31, 2025

I think I've run into the problem of slow subscribers effectively live-locking the publishers and other subscribers.

I believe that is essentially the behavior when using DDS' synchronous publishing (the default): since the subscriber sends ACKs to the publisher, the publisher does flow control on its publication rate depending on packet loss and network bandwidth. There's some relationship going on there under the hood, which is good when you need to reliably guarantee delivery. I've recently been auditing the DDS implementations' documentation and tuning guides, and this is the reason they recommend sync publication for robotics / critical systems (whereas async, where the message is published from a separate thread, is better for high-rate streaming data and lower overhead, but comes at the downside of being less reliable and having fewer tools for critical data guarantees).

So no, I don't think that's superstition; reliable transport over WiFi is probably bad for that (and likely other) reasons. This is a benefit of the Best Effort publisher, where there is no ACK or flow control (it instead sends at full rate without regard for whether the subscriber's processing can keep up or for networking load).

Do you have go-to tools/techniques to discover/measure/pinpoint slow subscribers?

I don't have any myself, but I know of some. ROS 2 Tracing was built to put tracepoints across the software so you can have timing information while things are processing. I haven't used it, but I know it is well loved. There are also DDS logs, but those are a firehose, and I've never personally successfully debugged a problem from them given how much is going on in Nav2.

I'd start by adding timer wrappers around the callbacks I'm suspicious of so that you can measure / log the timing of callbacks and see what's taking longer than you expect. It's like tracing, I suppose, but personally easier for me to implement and manage when testing single areas at a time. A templated util function could pretty easily be made that owns the subscriber's callback via a provided lambda and wraps it with a chrono start/end time, logging the difference to a file / screen.
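A rough sketch of that kind of wrapper, assuming rclcpp; the names here (time_callback, the topic, the message type) are made up for illustration:

```cpp
#include <chrono>
#include <utility>

#include "rclcpp/rclcpp.hpp"
#include "nav_msgs/msg/occupancy_grid.hpp"

// Wrap a subscription callback so every invocation logs its own duration.
template<typename MsgT, typename CallbackT>
auto time_callback(const rclcpp::Logger & logger, const char * label, CallbackT && callback)
{
  return [logger, label, cb = std::forward<CallbackT>(callback)](
    typename MsgT::ConstSharedPtr msg)
    {
      const auto start = std::chrono::steady_clock::now();
      cb(msg);  // run the real callback
      const auto end = std::chrono::steady_clock::now();
      const auto us =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
      RCLCPP_INFO(logger, "%s callback took %ld us", label, static_cast<long>(us));
    };
}

// Hypothetical usage: wrap the costmap callback when creating the subscription.
// auto sub = node->create_subscription<nav_msgs::msg::OccupancyGrid>(
//   "costmap", rclcpp::QoS(1),
//   time_callback<nav_msgs::msg::OccupancyGrid>(
//     node->get_logger(), "costmap",
//     [](nav_msgs::msg::OccupancyGrid::ConstSharedPtr msg) { /* real work */ }));
```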
