
Audit QoS settings in Nav2; Create consistent internal profiles #4888

Open · SteveMacenski opened this issue Jan 30, 2025 · 2 comments

Labels: enhancement, help wanted

Comments

@SteveMacenski (Member)

It would be good to do some auditing on QoS settings in Nav2 and make sure that they are sensible. In particular:

  • Subscriber depths should be as small as possible to keep the data fresh and only process what is live and needed. Older data probably doesn't need to be kept if the system isn't keeping up. However:
  • Publisher depths should be larger, so that they can keep a queue (when using async publication) and make sure messages aren't lost; it's up to the subscriptions to decide whether they're interested
  • Remove or reduce use of best effort QoS, which has the unfortunate side effect of blasting data at full rate without any ACKs, even if the network is having trouble keeping up. Really, it should be used sparingly and only when on the same CPU: even then, best effort's CPU performance is not great, and it can make network issues worse since it ignores both what else is going on and its own communication status
  • Transient local durability for any statically published data, so late joiners still receive it (see the sketch after this list)
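For illustration, a minimal sketch of how those points could map onto rclcpp::QoS when creating publishers and subscriptions; the depths, topic names, and message type here are placeholders rather than audited values:

```cpp
#include "rclcpp/rclcpp.hpp"
#include "nav_msgs/msg/occupancy_grid.hpp"

// Illustration only: depths, topic names, and message type are placeholders.
class QosExample : public rclcpp::Node
{
public:
  QosExample()
  : Node("qos_example")
  {
    // Shallow subscription queue: if this node falls behind, stale samples are dropped.
    sub_ = create_subscription<nav_msgs::msg::OccupancyGrid>(
      "costmap_in", rclcpp::QoS(1).reliable(),
      [](nav_msgs::msg::OccupancyGrid::ConstSharedPtr /*msg*/) { /* process latest data */ });

    // Deeper, reliable publisher queue: async publication can buffer without losing samples.
    pub_ = create_publisher<nav_msgs::msg::OccupancyGrid>(
      "costmap_out", rclcpp::QoS(10).reliable());

    // Transient local for static data: late-joining subscribers still receive the last sample.
    map_pub_ = create_publisher<nav_msgs::msg::OccupancyGrid>(
      "map", rclcpp::QoS(1).reliable().transient_local());
  }

private:
  rclcpp::Subscription<nav_msgs::msg::OccupancyGrid>::SharedPtr sub_;
  rclcpp::Publisher<nav_msgs::msg::OccupancyGrid>::SharedPtr pub_;
  rclcpp::Publisher<nav_msgs::msg::OccupancyGrid>::SharedPtr map_pub_;
};
```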

Consider also adding in support for the newer features like:

  • Deadline to get notified when issues arise and handle them (deadline and lifespan are sketched after this list)
  • Lifespan to disregard sufficiently old data
  • Liveliness to possibly replace bond in the lifecycle manager (but probably not?)
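For the first two of these, a hedged sketch of what deadline and lifespan could look like on a subscription with rclcpp; the durations, topic, and message type are placeholders:

```cpp
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("qos_event_demo");
  auto logger = node->get_logger();

  // Placeholder durations; real values would come out of the audit.
  rclcpp::QoS qos(1);
  qos.deadline(rclcpp::Duration::from_seconds(0.1));  // expect a sample at least every 100 ms
  qos.lifespan(rclcpp::Duration::from_seconds(0.5));  // samples older than 500 ms are discarded

  // Get a callback when the deadline is missed so the issue can be handled,
  // e.g. warn, clear state, or switch to a degraded mode.
  rclcpp::SubscriptionOptions options;
  options.event_callbacks.deadline_callback =
    [logger](rclcpp::QOSDeadlineRequestedInfo & info) {
      RCLCPP_WARN(logger, "Deadline missed %d times", info.total_count);
    };

  auto sub = node->create_subscription<std_msgs::msg::String>(
    "chatter", qos,
    [](std_msgs::msg::String::ConstSharedPtr /*msg*/) { /* normal data path */ },
    options);

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```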

It might also be good to have launch files that launch and run a discovery server to reduce network traffic.

DDS configs like transport layer, sync vs. async, local host only, buffer / fragment sizes, unicast, etc. should be left to the user's setup.


Considerations

I think we could have a set of Nav2 specific QoS policies for "publishers", "subscribers", and "latched" so that these are portable and consistent across the code base. During the audit, rather than fix each one, we can move each to use our default profiles, unless there is a compelling reason for some to be differentiated.

We could even wrap the create_XYZ() factories in a Nav2 version that also does things under the hood with respect to QoS override acceptance, deadline/liveliness callbacks (for nav2_utils::LifecycleNode to handle), perhaps even lifecycle handling for subscriptions, etc.
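As a rough sketch (nothing like this exists in Nav2 today; the nav2_util names below are hypothetical), the shared profiles and a wrapped factory could look something like:

```cpp
// Hypothetical header, e.g. nav2_util/qos_profiles.hpp -- not an existing Nav2 API.
#pragma once

#include <string>
#include <utility>

#include "rclcpp/rclcpp.hpp"

namespace nav2_util
{

// Shared internal profiles; depths here are placeholders pending the audit.
inline rclcpp::QoS qos_subscription() {return rclcpp::QoS(1).reliable();}
inline rclcpp::QoS qos_publisher() {return rclcpp::QoS(10).reliable();}
inline rclcpp::QoS qos_latched() {return rclcpp::QoS(1).reliable().transient_local();}

// Factory wrapper that applies the default profile and accepts QoS overrides,
// so every subscription in the stack is created the same way. A real version
// could also hook deadline/liveliness event callbacks here.
template<typename MsgT, typename NodeT, typename CallbackT>
auto create_subscription(
  NodeT & node, const std::string & topic, CallbackT && callback,
  const rclcpp::QoS & qos = qos_subscription())
{
  rclcpp::SubscriptionOptions options;
  // Let users override depth / reliability / durability via parameters.
  options.qos_overriding_options = rclcpp::QosOverridingOptions::with_default_policies();
  return node.template create_subscription<MsgT>(
    topic, qos, std::forward<CallbackT>(callback), options);
}

}  // namespace nav2_util
```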

@SteveMacenski added the enhancement and help wanted labels on Jan 30, 2025
@SteveMacenski changed the title from "Audit QoS settings in Nav2" to "Audit QoS settings in Nav2; Create consistent internal profiles" on Jan 30, 2025
@ewak (Contributor) commented Jan 30, 2025

I commend the idea.

When not using Best Effort, I think I've run into the problem of slow subscribers effectively live-locking the publishers and other subscribers. I try to mitigate that somewhat, especially when networking is involved (say, for example, costmap visualization over WiFi), using some features of zenoh-ros2dds-bridge (on Humble). https://github.com/eclipse-zenoh/zenoh-plugin-ros2dds/blob/b662a95730f8daeaa71669406d0e00bef7898bd5/DEFAULT_CONFIG.json5#L95

Is it just me and my misguided superstition?

Do you have go-to tools/techniques to discover/measure/pinpoint slow subscribers?

@SteveMacenski (Member, Author) commented Jan 31, 2025

I think I've run into the problem of slow subscribers effectively live-locking the publishers and other subscribers.

I believe that is essentially the behavior when using DDS' synchronous publishing (the default): since the subscriber sends ACKs to the publisher, the publisher does flow control on its publication rate depending on packet loss and network bandwidth. There's some relationship going on there under the hood, which is good when you need to reliably guarantee delivery. I've recently been auditing the DDS implementations' documentation and tuning guides, and this is the reason they recommend sync publication for robotics / critical systems (whereas async, where the message is published from a separate thread, is better for high-rate streaming data and lower overhead, but comes at the downside of being less reliable and having fewer tools for critical data guarantees).

So no, I don't think that's superstition; reliable transport over WiFi is probably bad for that (and likely other) reasons. This is a benefit of the Best Effort publisher, where there is no ACK or flow control (it instead sends at full rate without regard for whether the subscriber's processing can keep up or for networking load).

Do you have go-to tools/techniques to discover/measure/pinpoint slow subscribers?

I don't have any myself, but I know of some. ROS 2 Tracing was built to put tracepoints across the software so you can have timing information while things are processing. I haven't used it, but I know it is well loved. There are also DDS logs, but those are a firehose, and I've never personally successfully debugged a problem from them given how much is going on in Nav2.

I'd start by adding timer wrappers around the callbacks I'm suspicious of so that you can measure / log the timing of callbacks and see what's taking longer than you expect. It's like tracing, I suppose, but personally easier for me to implement and manage when testing single areas at a time. A templated util function could pretty easily be made that owns the subscriber's callback via a provided lambda and wraps it with a chrono start/end time, logging the difference to a file / screen.
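A rough sketch of that kind of wrapper, assuming rclcpp; the names here (time_callback, the topic, the message type) are made up for illustration:

```cpp
#include <chrono>
#include <utility>

#include "rclcpp/rclcpp.hpp"
#include "nav_msgs/msg/occupancy_grid.hpp"

// Wrap a subscription callback so every invocation logs its own duration.
template<typename MsgT, typename CallbackT>
auto time_callback(const rclcpp::Logger & logger, const char * label, CallbackT && callback)
{
  return [logger, label, cb = std::forward<CallbackT>(callback)](
    typename MsgT::ConstSharedPtr msg)
    {
      const auto start = std::chrono::steady_clock::now();
      cb(msg);  // run the real callback
      const auto end = std::chrono::steady_clock::now();
      const auto us =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
      RCLCPP_INFO(logger, "%s callback took %ld us", label, static_cast<long>(us));
    };
}

// Hypothetical usage: wrap the costmap callback when creating the subscription.
// auto sub = node->create_subscription<nav_msgs::msg::OccupancyGrid>(
//   "costmap", rclcpp::QoS(1),
//   time_callback<nav_msgs::msg::OccupancyGrid>(
//     node->get_logger(), "costmap",
//     [](nav_msgs::msg::OccupancyGrid::ConstSharedPtr msg) { /* real work */ }));
```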
