Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash: Removing MarkerArrays not threadsafe? #1469

Open
simonschmeisser opened this issue Feb 7, 2020 · 11 comments
Open

Crash: Removing MarkerArrays not threadsafe? #1469

simonschmeisser opened this issue Feb 7, 2020 · 11 comments

Comments

@simonschmeisser
Copy link
Contributor

I displayed a large (~3000) set of axes and when removing them from the display again (via the corresponding marker message) rviz crashed. There seems to be a race condition or similar problem between Thread 18 removing a material and Thread 1 displaying it

Please see the gist for a summarized backtrace
https://gist.github.com/simonschmeisser/12b1f4d6e0bcbd61516ec8646628c585

Your environment

  • OS Version: e.g. Ubuntu 18.04
  • ROS Distro: Melodic
  • RViz, Qt, OGRE, OpenGl version as printed by rviz:
    [ INFO] [1581070677.421391455]: rviz version 1.13.5
    [ INFO] [1581070677.421432363]: compiled against Qt version 5.12.4
    [ INFO] [1581070677.421446500]: compiled against OGRE version 1.9.0 (Ghadamon)
    
  • If source build, which git commit? 5805917
@simonschmeisser
Copy link
Contributor Author

I recorded a bag file but cannot reproduce in stand-alone rviz so far:
poses.bag.zip

@simonschmeisser
Copy link
Contributor Author

I have another similar crash in RobotStateDisplay this time. I guess this is due to our code having an AsyncSpinner running ...

@rhaschke do you have an idea for a general solution or should I go through all displays and use signal-slots to fix threading and move all rendering related stuff to the gui thread? Would be some PRs ...

@rhaschke
Copy link
Contributor

rhaschke commented Mar 3, 2020

rviz doesn't use a spinner, but calls spinOnce in VisualizationManager::onUpdate() - synchronously to the main GUI thread. Where do you use a spinner? Within a custom display / plugin?
If so, you should also use a custom message queue for this spinner. See point_cloud_common.cpp for an example:

, spinner_(1, &cbqueue_)

@simonschmeisser
Copy link
Contributor Author

We embed a rviz::RenderPanel into our main Qt based UI. Somewhere in there an AsyncSpinner is started. Now after reading your reply I first thought this is not the way rviz is meant to be used, but actually rviz::Display does provide separate NodeHandles for synchronous and asynchronous message handling ...

Now understanding the concept of message queues and NodeHandles better I think I found the bug in RobotStateDisplay at least:
In https://github.com/ros-planning/moveit/blob/e8a4072727abe4c704d009f05baa764bcf106b33/moveit_ros/visualization/robot_state_rviz_plugin/src/robot_state_display.cpp#L300
a new NodeHandle with the default message queue is used while it actually should use the synchronised update message queue from rviz::Display via Display::update_nh_ I'll see if that fixes my crash and open a PR

@rhaschke
Copy link
Contributor

rhaschke commented Mar 3, 2020

I think, rviz assumes to a large extent that ROS callbacks are executed synchronously with the main GUI thread. There are only a very few exceptions to this rule, namely displays derived from PointCloudCommon. This indeed, provides a separate callback queue and a spinner processing this queue. I think, you should call ros::spinOnce() from your main GUI thread frequently...

Regarding RobotStateDisplay: As the new NodeHandle doesn't have its own callback queue, it is just using the global one as well, which is processed from the GUI thread. I think, changing to update_nh_ will have no effect.

@simonschmeisser
Copy link
Contributor Author

ros::NodeHandle update_nh_;

at least according to the comment here it should have the desired effect of being run synchronous and indeed I do get different crashes now ...

@rhaschke
Copy link
Contributor

rhaschke commented Mar 3, 2020

Could you please verify that the two handles use different (or the same) queues?

@simonschmeisser
Copy link
Contributor Author

Never trust a comment/doc-string ...

return ros::getGlobalCallbackQueue();

So this should probably be changed to actually use a separate queue that really is processed synchronously? I'll open an rviz PR

I also experimented with locking the RobotState in RobotStateDisplay, this would even allow us to always handle RobotStateMessages asynchronously (by using threaded_nh_ and a std::mutex/std::lock_guard) which fixed my crashes so far. Will open a MoveIt PR as well ...

@rhaschke
Copy link
Contributor

rhaschke commented Mar 4, 2020

I think, the global callback queue is assumed to be processed by the main gui thread, while the threaded_nh_ uses a separate queue that needs to be processed by a separate spinner.
If you handle the global queue by a threaded spinner, you are not complying with this assumption.
If you have some events that you want to be handled in a thread, you should use the threaded_nh_ and ensure (using locks etc.) synchronization with the gui and rendering, which needs happen in the main thread.

@simonschmeisser
Copy link
Contributor Author

It could be triggered in Update/the rendering event (by a QTimer) and would then always be tied to the gui thread. I'll try that later and open a PR

@simonschmeisser
Copy link
Contributor Author

I opened a PR #1475

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants