Robotic Grasping with 3D Visual Observations #362
-
@AndrejOrsula thanks a lot for the report and, first off, congrats on the great work! It's really rewarding to see projects like yours working this well both in simulation and in the real world 🚀 Sorry for commenting this late, but this discussion somehow slipped under my radar and I just found it.
This is the type of feedback we are looking for from our community. The use cases of RL, even when limited to the robot learning domain, are quite varied, and posts like this are very helpful to us and to external contributors for focusing the development. I'd encourage all downstream users to provide similar feedback, it's much appreciated!
This is definitely our current biggest limitation, and we are well aware of it (#199, #249, #287). Unfortunately, the first attempt to overcome it, #249, was too hacky and performance was really bad. At the moment, there is no clean solution beyond what you implemented, which is pretty smart and somewhat similar to what @FirefoxMetzger prototyped in #287 (comment) and FirefoxMetzger/ropy. Some fresh air on the problem could come from gazebosim/gz-sim#793, but it's too early to tell.
Using a middleware, and IPC in general, is a valid solution. ROS has nice resources ready to be used. In general, I tend to prefer solutions that do not involve any network transport due to reproducibility problems (and it's just easier doing everything in the same Python code). For motion control, using Ignition plugins and custom controllers (similar to
I would have taken a very similar approach. I believe that this is the most straightforward path, and kudos for all the model processing that is often a pretty tedious task!
This is a shared curse, I relate very much. Welcome to the club 😄 I just commented on #363, which you might find interesting. Currently, enabling the contact system makes the simulation much slower. Contact-rich scenarios like yours, or like mine for bipedal locomotion, are currently quite slow with DART. Considering that manipulation is a much lighter task, I really envy your RTF! Mine is around 30% 😅 Concurrent execution definitely mitigates the problem, but it's a workaround rather than a solution. We cannot do much here; maybe new physics backends (gazebosim/gz-physics#153, leggedrobotics/raisimLib#40) might help.
I'm really happy that sim2real was smooth in your case; not every researcher in this domain is so lucky :) I'm not sure how you implemented the real-time execution; our approach, not yet fully finalized nor tested, is #94. In general, guaranteeing safety is still a pretty open research question in robot learning.
Allowing the user to specify the number of iterations is a relatively simple feature to add. Right now, in order to simplify the understanding of the different rates (physics, controllers, simulator), we decided to keep the number of steps per run fixed to 1.
It is not yet supported, but there is some recent related upstream activity in gazebosim/gz-sim#515. Of course, being able to do it from the APIs would be better, but using transport is a workaround that already works.
Yes, this is more related to upstream; I don't have enough knowledge of the sensors / rendering stack to comment. I'm not sure if these parameters can be changed dynamically, but if it's possible, a user-commands approach similar to what was done with the lights could be a possible implementation.
The randomizer is just a […]

To conclude, and for the record, for those reading this comment: this month (June 2021) @AndrejOrsula will present his work at the community meeting (good luck!), and the presentation will complement his description above and the repos.
-
... actually, this overhead could be a lot larger than you'd initially expect. I've noticed that Ignition's communication layer is, at least for camera images (I assume RGB-D is similar, but take this with some salt), quite horrible. The (protobuf) message contains a raw, uncompressed image. The exact handling is machine-specific, but it involves at least one copy of the data. Certainly not ideal for raw image data, even at low framerates like 30 Hz. To put this into context, the render of a ~10 s simulation takes about 35 s on my desktop (AMD Ryzen 7 5800X, RTX 3070), of which 30 s are spent on rendering and a bit less than 5 s on physics. If you send that data through a bridge to ROS 2, you essentially double the message overhead, because the bridge has to deserialize every Ignition Transport (zmq + protobuf) message and re-serialize it for ROS 2. Aka, Ignition makes a copy as the message is sent off. This could be solved elegantly by Ignition and/or ROS letting us reconfigure the communication to use zmq's zero-copy mechanisms.
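To illustrate what I mean by zero-copy, here is a minimal pyzmq sketch (the endpoint and image size are made up, and this is of course not Ignition's actual code): the send hands over the existing buffer instead of duplicating it, and the inproc transport skips the network stack entirely.

```python
import numpy as np
import zmq

ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.bind("inproc://camera")  # in-process transport: no network stack involved

sub = ctx.socket(zmq.SUB)
sub.connect("inproc://camera")  # inproc sockets must share the same Context
sub.setsockopt(zmq.SUBSCRIBE, b"")

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # raw RGB image buffer
# copy=False hands the existing buffer to zmq instead of copying it into
# the outgoing message.
pub.send(frame, copy=False)
received = sub.recv()
```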
@AndrejOrsula I may have missed a feature here, but is it possible to do "on-demand rendering" of sensors in Ignition?
Random sidenote: ropy.transform offers tf2-like functionality in Python. It started from my desire to express coordinate transformations, and in particular projections, in a style similar to graph computations in frameworks like TensorFlow or PyTorch. Now ropy has a module to do tf2-style coordinate transformations in N dimensions. The cool thing is that, since it builds on top of numpy, you get full interoperability with the entire scientific Python stack (e.g., dask for multi-threading or cluster use).
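As a plain-numpy illustration of the idea (this is just the underlying math, not ropy's actual API): frames compose as 4x4 homogeneous matrices, and a chain of them pushes a point from one frame into another.

```python
import numpy as np

def transform(rotation, translation):
    """Build a 4x4 homogeneous matrix from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

# Two frames in a tf2-style chain: world <- robot <- camera.
world_T_robot = transform(np.eye(3), [1.0, 0.0, 0.0])
robot_T_camera = transform(np.eye(3), [0.0, 0.2, 0.5])

# Express a camera-frame point in the world frame by composing the chain.
point_camera = np.array([0.1, 0.0, 1.0, 1.0])  # homogeneous coordinates
point_world = world_T_robot @ robot_T_camera @ point_camera
```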
In my (somewhat limited) experience, this is either because the environment is slow or because the chosen RL algorithm is not computationally efficient. It would be interesting to see some profiling runs of the full pipeline to get a sense of it. My experience with A3C is that filling the replay buffer with "fresh" actions tends to be the most time-consuming part, either because the environment is slow by default or because the rollout loop is not efficiently implemented. While toying with Expert Iteration I found the same thing: the rollout was the bottleneck, even though in this case I can rule out the environment, because it was small and rather optimized (iirc it accounted for less than 10% of the runtime in my profiles).

Another factor is the choice of TF vs torch. Theoretically, it doesn't matter much, but in practice I see a lot of people shooting themselves in the foot with torch, artificially bottlenecking their pipeline by constantly synchronizing the GPU and the CPU. (Disclaimer: We are only one robotics group/lab, and the majority of our division does research on supervised image processing, which biases my experience. That said, you can equally shoot yourself in the foot with this in supervised learning or RL.)
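To make that foot-gun concrete, here is a minimal torch sketch: calling .item() on a GPU tensor inside a tight loop forces a device synchronization on every iteration, while batching the readback synchronizes only once.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Anti-pattern: .item() blocks until the GPU finishes, every single iteration.
values = []
for _ in range(1000):
    v = torch.randn(256, device=device).sum()
    values.append(v.item())  # CPU waits for the GPU here, 1000 times

# Better: keep intermediate results on-device and read back once at the end.
vs = [torch.randn(256, device=device).sum() for _ in range(1000)]
total = torch.stack(vs).sum().item()  # single synchronization point
```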
@AndrejOrsula I'm a bit curious about this as well. Last I checked, increasing the RTF above 100% would speed up physics but keep all the sensors at the original speed, which would cause synchronization issues between controllers, sensors, and the environment. Has this been a problem for you? (Maybe the behavior has changed since then.)
-
Sorry, my comment wasn't meant as a criticism of your choice of algorithm. It's more a general statement that most (all?) RL algorithms come with quite high big-O complexity, low sample efficiency, and lots of blocks with sequential dependencies.
@AndrejOrsula Did you write the octree code for the CPU while the rest of the network runs on the GPU? If so, that could explain the high execution time.
This is indeed one of the ways to shoot yourself in the foot with torch, though it mainly applies to supervised learning and not so much to RL, unless you save the replay buffer to disk instead of keeping it in memory.
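A minimal sketch of the in-memory alternative (the sizes and names here are made up): keep the buffer as plain CPU tensors and copy only each sampled batch to the GPU.

```python
import torch

capacity, obs_dim, batch_size = 100_000, 64, 256
observations = torch.empty(capacity, obs_dim)  # lives in RAM, never touches disk

def sample_batch(device="cuda"):
    # Draw a random minibatch and move just that batch to the GPU.
    idx = torch.randint(0, capacity, (batch_size,))
    return observations[idx].to(device)  # one host-to-device copy per batch
```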
Hm, I always thought JAX was part of TensorFlow. I will have to double-check that.
That sounds really good. I guess I will try to bump my RTF/physics speed again and see how it performs. Thanks for the information.
-
Hello everyone. First of all, thank you very much for your effort that made gym-ignition possible. Here is how I used it in my latest project for my Master's Thesis (GitHub repo).
Short Description
I investigated the applicability of DRL for vision-based robotic grasping of diverse objects. However, instead of the traditionally used RGB/RGB-D images (2D/2.5D), I tried employing octrees (3D) to learn an end-to-end policy, with the aim of seeing whether they bring any benefits. The goal of the agent is to solve a very simple episodic task, which involves grasping any object from the workspace through continuous actions in Cartesian space, i.e. translational gripper displacement, yaw rotation, and closing/opening of the gripper. For RL, I utilised model-free off-policy actor-critic algorithms from stable-baselines3 (TD3, SAC and TQC). All training was performed inside simulation, but Sim2Real transfer to a real robot was also tested.
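For reference, the continuous action space described above could be expressed in gym roughly as follows (the exact bounds and ordering are my assumptions, not necessarily what the repo uses):

```python
import numpy as np
from gym import spaces

# [dx, dy, dz, yaw, gripper]: translational displacement, yaw rotation,
# and a close/open command for the gripper.
action_space = spaces.Box(
    low=np.array([-1.0, -1.0, -1.0, -np.pi, -1.0], dtype=np.float32),
    high=np.array([1.0, 1.0, 1.0, np.pi, 1.0], dtype=np.float32),
)
```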
I will just mention some specific parts related to Ignition/gym-ignition that could be discussed further. For many of these, there might be a much better solution that I have not found and/or did not think of. Some of these also relate to the approach of fully using Python, which was selected to simplify integration with different modules due to the limited time I had for the project and my lack of prior experience with DL/RL. I believe it would be much easier to handle many of these issues with a lower-level language.
Sensors (RGB-D Camera)
Due to the current limitations of #249, I decided to try a different approach to getting data from sensors. First, I attempted to parse the output of an `ign topic -e -t ...` subprocess, but that was just too hacky. Instead, I use ROS 2 and convert all messages between Ignition Transport and ROS 2 via ros_ign_bridge. Although it causes some overhead, it is simple to implement and provides additional benefits such as the use of RViz 2 and other useful tools like tf2. Besides the reduced determinism of the simulation, the inability to trigger image capture is by far the largest disadvantage. Therefore, the camera framerate needs to be set higher than the update rate of the agent (e.g. 4x), while also making sure that a new observation is received after each update (extra steps that are potentially not necessary). Not a huge deal, but definitely very far from ideal.
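A minimal rclpy sketch of this wait-for-a-fresh-observation pattern (the topic and method names are made up, and this is not necessarily how the repo implements it):

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image


class CameraSubscriber(Node):
    """Subscribes to the bridged camera topic and can block until a frame
    captured after a given simulation time arrives."""

    def __init__(self, topic="/rgbd_camera/image"):
        super().__init__("camera_subscriber")
        self._last_msg = None
        self.create_subscription(Image, topic, self._on_image, 1)

    def _on_image(self, msg):
        self._last_msg = msg

    def wait_for_fresh_image(self, min_stamp_ns):
        # Spin until an image newer than the last environment update arrives.
        while True:
            rclpy.spin_once(self, timeout_sec=0.1)
            msg = self._last_msg
            if msg is not None:
                stamp_ns = msg.header.stamp.sec * 10**9 + msg.header.stamp.nanosec
                if stamp_ns >= min_stamp_ns:
                    return msg


rclpy.init()
camera = CameraSubscriber()
# After each env update:
#   image = camera.wait_for_fresh_image(min_stamp_ns=last_step_time_ns)
```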
Robot Controller and Motion Planning
For motion planning, I just used MoveIt 2 to generate joint trajectories based on the selected actions. To execute these trajectories, the JointTrajectoryController system plugin is used. Once I have time to try out ign_ros2_control, I will probably transition to it in order to simplify the setup for arbitrary robot models and to make it possible to employ the same interface on real robots (once their ros2_control implementation is ready). Similar to the sensors, ROS 2 with ros_ign_bridge is used to facilitate the communication.
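A rough rclpy sketch of this communication path (the topic, joint names, and values are placeholders): publish a trajectory_msgs/JointTrajectory over ROS 2 and let ros_ign_bridge relay it to the JointTrajectoryController plugin.

```python
import rclpy
from rclpy.node import Node
from trajectory_msgs.msg import JointTrajectory, JointTrajectoryPoint
from builtin_interfaces.msg import Duration

rclpy.init()
node = Node("trajectory_publisher")
pub = node.create_publisher(JointTrajectory, "/joint_trajectory", 10)

# A single-waypoint trajectory: reach the target positions within 2 seconds.
msg = JointTrajectory()
msg.joint_names = ["joint_1", "joint_2"]
point = JointTrajectoryPoint()
point.positions = [0.0, 0.5]
point.time_from_start = Duration(sec=2)
msg.points = [point]

pub.publish(msg)  # bridged to Ignition Transport by ros_ign_bridge
```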
More on ROS 2
I did not create any Runtime for ROS 2 because I was not exactly sure how that would look. For me, it was easier to just create ROS 2 nodes as individual submodules that are part of the Task object, e.g. one for the camera subscriber and another for MoveIt 2 requests.
Dataset
The Google Scanned Objects collection from Fuel and a bunch of free PBR textures for the ground plane are used. For all the different models, a single RandomObject class is used (I am not sure whether there is a better way). Because these models do not have inertial properties associated with them, they are estimated from the mesh and a random mass. Their collision geometry also needs to be decimated, otherwise stepping the simulation is unbearably slow.
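For illustration, such an estimate can be computed with a mesh library like trimesh (the library choice, mass range, and file name are my assumptions, not necessarily what the repo uses):

```python
import numpy as np
import trimesh

mesh = trimesh.load("model.obj", force="mesh")
mass = np.random.uniform(0.05, 0.5)   # random mass in kg (assumed range)
scale = mass / mesh.mass              # trimesh defaults to density = 1
inertia = mesh.moment_inertia * scale # 3x3 inertia tensor, scaled to the mass
center_of_mass = mesh.center_mass
```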
Performance
Training is definitely a very time-consuming process. On my laptop (130 W), 500k time steps take approximately three days to complete, albeit a large portion of that is the DL itself. I am barely approaching ~200% RTF when stepping the full environment (4 objects, rendering, random actions) with a step size of 4 ms. It is a bit better for a single object, at ~350% RTF. That is with low-poly collision geometry, which is actually disabled for the lower links of the robot. Such slow training makes hyperparameter tuning especially painful (both manual and automatic). Parallelised environments would definitely help and make the environments more scalable!
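For reference, this is roughly what parallelised environments look like with stable-baselines3's SubprocVecEnv, assuming the environment could run as multiple concurrent simulator instances (the env id is hypothetical):

```python
import gym
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env(rank):
    def _init():
        return gym.make("Grasping-v0")  # hypothetical env id
    return _init

if __name__ == "__main__":
    # Four environment instances, each stepping in its own subprocess.
    vec_env = SubprocVecEnv([make_env(i) for i in range(4)])
    obs = vec_env.reset()
```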
Sim2Real
This part was relatively painless. I added a very simple runtime for it. It is a quick-and-dirty solution that allows evaluation of trained agents on a real robot (no training). Most aspects of the RL loop are manual, including determining success and resetting the environment. It also allows manual stepping as a first step when testing the transfer, to make sure nothing breaks. I cannot guarantee its safety, though.
Other Thoughts
It would be useful to be able to specify the number of simulation iterations performed per environment step, i.e. a configurable `stepsPerRun`. I can see several use-cases for it, such as tasks with actions that trigger action primitives (e.g. a discrete pixel-wise action space that determines the grasp pose) and all tasks defined as semi-MDPs. It is also useful during environment reset whenever the agent needs to wait for the simulation to reach a certain condition before continuing (something that should not/cannot be set manually). Related to Moving code from Python to C++: high-level overview #304.
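As a rough gym-level approximation of the idea (a hypothetical wrapper, not gym-ignition's API): repeat the underlying simulation step a configurable number of times per agent action.

```python
import gym

class MultiStepWrapper(gym.Wrapper):
    """Advance the environment several iterations per agent action."""

    def __init__(self, env, steps_per_run=4):
        super().__init__(env)
        self.steps_per_run = steps_per_run

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.steps_per_run):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```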
The source code of this project is kinda messy right now. If I have time in the future, I might refactor it, split it into separate submodules, and possibly rewrite it in a lower-level language (I do not normally use Python and I still kinda dislike it :D). Let me know if there is something of interest from this project that you would like to include in gym-ignition. I could then extract it and open a PR here, if desired.