How to add visual distractions? #685

Open
AlexandreBrown opened this issue Nov 6, 2024 · 7 comments

@AlexandreBrown

AlexandreBrown commented Nov 6, 2024

Hi,
I would like to change the textures (randomly or via a PNG file) of the various objects in the scene (e.g. before every new episode).
I managed to change the base_color, but when I change the textures, nothing happens.
Any pointers are appreciated.
The objective is to change textures, camera FOV, and lighting, and if possible add new objects, to evaluate methods for visual generalization.

base_color_texture = RenderTexture2D(
    "/home/user/Downloads/cliff_side_4k.blend/textures/cliff_side_diff_4k.jpg"
)
for actor_name in self.base_env.unwrapped.scene.actors.keys():
    for part in self.base_env.unwrapped.scene.actors[actor_name]._objs:
        for triangle in (
            part.find_component_by_type(sapien.render.RenderBodyComponent)
            .render_shapes[0]
            .parts
        ):
            # triangle.material.set_base_color([0.8, 0.1, 0.1, 1.0])
            triangle.material.set_base_color_texture(base_color_texture)

obs_dict, _ = self.base_env.reset()

PS: I do not know much about textures; I downloaded a sample file from https://polyhaven.com/a/cliff_side

@StoneT2000
Member

Is this your own custom environment? What environment is this exactly? And are you planning to use the GPU sim + rendering?

@AlexandreBrown
Author

AlexandreBrown commented Nov 7, 2024

Hi @StoneT2000, I am using SimplerEnv and TorchRL.
The code is a TorchRL env that wraps a SimplerEnv environment so it can be used through TorchRL's unified interface.
TorchRL env wrapper:

import torch
import numpy as np
from tensordict import TensorDict, TensorDictBase
from torchrl.envs import EnvBase
from torchrl.data import Composite, Unbounded, Bounded
from sapien.pysapien.render import RenderTexture2D
import sapien


class SimplerEnvWrapper(EnvBase):
    def __init__(self, base_env, **kwargs):
        super().__init__(**kwargs)
        self._device = torch.device(kwargs.get("device", "cpu"))
        self.base_env = base_env
        self.numpy_to_torch_dtype_dict = {
            bool: torch.bool,
            np.uint8: torch.uint8,
            np.int8: torch.int8,
            np.int16: torch.int16,
            np.int32: torch.int32,
            np.int64: torch.int64,
            np.float16: torch.float16,
            np.float32: torch.float32,
            np.float64: torch.float64,
        }
        self._make_specs()

    def _make_specs(self):
        raw_observation_spec = self.get_image_from_maniskill3_obs_dict(
            self.base_env, self.base_env.observation_space.spaces
        )
        height = raw_observation_spec.shape[-3]
        width = raw_observation_spec.shape[-2]
        self.channels = raw_observation_spec.shape[-1]
        shape = (height, width, self.channels)
        observation_spec = {
            "pixels": Bounded(
                low=torch.from_numpy(
                    raw_observation_spec.low[0, :, :, : self.channels]
                ).to(self._device),
                high=torch.from_numpy(
                    raw_observation_spec.high[0, :, :, : self.channels]
                ).to(self._device),
                shape=shape,
                dtype=torch.uint8,
                device=self._device,
            )
        }
        self.observation_spec = Composite(**observation_spec)

        action_space = self.base_env.action_space
        self.action_spec = Bounded(
            low=torch.from_numpy(action_space.low).to(self._device),
            high=torch.from_numpy(action_space.high).to(self._device),
            shape=action_space.shape,
            dtype=self.numpy_to_torch_dtype_dict[action_space.dtype.type],
            device=self._device,
        )

        self.reward_spec = Unbounded(
            shape=(1,), dtype=torch.float32, device=self._device
        )
        self.done_spec = Unbounded(shape=(1,), dtype=torch.bool, device=self._device)

    def get_image_from_maniskill3_obs_dict(self, env, obs, camera_name=None):
        if camera_name is None:
            if "google_robot" in env.unwrapped.robot_uids.uid:
                camera_name = "overhead_camera"
            elif "widowx" in env.unwrapped.robot_uids.uid:
                camera_name = "3rd_view_camera"
            else:
                raise NotImplementedError()
        img = obs["sensor_data"][camera_name]["rgb"]
        return img

    def _reset(self, tensordict: TensorDictBase = None):

        base_color_texture = RenderTexture2D(
            "/home/user/Downloads/cliff_side_4k.blend/textures/cliff_side_diff_4k.jpg"
        )
        for actor_name in self.base_env.unwrapped.scene.actors.keys():
            for part in self.base_env.unwrapped.scene.actors[actor_name]._objs:
                for triangle in (
                    part.find_component_by_type(sapien.render.RenderBodyComponent)
                    .render_shapes[0]
                    .parts
                ):
                    # triangle.material.set_base_color([0.8, 0.1, 0.1, 1.0])
                    triangle.material.set_base_color_texture(base_color_texture)

        obs_dict, _ = self.base_env.reset()

        rgb_obs = (
            self.get_image_from_maniskill3_obs_dict(self.base_env, obs_dict)[
                0, :, :, : self.channels
            ]
            .to(torch.uint8)
            .squeeze(0)
        )
        text_instruction = self.base_env.unwrapped.get_language_instruction()
        done = torch.tensor(False, dtype=torch.bool, device=self._device)
        terminated = torch.tensor(False, dtype=torch.bool, device=self._device)

        return TensorDict(
            {
                "pixels": rgb_obs,
                "text_instruction": text_instruction,
                "done": done,
                "terminated": terminated,
            },
            batch_size=[],
            device=self._device,
        )

    def _step(self, tensordict: TensorDictBase):
        action = tensordict["action"]
        obs_dict, reward, done, _, info = self.base_env.step(action)

        rgb_obs = (
            self.get_image_from_maniskill3_obs_dict(self.base_env, obs_dict)[
                0, :, :, : self.channels
            ]
            .to(torch.uint8)
            .squeeze(0)
        )
        text_instruction = self.base_env.unwrapped.get_language_instruction()

        return TensorDict(
            {
                "pixels": rgb_obs,
                "text_instruction": text_instruction,
                "reward": reward,
                "done": done,
            },
            batch_size=[],
            device=self._device,
        )

    def _set_seed(self, seed: int):
        self.base_env.seed(seed)

PS: I am not sure I am doing this right, should I apply the changes before the environment reset?
PS #2: Are there specific file requirements for the texture file? Do you have a test sample I can use as well? Or does any texture from publicly available texture websites work?

Where base_env is obtained using the ManiSkill3 gym integration:

from mani_skill.envs.sapien_env import BaseEnv
...

env_name = cfg["env"]["name"]

sensor_configs = dict()
sensor_configs["shader_pack"] = "default"

base_env: BaseEnv = gym.make(
    env_name,
    max_episode_steps=max_episode_steps,
    obs_mode="rgb+segmentation",
    num_envs=1,
    sensor_configs=sensor_configs,
    render_mode="rgb_array",
    sim_backend=cfg["env"]["device"],
)
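
For completeness, this is roughly how I then wrap it (just a sketch of my usage, nothing special):

# Rough usage sketch: wrap the ManiSkill3 env in the TorchRL wrapper above.
wrapped_env = SimplerEnvWrapper(base_env, device=cfg["env"]["device"])
td = wrapped_env.reset()  # returns a TensorDict with "pixels", "text_instruction", etc.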

I am testing the following existing environments from ManiSkill3 (using SimplerEnv):

  • PutCarrotOnPlateInScene-v1
  • PutSpoonOnTableClothInScene-v1
  • StackGreenCubeOnYellowCubeBakedTexInScene-v1
    (Since currently only these are supported by the SimplerEnv/ManiSkill3 integration.)

My goal is to leverage the flexibility of ManiSkill3/SimplerEnv and be able to:

  • Randomize colors
  • Randomize textures (eg: Assign random textures from the textures in the scene)
  • Assign specific textures (I want to come up with specific textures that are challenging and never seen during training for a visual RL generalization benchmark)
  • Change camera FOV & positions (again for visual generalization evaluation purposes)
  • Change lighting conditions
  • Load distracting objects or be able to load a scene with distractions like we could in ManiSkill2 (e.g. random extra objects in the scene)
  • Load an RGB overlay / video overlay

The more I can achieve from this list, the better.
Note that I am not familiar with Maniskill3 so I did not try to create anything custom yet.

Ideally I would like to apply these randomizations at the start of each episode.
I assume a video overlay would require a per-step update (if we treat a video as a sequence of frames where at each step we update the overlaid frame).
I understand that GPU vectorization probably makes these use cases much harder, in which case I would prefer to go for the low-hanging fruit first (e.g. randomizations that are only applied at the start of the episode, if that's easier).

Yes, I plan on using the GPU to improve simulation performance (FPS). I assume that sim_backend='cuda' is what needs to be set for this, but please feel free to tell me more about it. GPU vectorization is a strong motivation for me to use ManiSkill3 with SimplerEnv (via their ManiSkill3 branch) instead of the existing ManiSkill2/SimplerEnv.
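
For reference, this is roughly how I expect to turn on the GPU backend, based on the gym.make call above. This is only a sketch: I have not verified which strings sim_backend accepts, and num_envs=16 is an arbitrary choice.

import gymnasium as gym

# Sketch only: same call as above, but requesting the GPU sim backend and
# several parallel environments. Observations then come back batched with a
# leading num_envs dimension, which is why the wrapper indexes [0].
base_env = gym.make(
    env_name,
    max_episode_steps=max_episode_steps,
    obs_mode="rgb+segmentation",
    num_envs=16,  # parallel environments simulated on the GPU
    sensor_configs=dict(shader_pack="default"),
    render_mode="rgb_array",
    sim_backend="cuda",  # assumption: a CUDA/GPU backend string selects the GPU sim
)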

@StoneT2000
Member

Thanks for the extensive notes; all of what you suggest is possible, but it depends a bit on which models you actually want to evaluate.

  • In particular do you plan to train a model and evaluate it? Or evaluate off the shelf models?
  • How realistic do you want the environment to look? Are you planning to try vision based sim2real or just do real2sim evaluation of a model trained on real world data?

There are two ways forward. The easiest option actually is to build a new table-top environment (take one of the templates or e.g. the pick cube environment) and add the parallelizations / randomizations you want for a custom environment. Only choose this option if you don't need to verify real2sim alignment and simply want a controllable robot and objects.
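
A very rough, untested sketch of that first option (the module path, the register_env decorator, and the actors.build_cube helper are from memory of the custom task tutorial, so double check the names against the docs):

import sapien
# Assumed import paths; verify against the ManiSkill3 custom task tutorial.
from mani_skill.envs.tasks.tabletop.pick_cube import PickCubeEnv
from mani_skill.utils.registration import register_env
from mani_skill.utils.building import actors


@register_env("PickCubeWithDistractor-v1", max_episode_steps=50)
class PickCubeWithDistractorEnv(PickCubeEnv):
    def _load_scene(self, options: dict):
        # Build the original task first, then add an extra, task-irrelevant
        # object purely as a visual distractor.
        super()._load_scene(options)
        self.distractor = actors.build_cube(
            self.scene,
            half_size=0.02,
            color=[0.1, 0.8, 0.1, 1.0],
            name="distractor_cube",
            initial_pose=sapien.Pose(p=[0.1, 0.1, 0.02]),
        )

From there you can randomize the distractor's pose/color in the episode initialization hook the same way the built-in tasks do.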

Alternatively, you can copy the code for the bridge dataset digital twins and modify the attributes in there to change the default RGB overlays, swap the overlay at each timestep when using video, modify the scene loader to add distractor objects, etc.

https://github.com/haosulab/ManiSkill/blob/56dcd4cf1b1f04b7e7dfd82ec625c8428ce1f801/mani_skill/envs/tasks/digital_twins/bridge_dataset_eval (copy both).

Let me know which option you think is needed and I can suggest the relevant docs/code to do what you want.

@AlexandreBrown
Author

AlexandreBrown commented Nov 11, 2024

Thanks a lot @StoneT2000 for the amazing reply!

do you plan to train a model and evaluate it? Or evaluate off the shelf models?

I plan on training and evaluating models (training from scratch).

How realistic do you want the environment to look? Are you planning to try vision based sim2real or just do real2sim evaluation of a model trained on real world data?

Are you planning to try vision based sim2real

Yes.

I want to train in simulation using an environment that is as visually realistic as possible, but if this hinders training time I'm open to a hybrid approach where the training environments are slightly less realistic (e.g. without ray tracing) to boost collection speed, and the visual generalization benchmark is more realistic and slower.
Basically, I will need to train agents from scratch in simulation and then, once trained, evaluate the approach using aggressive visual domain randomization (aggressive sim2real visual changes like random camera FOV, random object colors, random textures, random lighting, and random objects if feasible). The model will only depend on image observations (RGB pixels) and will be trained in an online RL fashion.
I am focused on an approach that shows generalization over visual distractions, so the more visual distractions I can showcase, the better.

The easiest option actually is to build a new table-top environment (take one of the templates or e.g. the pick cube environment) and add the parallelizations / randomizations you want for a custom environment.

This sounds interesting, as I want not just one environment but at least 2-3 that show increasing levels of difficulty (e.g. easy to hard).
Is it easier to create an environment from scratch or to start from an existing one? Context: I have very little experience in environment design. Where can I find a template and documentation for this?
When you say "add the parallelizations", what do you mean exactly?

@AlexandreBrown AlexandreBrown changed the title How to change textures? How to add visual randomization? Nov 11, 2024
@AlexandreBrown AlexandreBrown changed the title How to add visual randomization? How to add visual distractions? Nov 11, 2024
@AlexandreBrown
Author

AlexandreBrown commented Nov 18, 2024

@StoneT2000 After looking at the docs for ManiSkill3, I'm tempted to use ManiSkill3 directly instead of SimplerEnv. Would it be feasible to use ManiSkill3 directly while also being able to add the visual distractions?

Any help is appreciated!

@StoneT2000
Member

StoneT2000 commented Nov 23, 2024

Yes, using ManiSkill3 directly is likely better. SIMPLER's setup is based on ManiSkill2 and is not well parallelized, whereas ManiSkill3 is. Moreover, SIMPLER is real2sim only; it is not designed for sim2real.
Re: the realism aspect, in my opinion it's not clear whether photorealistic rendering with ray tracing actually helps; the simple/default rendering quality may give the same results. Ray tracing out of the box is insufficient, and you need to do a lot of tuning around lighting (basically being a Blender artist) and texture modeling to get actual photorealism. The nice PR videos from various labs using ray-traced sims like Isaac or ManiSkill are often heavily tuned in Blender.

What kind of visual distractions were you thinking? Like spawning irrelevant objects on a table? (That is possible; just follow the custom tasks tutorial on loading objects and initializing them.)

As for more parallel domain randomizations, https://maniskill.readthedocs.io/en/latest/user_guide/tutorials/domain_randomization.html details some of them.

I will mark this issue as a request to add more docs (e.g. randomizing textures, robot controller PD parameters, and more).
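
In the meantime, per-episode color randomization can be done with the same SAPIEN handles you are already using; an untested sketch along those lines (CPU sim; the GPU-parallel equivalents are covered in the domain randomization page):

import numpy as np
import sapien


def randomize_actor_colors(env, rng: np.random.Generator):
    # Untested sketch: reuses the same traversal as your snippet above
    # (scene.actors -> RenderBodyComponent -> render_shapes -> parts) and
    # assigns a random base color to every render part of every actor.
    # Call it right before env.reset() for per-episode randomization.
    for actor in env.unwrapped.scene.actors.values():
        for obj in actor._objs:
            body = obj.find_component_by_type(sapien.render.RenderBodyComponent)
            if body is None:
                continue
            for shape in body.render_shapes:
                for part in shape.parts:
                    rgb = rng.uniform(0.0, 1.0, size=3).tolist()
                    part.material.set_base_color(rgb + [1.0])

e.g. call randomize_actor_colors(base_env, np.random.default_rng(seed)) before each reset.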

@AlexandreBrown
Author

AlexandreBrown commented Nov 23, 2024

Thanks for the detailed answer @StoneT2000, I will use ManiSkill3 directly then, and I will focus on the simple/default rendering quality, which is already fairly good anyway.

What kind of visual distractions were you thinking?

I would like the following visual randomizations:

  • Randomize/Set colors of the objects in the scene (e.g. change a table to a specific or random RGB color, change the walls, etc.).
  • Randomize/Set textures (e.g. assign random textures from the textures in the scene or from files on disk).
    • I want to come up with specific textures that are challenging and never seen during training, for a visual RL generalization benchmark.
  • Change camera FOV & positions (again for visual generalization evaluation purposes).
  • Change lighting conditions (randomize/set variables that affect the lighting).
  • Load distracting objects or be able to load a scene with distractions like we could in ManiSkill2 (e.g. load extra objects in the scene).
    • Ideas: load objects from the same scene as extra objects (e.g. an extra cube), load a random object from another scene/env, load a custom object from a file on disk.
  • Load an RGB overlay: change the texture of walls/tables/objects (to be determined) to use an image from a file on disk.
  • Load a video overlay: change the texture of walls/tables/objects (TBD) to use frames from a video file on disk; maybe to reduce overhead we can change the texture only every k steps to the next frame in the video? Then we can experiment with k = 1 vs k > 1 (a rough sketch of this is right after this list).
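
A rough sketch of the per-k-step texture swap I have in mind, assuming the video frames are pre-extracted to numbered PNG files (e.g. with ffmpeg) and reusing RenderTexture2D / set_base_color_texture from my earlier snippet; untested, and the material handles would be collected the same way as in that snippet:

import glob
import itertools

from sapien.pysapien.render import RenderTexture2D


class VideoTextureCycler:
    # Cycles a set of render materials through pre-extracted video frames.
    # Sketch only: assumes swapping base color textures at runtime takes
    # effect, which is exactly what I could not get working above.
    def __init__(self, frames_dir: str, materials, k: int = 1):
        frame_paths = sorted(glob.glob(f"{frames_dir}/*.png"))
        # itertools.cycle caches the textures after the first pass, so every
        # frame is loaded from disk exactly once.
        self._frames = itertools.cycle(RenderTexture2D(p) for p in frame_paths)
        self._materials = materials
        self._k = k
        self._step = 0

    def maybe_advance(self):
        if self._step % self._k == 0:
            texture = next(self._frames)
            for material in self._materials:
                material.set_base_color_texture(texture)
        self._step += 1

Then maybe_advance() would be called once per env step, with k = 1 for per-frame updates and k > 1 to reduce overhead.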

I will mark this issue as a request to add more docs

Makes sense; these might already be possible, and all that is needed is some documentation improvement.
