Could you give me some clues on reproducing the CLIP feature alignment from DR.Robot? #3

Open
qrcat opened this issue Oct 26, 2024 · 4 comments


qrcat commented Oct 26, 2024

I'm trying to reimplement the "Text to Robot Pose with CLIP" experiment from the paper, but I haven't been able to reproduce the results.

I've been attempting to match the conditions outlined in the paper. When training the shadow hand model, I used:

python generate_robot_data.py --model_xml_dir mujoco_menagerie/shadow_hand --camera_distance_factor 0.4
python train.py --dataset_path data/shadow_hand --experiment_name shadow_hand --canonical_training_iterations 5000 --pose_conditioned_training_iterations 30_000

I also wrote a script for aligning CLIP features using the 🤗 openai/clip-vit-base-patch32 encoder.
The initial pose comes from get_canonical_pose in utils.mujoco_utils, and I optimize it with Adam.
The loss function is the negative dot product between the language and image embeddings:

loss = -torch.matmul(image_embedding, text_features.T.detach())
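
One thing worth checking when comparing starting values is how the embeddings are scaled. The sketch below is an assumption about the setup, not the paper's script: it L2-normalizes both embeddings and multiplies the cosine similarity by a fixed scale of 100, which is approximately the learned logit scale of the released CLIP checkpoints. Under that convention a cosine similarity of 0.24 corresponds to a loss of -24. The function name `clip_alignment_loss`, the fixed `logit_scale`, and the random 512-dimensional stand-in embeddings are hypothetical.

```python
import torch
import torch.nn.functional as F


def clip_alignment_loss(image_embedding, text_embedding, logit_scale=100.0):
    # Normalize so the dot product is a cosine similarity in [-1, 1].
    image_embedding = F.normalize(image_embedding, dim=-1)
    # The text side is a fixed target, so it is detached from the graph.
    text_embedding = F.normalize(text_embedding, dim=-1).detach()
    # Negative scaled similarity: minimizing it pulls the render toward the prompt.
    return -(logit_scale * (image_embedding @ text_embedding.T)).mean()


# Random stand-ins for the real CLIP outputs (assumption: 512-dim projections).
torch.manual_seed(0)
image_embedding = torch.randn(1, 512, requires_grad=True)
text_embedding = torch.randn(1, 512)

loss = clip_alignment_loss(image_embedding, text_embedding)
loss.backward()  # gradients reach the image embedding only
```

With unnormalized embeddings the raw dot product lives on a different scale, so starting values are only comparable when both sides use the same convention.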

I've noticed two oddities. First, the loss starts off much lower than what the paper reports. On the webpage the initial loss value is around -24, but my reproduction starts below -30. I suspect this is related to the prompts. Second, it is difficult to optimize toward the desired pose.
[Image: loss]
[Image: render]

Therefore, I would like to know more about the implementation details of this part, such as the optimizer settings or any additional tricks that were used. Could you please share them with me?

I'm looking forward to receiving your reply.

alpercanberk (Collaborator) commented

hi qrcat! thank you for your interest in our paper

  1. centering the hand in the image matters a lot. huggingface's clip preprocessing automatically resizes images to a square, so also make sure you aren't passing a long rectangle
  2. we found the optimization process to be a bit sensitive to learning rates, so i suggest playing around with those as well
  3. yes, the prompt matters, i suggest starting out with what we included in the paper
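
The first two suggestions can be sketched roughly as follows. This is not the authors' script, only a minimal illustration under stated assumptions: the render is a `(C, H, W)` tensor with values in [0, 1], the background is white (`fill=1.0`), and the pose is a flat tensor of joint values. The helpers `pad_to_square`, `to_clip_input`, and `optimize_pose`, the quadratic stand-in objective, and the three learning rates are all hypothetical.

```python
import torch
import torch.nn.functional as F


def pad_to_square(image, fill=1.0):
    """Pad a (C, H, W) tensor to (C, S, S) with S = max(H, W), content centered."""
    _, h, w = image.shape
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    # F.pad order for the last two dims is (left, right, top, bottom).
    # Constant padding stays differentiable with respect to the original pixels.
    return F.pad(image, (left, side - w - left, top, side - h - top), value=fill)


def to_clip_input(image, size=224):
    """Pad to square, then resize to size x size (224 for ViT-B/32)."""
    square = pad_to_square(image).unsqueeze(0)
    return F.interpolate(square, size=(size, size), mode="bilinear", align_corners=False)


def optimize_pose(loss_fn, initial_pose, lr, steps=200):
    """Run Adam on a copy of initial_pose and return (final pose, final loss)."""
    pose = initial_pose.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss_fn(pose).backward()
        optimizer.step()
    with torch.no_grad():
        final_loss = loss_fn(pose).item()
    return pose.detach(), final_loss


# Stand-in objective (assumption): in the real setup loss_fn would render the
# robot at `pose`, pass the render through to_clip_input and the CLIP image
# encoder, and return the negative similarity with the text embedding.
target = torch.full((5,), 0.5)
toy_loss = lambda pose: ((pose - target) ** 2).sum()

# Sweep a few learning rates from the same starting pose and compare final losses.
initial_pose = torch.zeros(5)
results = {lr: optimize_pose(toy_loss, initial_pose, lr)[1] for lr in (1e-3, 1e-2, 1e-1)}
best_lr = min(results, key=results.get)
```

Padding before resizing keeps the hand centered and undistorted instead of relying on the preprocessor to handle a long rectangle. CLIP's per-channel mean/std normalization would still need to be applied to the resized tensor before the encoder; it is omitted here.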

qrcat (Author) commented Oct 27, 2024

thanks for your answer☺️

jiangranlv commented

Hi, could you please share the script for this experiment? Many thanks!!!

jiangranlv commented Dec 4, 2024

> Hi, could you please share the script for this experiment? Many thanks!!!

You can email me at [email protected]
