Xiao Fu1, Xintao Wang2✉, Xian Liu1, Jianhong Bai3, Runsen Xu1,
Pengfei Wan2, Di Zhang2, Dahua Lin1✉
1The Chinese University of Hong Kong 2Kling Team, Kuaishou Technology 3Zhejiang University
✉: Corresponding Authors
🔥 RoboMaster synthesizes realistic robotic-manipulation videos given an initial frame, a prompt, a user-defined object mask, and a collaborative trajectory that describes the motion of both the robotic arm and the manipulated object across decomposed interaction phases. It supports diverse manipulation skills and generalizes to in-the-wild scenarios.
*(Teaser video: `teaser.mp4`)*
- Add inference code with checkpoints.
- Add training code.
- Add evaluation code.
- Add a Gradio demo to generate model inputs for in-the-wild images.
- Release the full training data.
- Our environment setup is identical to CogVideoX; you can refer to its configuration to complete the setup.

```bash
conda create -n robomaster python=3.10
conda activate robomaster
```
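Once the environment is activated and CogVideoX's dependencies are installed, a quick sanity check (nothing repo-specific, it only verifies that PyTorch sees the GPU):

```python
# Verify that PyTorch and CUDA are visible in the new environment.
import torch

print(torch.__version__, "| CUDA available:", torch.cuda.is_available())
```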
- Download `ckpts` from here and place it under the base root `RoboMaster`. The checkpoints are organized as follows:

```
├── ckpts
│   ├── CogVideoX-Fun-V1.5-5b-InP   (pretrained model base)
│   ├── RoboMaster                  (post-trained transformer)
```
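If you prefer scripting the download, here is a minimal sketch assuming the checkpoints are hosted on Hugging Face; both repo ids below are hypothetical placeholders, so substitute the actual links above:

```python
# Minimal download sketch (assumption: checkpoints hosted on Hugging Face).
# Both repo ids are hypothetical placeholders -- use the actual links above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="<base-model-repo-id>",         # CogVideoX-Fun-V1.5-5b-InP release
                  local_dir="ckpts/CogVideoX-Fun-V1.5-5b-InP")
snapshot_download(repo_id="<robomaster-transformer-id>",  # post-trained transformer
                  local_dir="ckpts/RoboMaster")
```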
- Robotic Manipulation on Diverse Out-of-Domain Objects:

```bash
python inference_inthewild.py \
    --input_path demos/diverse_ood_objs \
    --output_path samples/infer_diverse_ood_objs \
    --transformer_path ckpts/RoboMaster \
    --model_path ckpts/CogVideoX-Fun-V1.5-5b-InP
```
- Robotic Manipulation with Diverse Skills:

```bash
python inference_inthewild.py \
    --input_path demos/diverse_skills \
    --output_path samples/infer_diverse_skills \
    --transformer_path ckpts/RoboMaster \
    --model_path ckpts/CogVideoX-Fun-V1.5-5b-InP
```
- Long Video Generation in Auto-Regressive Manner:

```bash
python inference_inthewild.py \
    --input_path demos/long_video \
    --output_path samples/long_video \
    --transformer_path ckpts/RoboMaster \
    --model_path ckpts/CogVideoX-Fun-V1.5-5b-InP
```
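Conceptually, the auto-regressive scheme chains fixed-length clips by reusing the last generated frame as the next clip's initial frame. The sketch below is illustrative only; `generate_clip` is a hypothetical stand-in for the model call inside `inference_inthewild.py`, not an API of this repo:

```python
# Illustrative sketch of auto-regressive long-video generation: the last frame
# of each clip seeds the next. `generate_clip` is a hypothetical stand-in for
# the model call in inference_inthewild.py, not an actual API of this repo.
def generate_clip(first_frame, prompt, mask, trajectory):
    ...  # one RoboMaster forward pass -> list of frames

def generate_long_video(initial_frame, prompt, mask, trajectory_segments):
    frames = [initial_frame]
    for segment in trajectory_segments:   # collaborative trajectory, split per clip
        clip = generate_clip(frames[-1], prompt, mask, segment)
        frames.extend(clip[1:])           # drop the duplicated seed frame
    return frames
```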
- We fine-tune the base model on videos at a resolution of 640×480 with 37 frames, using 8 GPUs. During preprocessing, videos with fewer than 16 frames are excluded (a sketch of this filter follows the training command below).
```bash
cd scripts
bash train_injector.sh
```
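For reference, the <16-frame filter could be implemented roughly as follows; this is a sketch using OpenCV, the `data/**/*.mp4` path is a placeholder, and the repo's actual preprocessing may differ:

```python
# Sketch of the preprocessing filter that excludes videos with fewer than
# 16 frames (OpenCV-based; "data/**/*.mp4" is a placeholder path and the
# repo's actual preprocessing may differ).
import glob
import cv2

def enough_frames(path, min_frames=16):
    cap = cv2.VideoCapture(path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    return n >= min_frames

train_videos = [p for p in glob.glob("data/**/*.mp4", recursive=True)
                if enough_frames(p)]
```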
```
├── RoboMaster
│   ├── eval_metrics
│   │   ├── VBench
│   │   ├── common_metrics_on_video_quality
│   │   ├── eval_traj
│   │   ├── results
│   │   │   ├── bridge_eval_gt
│   │   │   ├── bridge_eval_ours
│   │   │   ├── bridge_eval_ours_tracking
```
- Download `eval_metrics.zip` from here and extract it under the base root.
- Generate `bridge_eval_ours`. (Note that the results may vary slightly across computing machines, even with the same seed; we provide reference files under `eval_metrics/results`.)

```bash
cd RoboMaster/
python inference_eval.py
```
- Generate `bridge_eval_ours_tracking`: install CoTracker3, then estimate tracking points with grid size 30 on `bridge_eval_ours` (see the sketch below).
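A minimal per-video tracking sketch using CoTracker3's `torch.hub` entry point with grid size 30, matching the setting above; the file names and the saved `.pt` format are placeholders, so check the CoTracker3 README for the exact API:

```python
# Sketch: estimate grid-30 tracks on one generated video with CoTracker3
# (torch.hub entry point from facebookresearch/co-tracker).
# File names and the saved .pt format are placeholders.
import torch
from torchvision.io import read_video

device = "cuda" if torch.cuda.is_available() else "cpu"
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline").to(device)

video, _, _ = read_video("bridge_eval_ours/sample_000.mp4", output_format="TCHW")
video = video[None].float().to(device)  # (B, T, C, H, W), values in [0, 255]
pred_tracks, pred_visibility = cotracker(video, grid_size=30)
torch.save({"tracks": pred_tracks.cpu(), "visibility": pred_visibility.cpu()},
           "bridge_eval_ours_tracking/sample_000.pt")
```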
- Evaluation of VBench metrics:

```bash
cd eval_metrics/VBench
python evaluate.py \
    --dimension aesthetic_quality imaging_quality temporal_flickering motion_smoothness subject_consistency background_consistency \
    --videos_path ../results/bridge_eval_ours \
    --mode=custom_input \
    --output_path evaluation_results
```

- Evaluation of FVD and FID metrics:

```bash
cd eval_metrics/common_metrics_on_video_quality
python calculate.py -v1_f ../results/bridge_eval_ours -v2_f ../results/bridge_eval_gt
python -m pytorch_fid eval_1 eval_2
```
- Estimation of TrajError metrics. (Note that we exclude the samples listed in `failed_track.txt` due to failed estimation by CoTracker3.)

```bash
cd eval_metrics/eval_traj
python calculate_traj.py \
    --input_path_1 ../results/bridge_eval_ours \
    --input_path_2 ../results/bridge_eval_gt \
    --tracking_path ../results/bridge_eval_ours_tracking \
    --output_path evaluation_results
```

- Check the visualization videos under `evaluation_results`. We blend the trajectories of the robotic arm and the object throughout the entire video for better illustration.
If you find this work helpful, please consider citing:
```bibtex
@article{fu2025robomaster,
  title   = {Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control},
  author  = {Fu, Xiao and Wang, Xintao and Liu, Xian and Bai, Jianhong and Xu, Runsen and Wan, Pengfei and Zhang, Di and Lin, Dahua},
  journal = {arXiv preprint arXiv:2506.01943},
  year    = {2025}
}
```