Xiao Fu1, Xintao Wang2✉, Xian Liu1, Jianhong Bai3, Runsen Xu1,
Pengfei Wan2, Di Zhang2, Dahua Lin1✉
1The Chinese University of Hong Kong 2Kling Team, Kuaishou Technology 3Zhejiang University
✉: Corresponding Authors
🔥 RoboMaster synthesizes realistic robotic-manipulation videos given an initial frame, a prompt, a user-defined object mask, and a collaborative trajectory that describes the motion of both the robotic arm and the manipulated object across decomposed interaction phases. It supports diverse manipulation skills and generalizes to in-the-wild scenarios.
*(Teaser video: `teaser.mp4`)*
- Add inference code with checkpoints.
- Add training code.
- Add evaluation code.
- Add a Gradio demo to generate model inputs for in-the-wild images.
- Release the full training data.
- Our environment setup is identical to CogVideoX; you can refer to its configuration to complete the setup.

```bash
conda create -n robomaster python=3.10
conda activate robomaster
```
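Once the environment is activated and CogVideoX's dependencies are installed, a quick sanity check (nothing repo-specific, it only verifies that PyTorch sees the GPU):

```python
# Verify that PyTorch and CUDA are visible in the new environment.
import torch

print(torch.__version__, "| CUDA available:", torch.cuda.is_available())
```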
- Download `ckpts` from here and place it under the base root `RoboMaster`. The checkpoints are organized as follows:

```
├── ckpts
│   ├── CogVideoX-Fun-V1.5-5b-InP   (pretrained model base)
│   ├── RoboMaster                  (post-trained transformer)
```
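If you prefer scripting the download, here is a minimal sketch assuming the checkpoints are hosted on Hugging Face; both repo ids below are hypothetical placeholders, so substitute the actual links above:

```python
# Minimal download sketch (assumption: checkpoints hosted on Hugging Face).
# Both repo ids are hypothetical placeholders -- use the actual links above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="<base-model-repo-id>",         # CogVideoX-Fun-V1.5-5b-InP release
                  local_dir="ckpts/CogVideoX-Fun-V1.5-5b-InP")
snapshot_download(repo_id="<robomaster-transformer-id>",  # post-trained transformer
                  local_dir="ckpts/RoboMaster")
```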
- Robotic Manipulation on Diverse Out-of-Domain Objects:

```bash
python inference_inthewild.py \
    --input_path demos/diverse_ood_objs \
    --output_path samples/infer_diverse_ood_objs \
    --transformer_path ckpts/RoboMaster \
    --model_path ckpts/CogVideoX-Fun-V1.5-5b-InP
```
- Robotic Manipulation with Diverse Skills:

```bash
python inference_inthewild.py \
    --input_path demos/diverse_skills \
    --output_path samples/infer_diverse_skills \
    --transformer_path ckpts/RoboMaster \
    --model_path ckpts/CogVideoX-Fun-V1.5-5b-InP
```
- Long Video Generation in Auto-Regressive Manner:

```bash
python inference_inthewild.py \
    --input_path demos/long_video \
    --output_path samples/long_video \
    --transformer_path ckpts/RoboMaster \
    --model_path ckpts/CogVideoX-Fun-V1.5-5b-InP
```
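Conceptually, the auto-regressive scheme chains fixed-length clips by reusing the last generated frame as the next clip's initial frame. The sketch below is illustrative only; `generate_clip` is a hypothetical stand-in for the model call inside `inference_inthewild.py`, not an API of this repo:

```python
# Illustrative sketch of auto-regressive long-video generation: the last frame
# of each clip seeds the next. `generate_clip` is a hypothetical stand-in for
# the model call in inference_inthewild.py, not an actual API of this repo.
def generate_clip(first_frame, prompt, mask, trajectory):
    ...  # one RoboMaster forward pass -> list of frames

def generate_long_video(initial_frame, prompt, mask, trajectory_segments):
    frames = [initial_frame]
    for segment in trajectory_segments:   # collaborative trajectory, split per clip
        clip = generate_clip(frames[-1], prompt, mask, segment)
        frames.extend(clip[1:])           # drop the duplicated seed frame
    return frames
```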
- We fine-tune the base model on videos at a resolution of 640×480 with 37 frames, using 8 GPUs. During preprocessing, videos with fewer than 16 frames are excluded (a sketch of this filter follows the training command below).
```bash
cd scripts
bash train_injector.sh
```
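For reference, the <16-frame filter could be implemented roughly as follows; this is a sketch using OpenCV, the `data/**/*.mp4` path is a placeholder, and the repo's actual preprocessing may differ:

```python
# Sketch of the preprocessing filter that excludes videos with fewer than
# 16 frames (OpenCV-based; "data/**/*.mp4" is a placeholder path and the
# repo's actual preprocessing may differ).
import glob
import cv2

def enough_frames(path, min_frames=16):
    cap = cv2.VideoCapture(path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    return n >= min_frames

train_videos = [p for p in glob.glob("data/**/*.mp4", recursive=True)
                if enough_frames(p)]
```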
```
├── RoboMaster
│   ├── eval_metrics
│   │   ├── VBench
│   │   ├── common_metrics_on_video_quality
│   │   ├── eval_traj
│   │   ├── results
│   │   │   ├── bridge_eval_gt
│   │   │   ├── bridge_eval_ours
│   │   │   ├── bridge_eval_ours_tracking
```
- Download `eval_metrics.zip` from here and extract it under the base root.
- Generate `bridge_eval_ours`. (Note that the results may vary slightly across computing machines, even with the same seed; we provide reference files under `eval_metrics/results`.)

```bash
cd RoboMaster/
python inference_eval.py
```
- Generate `bridge_eval_ours_tracking`: install CoTracker3, then estimate tracking points with grid size 30 on `bridge_eval_ours` (see the sketch below).
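A minimal per-video tracking sketch using CoTracker3's `torch.hub` entry point with grid size 30, matching the setting above; the file names and the saved `.pt` format are placeholders, so check the CoTracker3 README for the exact API:

```python
# Sketch: estimate grid-30 tracks on one generated video with CoTracker3
# (torch.hub entry point from facebookresearch/co-tracker).
# File names and the saved .pt format are placeholders.
import torch
from torchvision.io import read_video

device = "cuda" if torch.cuda.is_available() else "cpu"
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline").to(device)

video, _, _ = read_video("bridge_eval_ours/sample_000.mp4", output_format="TCHW")
video = video[None].float().to(device)  # (B, T, C, H, W), values in [0, 255]
pred_tracks, pred_visibility = cotracker(video, grid_size=30)
torch.save({"tracks": pred_tracks.cpu(), "visibility": pred_visibility.cpu()},
           "bridge_eval_ours_tracking/sample_000.pt")
```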
- Evaluation of VBench metrics:

```bash
cd eval_metrics/VBench
python evaluate.py \
    --dimension aesthetic_quality imaging_quality temporal_flickering motion_smoothness subject_consistency background_consistency \
    --videos_path ../results/bridge_eval_ours \
    --mode=custom_input \
    --output_path evaluation_results
```

- Evaluation of FVD and FID metrics:

```bash
cd eval_metrics/common_metrics_on_video_quality
python calculate.py -v1_f ../results/bridge_eval_ours -v2_f ../results/bridge_eval_gt
python -m pytorch_fid eval_1 eval_2
```
- Estimation of TrajError metrics. (Note that we exclude the samples listed in `failed_track.txt` due to failed estimation by CoTracker3.)

```bash
cd eval_metrics/eval_traj
python calculate_traj.py \
    --input_path_1 ../results/bridge_eval_ours \
    --input_path_2 ../results/bridge_eval_gt \
    --tracking_path ../results/bridge_eval_ours_tracking \
    --output_path evaluation_results
```

- Check the visualization videos under `evaluation_results`. We blend the trajectories of the robotic arm and the object throughout the entire video for better illustration.
If you find this work helpful, please consider citing:
```bibtex
@article{fu2025robomaster,
  title   = {Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control},
  author  = {Fu, Xiao and Wang, Xintao and Liu, Xian and Bai, Jianhong and Xu, Runsen and Wan, Pengfei and Zhang, Di and Lin, Dahua},
  journal = {arXiv preprint arXiv:2506.01943},
  year    = {2025}
}
```