RoboPearls

Editable Video Simulation for Robot Manipulation

ICCV 2025
Tao Tang, Likui Zhang, Youpeng Wen, Kaidong Zhang, Jia-Wang Bian, Xia Zhou, Tianyi Yan, Kun Zhan, Peng Jia, Hefeng Wu, Liang Lin, Xiaodan Liang

Method Overview

RoboPearls is an editable video simulation framework for robotic manipulation. It reconstructs photo-realistic scenes with semantic features from demonstration videos. Using a set of simulation operators, it then leverages multiple LLM agents to translate user commands into specific editing functions. Finally, it uses a VLM to analyze learning issues and generate corresponding simulation demands to enhance robotic performance.

Abstract

The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by advanced modules like Incremental Semantic Distillation (ISD) and 3D regularized NNFM Loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues to close the simulation loop for performance enhancement. We evaluate RoboPearls through extensive experiments on multiple datasets and benchmarks, including RLBench, COLOSSEUM, Ego4D, and Open X-Embodiment, which demonstrate satisfactory simulation performance.

Method

The RoboPearls Framework. (a) RoboPearls extends the Gaussian representation to reconstruct dynamic scenes with semantic features from demonstration videos. (b) RoboPearls includes and refines various simulation operators. (c) RoboPearls leverages multiple LLM agents to automate and streamline the simulation production process following user natural language commands.
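
As a rough illustration of part (c), the sketch below shows how a structured plan, such as one an LLM agent might emit after parsing a user command, could be dispatched to editing functions. Everything here is hypothetical and not taken from the RoboPearls code: the `Scene` stand-in, the operator names (`insert`, `remove`, `color`, `texture`), and the plan format are assumptions chosen for illustration, and the LLM call itself is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Scene:
    # Hypothetical stand-in for a reconstructed scene: object name -> attributes.
    objects: Dict[str, Dict[str, str]] = field(default_factory=dict)


# Registry mapping an operator name to the function that implements it.
OPERATORS: Dict[str, Callable[..., None]] = {}


def operator(name: str):
    """Decorator that registers an editing function under a name."""
    def register(fn):
        OPERATORS[name] = fn
        return fn
    return register


@operator("insert")
def insert_object(scene: Scene, target: str, **attrs: str) -> None:
    scene.objects[target] = dict(attrs)


@operator("remove")
def remove_object(scene: Scene, target: str) -> None:
    scene.objects.pop(target, None)


@operator("color")
def change_color(scene: Scene, target: str, color: str) -> None:
    scene.objects[target]["color"] = color


@operator("texture")
def change_texture(scene: Scene, target: str, texture: str) -> None:
    scene.objects[target]["texture"] = texture


def run_plan(scene: Scene, plan: List[dict]) -> Scene:
    """Apply each step of a plan in order; unknown operators are rejected."""
    for step in plan:
        args = dict(step)
        name = args.pop("op")
        if name not in OPERATORS:
            raise ValueError(f"unknown operator: {name}")
        OPERATORS[name](scene, **args)
    return scene


# Assumed plan for a command like "add a yellow block, paint the red block
# blue, then take the yellow block away".
scene = Scene({"red block": {"color": "red"}})
plan = [
    {"op": "insert", "target": "yellow block", "color": "yellow"},
    {"op": "color", "target": "red block", "color": "blue"},
    {"op": "remove", "target": "yellow block"},
]
run_plan(scene, plan)
print(scene.objects)
```

Keeping the operators in a registry means a plan is plain data that can be validated before anything is executed, which is one reasonable way to separate command interpretation from scene editing.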

Simulations


1. Photo-realistic simulations on in-the-wild datasets (Ego4D and Open X-Embodiment).

2. Various simulations on the RLBench dataset.

Simulation Operator Frameworks


1. The incremental semantic distillation pipeline.

2. The object removal pipeline.

3. The texture modification pipeline.

4. The 3D asset management pipeline.

Manipulation Results


1. RLBench.

2. Real world.

Comparisons


1. Key designs: (a) direct deletion vs. inpainting and fine-tuning, (b) direct insertion vs. refinement with libcom and fine-tuning, (c) NNFM loss vs. 3D regularized NNFM loss, (d) RGB space vs. CIELAB color space, and (e) direct semantic distillation vs. incremental semantic distillation.
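
To make comparison (d) more concrete: one plausible motivation for editing color in CIELAB rather than RGB is that CIELAB separates lightness (L*) from chromaticity (a*, b*), so a color can be swapped while shading is left alone. The sketch below is an assumption about why that helps, not the paper's operator, which this page does not describe. It uses the standard sRGB/D65 conversion formulas in pure Python; `recolor_lab` is a hypothetical helper name.

```python
# sRGB <-> CIELAB conversion with the D65 reference white, pure Python.
D65 = (0.95047, 1.0, 1.08883)
DELTA = 6 / 29


def _srgb_to_linear(c):
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _linear_to_srgb(c):
    c = min(max(c, 0.0), 1.0)  # clip out-of-gamut values before encoding
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055


def _f(t):
    return t ** (1 / 3) if t > DELTA ** 3 else t / (3 * DELTA ** 2) + 4 / 29


def _f_inv(t):
    return t ** 3 if t > DELTA else 3 * DELTA ** 2 * (t - 4 / 29)


def rgb_to_lab(rgb):
    """Convert an sRGB triple in [0, 1] to (L*, a*, b*)."""
    r, g, b = (_srgb_to_linear(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    fx, fy, fz = _f(x / D65[0]), _f(y / D65[1]), _f(z / D65[2])
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def lab_to_rgb(lab):
    """Convert (L*, a*, b*) back to an sRGB triple, clipped to [0, 1]."""
    L, a, b = lab
    fy = (L + 16) / 116
    x = D65[0] * _f_inv(fy + a / 500)
    y = D65[1] * _f_inv(fy)
    z = D65[2] * _f_inv(fy - b / 200)
    r = 3.2404542 * x - 1.5371385 * y - 0.4985314 * z
    g = -0.9692660 * x + 1.8760108 * y + 0.0415560 * z
    bl = 0.0556434 * x - 0.2040259 * y + 1.0572252 * z
    return tuple(_linear_to_srgb(c) for c in (r, g, bl))


def recolor_lab(src_rgb, target_rgb):
    """Hypothetical edit: keep the source lightness, take the target chroma."""
    L, _, _ = rgb_to_lab(src_rgb)
    _, a, b = rgb_to_lab(target_rgb)
    return (L, a, b)


# A shaded red pixel recolored toward blue keeps its original L* value.
shaded_red = (0.5, 0.1, 0.1)
edited = lab_to_rgb(recolor_lab(shaded_red, (0.2, 0.2, 0.8)))
print(edited)
```

Doing the same swap directly in RGB would overwrite all three channels at once and lose the shading carried by the original pixel, which may be part of why the two color spaces produce different results.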

2. Spatial-temporal consistency.

Videos

1. Real-world manipulation tasks.

Task 1: Pick up the red block
Task 2: Place the yellow block between the blue and red blocks
Task 3: Put the yellow block to the blue subregion
Baseline failure
RoboPearls

2. RLBench manipulation tasks.

Task 1: Stack Cups
Task 2: Close Box
Task 3: Basketball in Hoop
Task 4: Insert Peg
Task 5: Put in Cupboard
Task 6: Push Buttons
Baseline failure
RoboPearls

3. Various simulations.

Original Scene
Reconstruct
Insert & Physics
Remove
Texture
Color
Original Scene
Reconstruct
Insert
Remove
Texture
Color
Original Scene
Reconstruct
Insert
Remove
Texture
Color