ROSE: Object Removal in Videos Powered by Synthetic 3D Data

Generative models have made impressive progress in video editing and manipulation, but one hard problem remains: removing an object completely. That means removing not only the object itself but also the side effects it creates, such as shadows, reflections, illumination changes, translucency effects, and mirror images.

The recent work ROSE (Remove Objects with Side Effects in Videos), published in August 2025 by researchers from Zhejiang University, KunByte AI, Peking University, and The University of Hong Kong, shows that this challenge can be addressed effectively by training the model entirely on synthetic data generated in controlled 3D environments.

Paired video preparation pipeline using 3D data, which can be divided into scene and object sampling, multi-view generation with masks, valid view filtering, and video data rendering. Source: the ROSE paper.

What is ROSE?

ROSE is a framework for object removal in videos, built on a Diffusion Transformer (DiT) inpainting model.

The key innovation is that training does not rely on real paired videos (with and without objects) — which are nearly impossible to collect at scale.
Instead, the authors built an automatic synthetic pipeline to generate realistic 3D videos with full control over:

  • Object presence or absence.
  • Side effects like shadows, reflections, and translucency.
  • Lighting, camera angles, and trajectories.

This approach enabled the creation of a rich, paired, and highly varied dataset that simply cannot be obtained in the real world.
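To make the idea of a "paired" sample concrete, here is a minimal sketch of how such a pair could be described before rendering. Everything in it is an assumption for illustration: the `RenderSpec` fields, the value ranges, and the `sample_paired_specs` helper are hypothetical and are not taken from the ROSE pipeline. The only point it shows is that the two clips share every setting except object visibility.

```python
from dataclasses import dataclass, replace
import random


@dataclass(frozen=True)
class RenderSpec:
    """Hypothetical description of one rendered clip (field names are illustrative)."""
    scene_id: str
    object_id: str
    light_intensity: float
    camera_path: str
    object_visible: bool


def sample_paired_specs(scenes, objects, camera_paths, rng):
    """Sample one scene setup and return (with_object, without_object) specs."""
    with_object = RenderSpec(
        scene_id=rng.choice(scenes),
        object_id=rng.choice(objects),
        light_intensity=rng.uniform(0.5, 2.0),  # assumed range, for illustration
        camera_path=rng.choice(camera_paths),
        object_visible=True,
    )
    # The second spec is identical except that the object is hidden, so any
    # pixel difference between the two renders is caused by the object alone.
    without_object = replace(with_object, object_visible=False)
    return with_object, without_object


rng = random.Random(0)
with_obj, without_obj = sample_paired_specs(
    scenes=["kitchen", "street"],
    objects=["mug", "car"],
    camera_paths=["orbit", "dolly"],
    rng=rng,
)
print(with_obj)
print(without_obj)
```

Because lighting and camera trajectory are shared, a renderer fed these two specs would produce clips that differ only where the object and its side effects appear.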

On top of that, ROSE introduces two core ideas:

  1. Difference mask supervision – the model explicitly predicts the areas affected by removal (e.g., disappearing shadows), focusing exactly where corrections are needed.
  2. Reference-based erasing – the entire video, including the object region, is fed to the model as a reference so that it can localize every area the object affects, rather than inpainting from a masked-out input alone.
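With paired clips available, a difference mask can be derived directly from the data. The sketch below is one simple way to do this, assuming float videos in [0, 1] with shape (T, H, W, C). The `difference_mask` function and the threshold value are illustrative assumptions, not the authors' implementation, and the sketch does not cover how ROSE uses the mask during training.

```python
import numpy as np


def difference_mask(video_with, video_without, threshold=0.05):
    """Return a boolean (T, H, W) mask of pixels that differ between paired clips.

    Both inputs are float arrays in [0, 1] with shape (T, H, W, C).
    The threshold is an illustrative assumption.
    """
    if video_with.shape != video_without.shape:
        raise ValueError("paired videos must have the same shape")
    # Per-pixel absolute difference, averaged over the colour channels.
    diff = np.abs(video_with - video_without).mean(axis=-1)
    return diff > threshold


# Toy pair: 2 frames of 4x4 RGB. The "object" covers one pixel and its
# "shadow" darkens the neighbouring pixel.
clean = np.full((2, 4, 4, 3), 0.8, dtype=np.float32)
with_object = clean.copy()
with_object[:, 1, 1, :] = 0.1  # object pixel
with_object[:, 1, 2, :] = 0.5  # shadow pixel

mask = difference_mask(with_object, clean)
print(mask.sum(axis=(1, 2)))  # number of affected pixels per frame
```

The resulting mask covers the shadow pixel as well as the object pixel, which is exactly the region an object-only mask would miss.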

ROSE-Bench: testing side effects

To evaluate their method, the authors also created ROSE-Bench, a benchmark specifically designed to assess video object removal with side effects, covering five key categories:

  • Shadows
  • Reflections
  • Illumination changes
  • Translucency
  • Mirrors

In experiments, ROSE outperformed prior methods in both quantitative metrics and qualitative comparisons, and generalized well to real-world videos despite never being trained on them directly.
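When an object-free ground-truth clip exists, as it does for synthetic pairs, a removal result can be scored with a full-reference metric. The sketch below uses PSNR as one common example of such a metric. It is an assumption chosen for illustration, not a statement of which metrics the authors report, and the `psnr` helper is hypothetical.

```python
import numpy as np


def psnr(prediction, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two videos of equal shape."""
    if prediction.shape != target.shape:
        raise ValueError("prediction and target must have the same shape")
    # Compute the mean squared error in float64 to limit rounding error.
    err = prediction.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(err ** 2)
    if mse == 0:
        return float("inf")  # identical videos
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy example: the ground truth is all zeros and the prediction is off by 0.1
# everywhere, so the mean squared error is 0.01.
target = np.zeros((2, 4, 4, 3))
prediction = np.full((2, 4, 4, 3), 0.1)

score = psnr(prediction, target)
print(round(score, 3))
```

Higher values mean the output is closer to the ground truth. A metric like this rewards removing shadows and reflections as well as the object, because leftover side effects raise the error against the object-free clip.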

Why synthetic data matters

The success of ROSE highlights an essential truth: in complex tasks, real data alone is not enough.

Only with 3D synthetic data was it possible to:

  • Create perfect video pairs with and without objects.
  • Capture rare but critical corner cases systematically.
  • Produce rich annotations instantly and at no manual cost.

This level of control over the scene makes synthetic data an indispensable tool for advanced computer vision.

And ROSE is not alone. Recently, Microsoft introduced DAViD, a multitask vision system trained entirely on synthetic data, with similarly strong results. Both works make it clear that synthetic data is no longer a supporting resource; it is becoming a foundation of state-of-the-art AI models.

Synthetic data at SynthVision

At SynthVision, we embrace the same philosophy: using controlled 3D environments to create tailored datasets with photorealistic quality, perfect annotations, and full flexibility over lighting, materials, geometry, and sensors.

We help companies to:

  • Build datasets for tasks that can’t be solved with real data alone.
  • Reproduce effects that are hard or impossible to capture physically, such as reflections, shadows, or transparency.
  • Accelerate R&D cycles and deliver stronger proofs of concept.

ROSE is a public demonstration of what we see every day in practice: those who master synthetic data today are shaping the future of computer vision.


Want to see how synthetic data can transform your project? At SynthVision, we create ready-to-use datasets for real-world pipelines — with photorealistic quality, precise annotations, and fast delivery.