VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation

Anonymous submission

Abstract

Cinematic camera control relies on a tight feedback loop between director and cinematographer, where camera motion and framing are continuously reviewed and refined. Recent generative camera systems can produce diverse, text-conditioned trajectories, but they lack this "director in the loop" and have no explicit supervision of whether a shot is visually desirable. The resulting camera motion stays in-distribution yet suffers from poor framing, off-screen characters, and undesirable visual aesthetics.
In this paper, we introduce VERTIGO, the first framework for visual preference optimization of camera trajectory generators. Our framework leverages a real-time graphics engine (Unity) to render 2D visual previews from generated camera motion. A cinematically fine-tuned vision–language model then scores these previews using our proposed cyclic semantic similarity mechanism, which aligns renders with text prompts. This process provides the visual preference signals for Direct Preference Optimization (DPO) post-training.

Method Overview

We propose leveraging a real-time graphics engine (e.g., Unity) to generate previews, i.e., shot sequences rendered in the engine. A cinematically fine-tuned VLM then evaluates these previews using our cyclic semantic scoring mechanism: the VLM captions each rendered preview, and the score is the semantic similarity between that caption and the original prompt in embedding space. In this setup, the graphics engine functions as a 3D shooting stage, the camera generator plays the role of the cinematographer, and the VLM-based evaluator serves as the Director. We then employ Direct Preference Optimization (DPO) to post-train the camera trajectory generator on preference-ranked Unity previews, enabling the generator to better adhere to textual conditions, achieve more coherent framing, and produce trajectories with an improved aesthetic look and feel.
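As a rough illustration of the post-training objective, the sketch below computes the standard DPO loss for a single (preferred, rejected) pair of trajectories. It is a minimal sketch, not the authors' training code: the function name `dpo_loss`, the argument names, and the default `beta` are assumptions, and how the log-probabilities are obtained for a particular camera generator is left out.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    logp_*     : log-probability of the trajectory under the model being trained
    ref_logp_* : log-probability of the same trajectory under the frozen reference
    beta       : strength of the implicit constraint toward the reference (assumed value)
    """
    # Implicit reward margin between the preferred and the rejected sample.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss equals log(2) when the model and the reference agree, and drops
# below that once the model favors the preferred trajectory more than the
# reference does.
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In this reading, the "preferred" and "rejected" trajectories would come from ranking rendered previews by their preference score.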

Scoring Strategy

Preference Scoring Overview

We explore three preference scoring methods: (1) tag-consistency scoring; (2) direct scalar regression via RAFT-style fine-tuning on interpolated trajectories; and (3) cyclic semantic scoring, which compares semantic similarity in latent space.
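The third option, cyclic semantic scoring, can be sketched as follows. This is a toy illustration under stated assumptions: `embed` is a bag-of-words stand-in for a real sentence encoder, the captions are written by hand rather than produced by a VLM, and choosing the highest- and lowest-scoring candidates as the preference pair is an assumption rather than the documented pairing rule.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for a sentence encoder: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cyclic_score(prompt, caption):
    """Similarity between the original prompt and a caption of the render."""
    return cosine(embed(prompt), embed(caption))


def preference_pair(prompt, captions):
    """Return (preferred_id, rejected_id) from a dict of candidate captions."""
    ranked = sorted(captions, key=lambda cid: cyclic_score(prompt, captions[cid]))
    return ranked[-1], ranked[0]


prompt = "camera orbits left around the character"
captions = {
    "a": "camera orbits left around the character",
    "b": "static shot of an empty room",
}
pair = preference_pair(prompt, captions)
```

The "cycle" is prompt to trajectory to render to caption and back to the prompt: a trajectory scores well only if what it actually shows can be described in terms close to what was asked for.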

Experiment

Both quantitative and qualitative evaluations on Unity renders and diffusion-based Camera-to-Video pipelines show consistent gains in condition adherence, framing quality, and perceptual realism. Notably, VERTIGO reduces the character off-screen rate from 38% to nearly 0% while preserving the geometric fidelity of camera motion. User study participants further prefer VERTIGO over baselines across composition, consistency, prompt adherence, and aesthetic quality, confirming the perceptual benefits of our visual preference post-training.
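One plausible way to compute a character off-screen rate is to project a character reference point into each frame and count the frames where it falls outside the image. The sketch below assumes a pinhole camera with the principal point at the image center, points already expressed in camera coordinates with +z pointing forward, and a single reference point per frame. The exact definition used in the evaluation is not given here, so the function names and parameters are illustrative only.

```python
def is_on_screen(point_cam, fx, fy, width, height):
    """Check whether a camera-space point projects inside the image.

    Assumes a pinhole model with the principal point at the image center.
    """
    x, y, z = point_cam
    if z <= 0:
        # Point is behind the camera (or on the camera plane).
        return False
    u = fx * x / z + width / 2
    v = fy * y / z + height / 2
    return 0 <= u < width and 0 <= v < height


def off_screen_rate(points_cam, fx, fy, width, height):
    """Fraction of frames whose character point is not visible."""
    if not points_cam:
        return 0.0
    off = sum(not is_on_screen(p, fx, fy, width, height) for p in points_cam)
    return off / len(points_cam)


# Four frames: two visible, one far off to the side, one behind the camera.
frames = [(0.0, 0.0, 5.0), (10.0, 0.0, 1.0), (0.0, 0.0, -2.0), (0.5, 0.2, 4.0)]
rate = off_screen_rate(frames, fx=500.0, fy=500.0, width=640, height=360)
```

A stricter variant could test the character's full bounding box instead of a single point, which would also penalize partially cropped subjects.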

To quantitatively compare camera trajectory and video quality, we evaluate geometric fidelity using CLaTr-based metrics and visual quality via VBench metrics on both Unity-rendered sequences and controllable video generation. VERTIGO achieves competitive geometric performance while substantially improving visual metrics and target framing stability.

The figure above shows a qualitative comparison of camera generators. VERTIGO accurately adheres to spatial composition instructions and maintains framing, whereas GenDoP and DIRECTOR misplace subjects and occasionally lose them.

Qualitative Comparisons

We provide twelve representative comparisons across three prompts and four camera generators. Each clip is shown independently here for quick browsing, while the paired Unity and video-generation gallery remains below.

[Video clips: three prompts, each shown for CCD, DIRECTOR, GenDoP, and VERTIGO.]

Conclusion

We present VERTIGO, a visual preference optimization framework that post-trains camera-trajectory generators using computer-graphics-rendered previews as visual reward signals. To tackle the challenging problem of evaluating trajectories through their resulting visual content, we introduce a VLM-based cyclic semantic scoring mechanism for assessing cinematographic quality, and we systematically compare it against alternative scoring designs through extensive ablations. Experiments show that our approach (1) sharply reduces poorly framed shots and improves prompt adherence while preserving high geometric fidelity, and (2) consistently enhances visual quality across both rendering-based and diffusion-based video generation pipelines. These improvements are corroborated by quantitative metrics and user studies. As the first framework to explicitly bridge geometric camera generation with the visual outcome of the shot, VERTIGO pushes computational cinematography toward practical, director-in-the-loop visual generation and broader real-world creative applications.