@omarsar0
New research from Apple.

Diffusion models dominate video generation, but the current approach has fundamental limitations: multi-step sampling, no exact likelihood, and training and inference objectives that don't align.

This new research introduces STARFlow-V, a novel normalizing flow-based causal video generator. It demonstrates that flow models can match diffusion quality while offering end-to-end training, exact likelihood estimation, and native multi-task support.

The architecture uses a global-local two-level system. A deep autoregressive Transformer handles temporal reasoning in a compressed latent space, while shallow flow blocks independently model within-frame structure. A learnable causal denoiser bridges training and inference through flow-score matching.

What enables practical video generation? Video-aware Jacobi iteration. This technique allows parallel latent updates without breaking causality, making sampling efficient enough for real use.

The model scales to 7B parameters, trained on 70M text-video pairs and 400M text-image pairs, and generates 480p video at 16 fps.

The result is a single model that handles text-to-video, image-to-video, and video-to-video tasks, including inpainting, outpainting, and style transfer. The invertible structure enables this naturally, without task-specific heads.

Normalizing flows offer theoretical advantages that diffusion lacks: exact likelihood estimation, true end-to-end training, and unified multi-task capability.
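Why do normalizing flows give exact likelihoods where diffusion models only give bounds? Because the map is invertible, the change-of-variables formula applies: log p(x) = log p_z(f(x)) + log |det df/dx|. A minimal sketch with a diagonal affine flow (illustrative only; the paper's flow blocks are far deeper, and `f`, `scale`, and `shift` here are hypothetical stand-ins):

```python
import numpy as np

def f(x, scale, shift):
    # A simple invertible affine flow: z = scale * x + shift.
    # Invertibility is what makes the exact likelihood computable.
    return scale * x + shift

def log_prob(x, scale, shift):
    z = f(x, scale, shift)
    # Standard normal base density, evaluated elementwise at z.
    log_pz = -0.5 * (z ** 2 + np.log(2 * np.pi))
    # For a diagonal affine map, log |det df/dx| is the sum of log |scale|.
    log_det = np.log(np.abs(scale))
    return np.sum(log_pz + log_det)

# With scale=1, shift=0 this reduces to a standard normal log-density.
lp = log_prob(np.zeros(3), np.ones(3), np.zeros(3))
```

Training "end to end" then just means maximizing this quantity directly, with no surrogate objective separating training from inference.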
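The Jacobi-iteration idea can be illustrated with a toy causal recurrence (this is a generic sketch of the technique, not the paper's algorithm; `step` is a hypothetical causal transition standing in for the flow's latent update): instead of decoding positions one at a time, update all positions in parallel each sweep and iterate to the same fixed point the sequential decoder would reach.

```python
import numpy as np

def step(prev):
    # Hypothetical contractive causal transition (|derivative| <= 0.5),
    # so the Jacobi fixed-point iteration is guaranteed to converge.
    return np.tanh(0.5 * prev + 0.1)

def sequential_decode(T, x0=0.0):
    # Baseline: strictly left-to-right, one position per step.
    x = np.zeros(T)
    x[0] = step(x0)
    for t in range(1, T):
        x[t] = step(x[t - 1])
    return x

def jacobi_decode(T, x0=0.0, tol=1e-8, max_sweeps=100):
    # Jacobi iteration: start from a guess for ALL positions, then
    # repeatedly recompute every position in parallel from the
    # previous sweep's values until the sequence stops changing.
    x = np.zeros(T)
    for _ in range(max_sweeps):
        prev = np.concatenate(([x0], x[:-1]))  # causally shifted inputs
        new = step(prev)                       # all T positions at once
        if np.max(np.abs(new - x)) < tol:
            return new
        x = new
    return x

# Parallel sweeps converge to the same answer as sequential decoding,
# so causality is respected even though updates happen in parallel.
assert np.allclose(sequential_decode(8), jacobi_decode(8))
```

Each sweep propagates correct values at least one position further right, so the iteration needs far fewer sweeps than there are positions once early positions stabilize.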