@dair_ai
RT @omarsar0: New research from Google DeepMind. Really interesting paper on diffusion models.

Training good latents for diffusion models is harder than it looks. The standard approach uses a KL penalty borrowed from VAEs, with no principled way to control how much information actually lives in the latent space.

This new research introduces Unified Latents (UL), a framework that co-trains a diffusion prior on the latents. This provides a tight upper bound on latent bitrate and makes the reconstruction-generation tradeoff explicit and, most importantly, tunable.

On ImageNet-512, UL achieves FID 1.4 while requiring fewer training FLOPs than Stable Diffusion latents. On Kinetics-600, it sets a new state-of-the-art FVD of 1.3 for video generation.

The latent space is one of the most overlooked design decisions in diffusion-based generation. UL gives practitioners a principled handle on it, for both images and video.

Paper: https://t.co/E1HCf9QzB4
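For context, a minimal sketch of the VAE-style KL penalty the post contrasts against: the closed-form KL between a diagonal Gaussian posterior and a standard normal prior, which penalizes latent information only indirectly. The function name and toy values here are illustrative, not from the paper; UL's co-trained diffusion-prior objective itself is not shown in the post.

```python
import math

def gaussian_kl(mu, logvar):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over
    # latent dimensions. This is the standard VAE-style penalty the post
    # says gives no principled control over latent bitrate.
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, logvar))

# Toy 3-dim latent posterior (illustrative values).
mu = [0.0, 0.5, -0.5]
logvar = [0.0, 0.0, 0.0]

kl_nats = gaussian_kl(mu, logvar)
# Converting nats to bits shows why the KL is read as an (imprecise)
# proxy for how much information the latent carries.
kl_bits = kl_nats / math.log(2)
```

The contrast drawn in the post is that UL replaces this fixed-prior penalty with the loss of a diffusion prior trained jointly on the latents, turning the same quantity into a tight, tunable upper bound on latent bitrate.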