@Tu7uruu
Just dropped on HF! HunyuanVideo-Foley from Tencent AI Lab: an end-to-end Text-Video-to-Audio (TV2A) model that turns silent videos into lifelike soundscapes
> 100k-hour curated TV2A dataset built via an automated pipeline
> Modality-balanced MMDiT: dual-stream audio-video fusion + text cross-attention
> REPA loss: aligns internal states with self-supervised audio features → higher fidelity & stability
> DAC-VAE audio codec: 48 kHz, continuous latents, strong reconstruction across speech/music/SFX
> SOTA on Kling-Audio-Eval, VGGSound, and MovieGen-Audio-Bench (audio quality, semantic + temporal alignment)
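For anyone curious what a REPA-style loss looks like: the core idea is aligning the model's internal hidden states with frozen self-supervised audio features, typically via cosine similarity. A minimal NumPy sketch below — this is an illustrative toy, not the actual HunyuanVideo-Foley implementation; the function name, shapes, and the assumption that features are already projected to a shared dimension are all mine.

```python
import numpy as np

def repa_alignment_loss(hidden: np.ndarray, ssl_feat: np.ndarray) -> float:
    """Toy REPA-style alignment loss (hypothetical sketch).

    hidden:   (T, D) model hidden states, assumed already projected to the
              same dimension as the self-supervised features.
    ssl_feat: (T, D) frozen self-supervised audio features.
    Returns 1 - mean per-frame cosine similarity (near 0 when aligned).
    """
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    s = ssl_feat / np.linalg.norm(ssl_feat, axis=-1, keepdims=True)
    cos = np.sum(h * s, axis=-1)       # (T,) per-frame cosine similarity
    return float(1.0 - cos.mean())     # loss is near 0 for identical features

x = np.random.rand(8, 16)
loss_same = repa_alignment_loss(x, x)          # ~0: perfectly aligned
loss_rand = repa_alignment_loss(x, np.random.rand(8, 16))
```

In training, this term would be added to the diffusion objective so the denoiser's intermediate representations stay close to a strong pretrained audio encoder, which is the stability/fidelity benefit the thread mentions.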