@c__byrne
I think the popularity of systems like this exists for a reason. Video models need prompts that encode scene structure (camera motion, shot composition, character identity, scene layout), basically filmmaking constraints. There's also a real gap between how users prompt and the vocabulary/semantics of the text encoders. Newer models and pipelines like LTX-2 (and proprietary ones for a while already) straight up ship with auto prompt enhancement; that's implemented natively in ComfyUI, btw. Most of the skill itself is just deterministic heuristics anyway (camera rules, style anchors, prompt length constraints). Big picture, we probably move toward structured intermediates like scene graphs and more LLM prompt normalization/unpacking, generally decoupling prompt enhancement from generation. Am I wrong?
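To make the "deterministic heuristics" point concrete, here's a minimal sketch of what I mean: the rule tables, anchor list, and word cap below are all made-up illustrations, not any real pipeline's implementation.

```python
# Hypothetical deterministic prompt-enhance heuristics: camera rules,
# style anchors, and a prompt length constraint. Everything here is
# illustrative, not taken from LTX-2 or ComfyUI.

CAMERA_RULES = {           # terse camera hints -> fuller phrasing
    "pan": "slow lateral camera pan",
    "zoom": "gradual dolly zoom",
}
STYLE_ANCHORS = ["cinematic lighting", "35mm film grain"]
MAX_WORDS = 60             # hard cap on enhanced prompt length

def enhance(prompt: str) -> str:
    # Camera rules: expand each terse camera keyword in place.
    words = [CAMERA_RULES.get(w.lower(), w) for w in prompt.split()]
    out = " ".join(words)
    # Style anchors: append any anchor not already present.
    for anchor in STYLE_ANCHORS:
        if anchor not in out.lower():
            out += f", {anchor}"
    # Length constraint: truncate to MAX_WORDS words.
    return " ".join(out.split()[:MAX_WORDS])

print(enhance("a knight rides through fog, pan"))
```

Nothing in there needs a model at all, which is why I think this layer ends up decoupled from generation.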