@hanlin_hl
Glad to share our new preprint from my Meta @AIatMeta internship and @uncnlp collaboration: VEDiT: Latent Prediction Architecture for Procedural Video Representation Learning

- A well-designed DiT-based prediction model + a strong off-the-shelf frozen visual encoder ➡️ SoTA on procedural learning tasks, without pretraining the prediction model or requiring additional supervision from language or ASR.
- Unlike image/video generative models that learn representations in pixel space, we predict visual representations entirely in the embedding space of publicly available vision encoders.

See more details in the paper: https://t.co/ExSHWRiMRU

Thread below 🧵