@omarsar0
NEW Research from Apple.

When you think about it, RAG systems are fundamentally broken. Retrieval and generation are optimized separately: retrieval selects documents based on surface-level similarity, while generators produce answers without feedback about what information is actually needed.

There is an architectural mismatch. Dense retrievers rank documents in embedding space while generators consume raw text. This creates inconsistent representation spaces that prevent end-to-end optimization, redundant text processing that causes context overflow, and duplicated encoding for both retrieval and generation.

This new research introduces CLaRa, a unified framework that performs retrieval and generation over shared continuous document representations. They encode documents once into compact memory-token representations that serve both purposes. Instead of maintaining separate embeddings and raw text, documents are compressed into dense vectors that both the retriever and generator operate on directly.

This enables something previously impossible: gradients flowing from the generator back to the retriever through a differentiable top-k selector using Straight-Through estimation. The retriever learns which documents truly enhance answer generation rather than relying on surface similarity.

To make compression work, they introduce SCP, a pretraining framework that synthesizes QA pairs and paraphrases to teach the compressor which information is essential. Simple QA captures atomic facts, complex QA promotes relational reasoning, and paraphrases preserve semantics while altering surface form.

Results: At 16x compression, CLaRa-Mistral-7B surpasses the text-based DRO-Mistral-7B on NQ (51.41 vs 51.01 F1) and 2Wiki (47.18 vs 43.65 F1) while processing far less context. At 4x compression, it exceeds uncompressed text baselines by 2.36% on average with Mistral-7B.
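For anyone wondering what a straight-through top-k selector looks like in practice, here is a minimal pure-Python sketch. It is a generic illustration of the technique, not CLaRa's actual implementation: the forward pass returns a hard 0/1 selection mask, while the backward pass pretends the mask was a softmax over the retrieval scores. The function names, the `temperature` parameter, and the example scores are all assumptions made for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of retrieval scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hard_top_k_mask(scores, k):
    """Non-differentiable selection: 1.0 for the k highest scores, else 0.0."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    chosen = set(top)
    return [1.0 if i in chosen else 0.0 for i in range(len(scores))]


def straight_through_top_k(scores, k, temperature=1.0):
    """Hypothetical helper: hard mask forward, softmax gradient backward.

    Returns the hard mask plus a `backward` closure that maps a gradient
    w.r.t. the mask to a gradient w.r.t. the scores.
    """
    mask = hard_top_k_mask(scores, k)
    probs = softmax(scores, temperature)

    def backward(grad_mask):
        # Softmax Jacobian: d probs_i / d scores_j = probs_i * (delta_ij - probs_j) / T
        dot = sum(g * p for g, p in zip(grad_mask, probs))
        return [p * (g - dot) / temperature for g, p in zip(grad_mask, probs)]

    return mask, backward


# Made-up retrieval scores for four candidate documents.
scores = [2.0, 0.5, 1.0, -1.0]
mask, backward = straight_through_top_k(scores, k=2)
print(mask)  # [1.0, 0.0, 1.0, 0.0]

# Suppose the generator's loss says "more weight on document 0 would help".
grad_scores = backward([1.0, 0.0, 0.0, 0.0])
print(grad_scores)
```

The generator only ever sees the hard selection, but the retriever scores still receive a usable training signal. In an autograd framework the same trick is usually written as `hard + soft - stop_gradient(soft)`.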
Most notably, CLaRa trained with only weak supervision from next-token prediction outperforms fully supervised retrievers with ground-truth relevance labels. On HotpotQA, it achieves 96.21% Recall@5, exceeding BGE-Reranker (85.93%) by over 10 points despite using no annotated relevance data.

Well-trained soft compression can retain essential reasoning information while substantially reducing input length. The compressed representations filter out irrelevant content and focus the generator on reasoning-relevant context, leading to better generalization than raw text inputs.

Great read for AI devs. (bookmark it)

Paper: https://t.co/JtMukGVNwV

Learn to build with RAG and AI Agents in my academy: https://t.co/JBU5beIoD0