@_philschmid
New embedding models for code released by @awscloud! Embedding models are at the heart of every RAG application. Without good embeddings, retrieving relevant context to answer your user prompts is impossible.

Super exciting to see Amazon release CodeSage, a family of open code embedding models with an encoder architecture that supports a wide range of source code understanding tasks.

TL;DR:
- Comes in 3 sizes: 130M, 356M, 1.3B
- Pre-trained on @BigCodeProject The Stack (237 million code files)
- Fine-tuned on 75 million bimodal (code and natural language) pairs
- Using hard negatives and hard positives improves MAP by more than 10%
- Uses the @BigCodeProject StarCoder tokenizer
- Licensed under Apache 2.0
- Outperforms @OpenAI and others on 0-shot code search
- SOTA performance on NL2Code (natural language to code)
- Available on @huggingface and supported in Sentence Transformers
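
To make the "embeddings are at the heart of RAG" point concrete, here is a minimal sketch of the retrieval step: rank code snippets by cosine similarity to a query vector. The 3-dimensional vectors and snippet names are made up for illustration. In a real setup they would come from an embedding model such as CodeSage, and the helper names `cosine_similarity` and `rank_snippets` are hypothetical, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_snippets(query_vec, snippet_vecs):
    """Return (snippet, score) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in snippet_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy, hand-made vectors (assumption): stand-ins for real model embeddings.
snippet_vecs = {
    "def read_json(path): ...": [0.9, 0.1, 0.0],
    "def sort_list(items): ...": [0.1, 0.9, 0.1],
    "def http_get(url): ...": [0.0, 0.2, 0.9],
}
# Stand-in for the embedding of the query "load a JSON file".
query_vec = [0.8, 0.2, 0.1]

ranking = rank_snippets(query_vec, snippet_vecs)
for name, score in ranking:
    print(f"{score:.3f}  {name}")
```

The top-ranked snippet is what a RAG pipeline would pass to the LLM as context, which is why embedding quality directly bounds answer quality.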