@omarsar0
Introducing Colossal-LLaMA-2, a huge release by Colossal-AI. They present an open-source, commercially usable domain-specific LLM solution for building your own large-scale models at a much lower cost. It uses only about 0.0085 trillion (8.5 billion) tokens of data, 15 hours of training, and a training cost of a few hundred dollars. This strategy produced a Chinese LLaMA-2 model that outperforms competitors across multiple evaluation benchmarks.

Lots of new improvements in this release, including:

- vocabulary expansion and model initialization to extend to Chinese while preserving English language capabilities
- a complete data cleaning system and toolkit for selecting higher-quality data used to train the models
- a multi-stage, hierarchical continual pre-training scheme: 1) large-scale pre-training, 2) a Chinese knowledge injection stage, and 3) a relevant knowledge replay stage; this approach ensures the model progresses equally in both Chinese and English abilities
- bucket training to ensure a balanced distribution of data

Personally, the most interesting bit of this release is the focus on, and possibility of, training lightweight domain-specific LLMs in a cost-effective way. This will unlock the ability to fine-tune these foundation models for all kinds of applications that meet specific business needs.

Check out the blog here: https://t.co/D5U2dBjcIx

ColossalAI repo: https://t.co/jatXbyQyby
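To make the vocabulary-expansion idea concrete, here is a minimal sketch in plain Python: new (e.g. Chinese) tokens are appended to an existing vocabulary, and each new embedding row is initialized to the mean of the existing rows — a common heuristic for extending a pretrained model without disturbing what it already knows. The function name and the mean-initialization choice are my assumptions for illustration, not Colossal-AI's actual implementation.

```python
# Illustrative sketch of vocabulary expansion, NOT Colossal-AI's code.
# New tokens get embedding rows initialized from the mean of existing
# rows (a common heuristic; the release may use a different scheme).

def expand_vocab(vocab, embeddings, new_tokens):
    """Append new tokens; init each new row to the mean of existing rows."""
    dim = len(embeddings[0])
    n = len(embeddings)
    mean_row = [sum(row[d] for row in embeddings) / n for d in range(dim)]
    vocab = dict(vocab)                       # copy, don't mutate inputs
    embeddings = [row[:] for row in embeddings]
    for tok in new_tokens:
        if tok not in vocab:                  # skip tokens already present
            vocab[tok] = len(embeddings)
            embeddings.append(mean_row[:])
    return vocab, embeddings

# Tiny example: a 3-token English vocab extended with two Chinese tokens.
vocab = {"hello": 0, "world": 1, "!": 2}
emb = [[1.0, 2.0], [3.0, 4.0], [2.0, 0.0]]
vocab, emb = expand_vocab(vocab, emb, ["你好", "世界"])
print(len(emb))            # 5 rows now
print(emb[vocab["你好"]])   # [2.0, 2.0], the mean of the original rows
```

In a real stack this corresponds to growing the tokenizer and resizing the model's input/output embedding matrices before continual pre-training.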
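The bucket-training point can also be sketched: one plausible reading is that samples from different data sources (buckets) are interleaved round-robin, so every slice of the training stream sees a balanced mix rather than long runs from a single source. This is an assumption about what "bucket training" means here, not the release's actual scheduler.

```python
# Illustrative sketch of bucket-style interleaving, NOT Colossal-AI's
# scheduler: draw one sample from each non-empty bucket per round so
# the training stream stays balanced across data sources.

def bucket_interleave(buckets):
    """Yield samples round-robin across buckets until all are exhausted."""
    iters = [iter(b) for b in buckets]
    while iters:
        alive = []
        for it in iters:
            try:
                yield next(it)
                alive.append(it)      # bucket still has samples left
            except StopIteration:
                pass                  # drop exhausted buckets
        iters = alive

english = ["en_0", "en_1", "en_2"]
chinese = ["zh_0", "zh_1", "zh_2"]
stream = list(bucket_interleave([english, chinese]))
print(stream)  # ['en_0', 'zh_0', 'en_1', 'zh_1', 'en_2', 'zh_2']
```

With unequal bucket sizes, smaller buckets simply run dry and the stream continues from the rest, so no single source ever dominates a contiguous stretch.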