@_philschmid
How does quantization impact the performance of LLMs? Only minimally! 🤯 @neuralmagic ran 500,000 evaluations on @AIatMeta Llama models using different quantization strategies. The accuracy impact is <1%, while the benefits are up to 2.4x faster inference and 3.5x smaller models! 🔥

TL;DR:
💯 Quantized models achieve 99% accuracy recovery compared to full precision
🚀 Up to 2.4x speedup and 3.5x model size reduction with quantization
📊 Tested Llama 3.1 8B, 70B, and 405B on the OpenLLM Leaderboard, ArenaHard, HumanEval, and text similarity metrics
🥇 W8A8-FP8 dynamic quantization yields the best results
🤗 Quantized models available on @huggingface
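For intuition on why accuracy recovery stays so high: in a W8A8-style scheme, both weights and activations are quantized to 8 bits, with activation scales computed "dynamically" from each tensor at runtime. Here is a minimal NumPy sketch of dynamic symmetric int8 quantization on random data — purely illustrative, not Neural Magic's actual kernels (which use FP8 on supported hardware), and the `quantize_int8` helper is a name I made up for this example:

```python
import numpy as np

def quantize_int8(x):
    # Dynamic per-tensor symmetric quantization:
    # the scale is derived from the tensor's own max magnitude.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in "weights"
a = rng.standard_normal((8, 256)).astype(np.float32)    # stand-in "activations"

qw, sw = quantize_int8(w)
qa, sa = quantize_int8(a)

# W8A8-style matmul: int8 inputs, int32 accumulation, rescale at the end.
y_quant = (qa.astype(np.int32) @ qw.T.astype(np.int32)) * (sa * sw)
y_full = a @ w.T

rel_err = np.linalg.norm(y_quant - y_full) / np.linalg.norm(y_full)
print(f"relative error vs. full precision: {rel_err:.4f}")
```

Even this naive per-tensor version lands within a few percent of full precision on random data, while storing weights in a quarter of the bytes — real deployments add per-channel scales and calibration to close the gap further.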