@ActuallyIsaak
Introducing the MLX-Benchmark Suite! https://t.co/sp4ZMIBxov

The first comprehensive benchmark for evaluating LLMs on Apple's MLX framework.

What is this?
MLX Benchmark is a CLI tool and dataset that measures how well large language models understand, write, and debug code for Apple's MLX machine learning framework, covering everything from core array operations to LoRA fine-tuning with mlx-lm, mlx-vlm, and mlx-embeddings. (A tiny illustrative snippet follows below.)

Dataset: https://t.co/5b04a7PKAp
- 520 questions across 6 task types: knowledge QA, multiple choice, true/false, fill-in-the-blank, code generation, and debugging
- 11 categories spanning the full MLX ecosystem: mlx_core, mlx_nn, mlx_lm, mlx_lm_lora, mlx_vlm, mlx_embeddings, mlx_embeddings_lora, mlx_optimizers, coding, debugging, conceptual
- 4 difficulty levels: easy → medium → hard → very-hard
- 90+ subcategories, from array_creation to lora_finetuning

Features
- Multi-provider benchmarking: Ollama, Anthropic, OpenAI, Groq, OpenRouter
- LLM-as-judge evaluation: strict scoring with an independent judge model
- Fine-grained filtering by type, difficulty, and category
- LaTeX export: --latex generates publication-ready booktabs tables
- PNG chart export: --plot generates grouped bar charts comparing models

A detailed paper is coming as well!
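To give a flavor of what an mlx_core task asks a model to write, here's a tiny MLX snippet. This is my own illustration of the kind of code the benchmark covers, not an item from the dataset:

```python
# Illustrative only: the sort of core array operation the benchmark
# asks models to write. Not a dataset item.
import mlx.core as mx

a = mx.array([[1.0, 2.0], [3.0, 4.0]])
b = mx.transpose(a)
c = mx.matmul(a, b)  # lazy: nothing has been computed yet
mx.eval(c)           # force evaluation on the default (Metal) device
print(c)
```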
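The thread doesn't show the dataset schema, but given the task types, categories, subcategories, and difficulty levels listed above, a single entry plausibly looks something like this. Every field name here is my guess, not the published format:

```python
# Hypothetical sketch of one dataset entry; field names are assumptions,
# not the actual MLX-Benchmark schema.
example_question = {
    "type": "code_generation",        # one of the 6 task types
    "category": "mlx_core",           # one of the 11 categories
    "subcategory": "array_creation",  # one of the 90+ subcategories
    "difficulty": "easy",             # easy / medium / hard / very-hard
    "prompt": "Create a 3x3 identity matrix using mlx.core.",
    "reference_answer": "import mlx.core as mx\nmx.eye(3)",
}
```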
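On LLM-as-judge evaluation: the idea is that an independent model, not the one being tested, grades each answer against the reference. The suite's actual judging code isn't shown in the thread; here's a minimal sketch of the pattern, assuming Anthropic as the judge provider. The judge_answer helper, the prompt wording, the 0-10 scale, and the model choice are all my assumptions:

```python
# Minimal LLM-as-judge sketch. judge_answer, the prompt, the scale, and
# the model name are hypothetical, not the suite's implementation.
import anthropic

def judge_answer(question: str, reference: str, candidate: str) -> int:
    """Ask an independent judge model to score a candidate answer 0-10."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    response = client.messages.create(
        model="claude-sonnet-4-5",  # example judge; any strong model works
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                "Score the candidate answer from 0 (wrong) to 10 (perfect), "
                "judging strictly against the reference. "
                "Reply with the number only.\n"
                f"Question: {question}\n"
                f"Reference: {reference}\n"
                f"Candidate: {candidate}"
            ),
        }],
    )
    return int(response.content[0].text.strip())
```

Keeping the judge independent of the benchmarked model avoids self-grading bias, which is why strict scoring with a separate judge matters.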