@ChrisLaubAI
BREAKING: Alibaba tested 18 AI coding agents on 100 real codebases, each spanning 233 days.

they failed spectacularly.

turns out passing tests once is easy. maintaining code for 8 months without breaking everything is where AI completely collapses.

SWE-CI is the first benchmark that measures long-term code maintenance instead of one-shot bug fixes. each task tracks 71 consecutive commits of real evolution.

75% of models break previously working code during maintenance. only Claude Opus 4.5 and 4.6 stay above a 50% zero-regression rate. every other model accumulates technical debt that compounds with every single iteration.

here's the brutal part:
- HumanEval and SWE-bench measure "does it work right now"
- SWE-CI measures "does it still work after 8 months of changes"

agents optimized for snapshot testing write brittle code that passes tests today but becomes completely unmaintainable tomorrow.

they built EvoScore to weight later iterations heavier than early ones. agents that sacrifice code quality for quick wins get punished when the consequences compound.

the AI coding narrative just got more honest. most models can write code. almost none can maintain it.
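the EvoScore idea, scoring later iterations heavier than early ones, might look roughly like this. a minimal sketch with assumed linear weights over per-iteration pass rates; `evo_score` and the weighting scheme are illustrative, not the benchmark's actual formula:

```python
# hypothetical sketch of an EvoScore-style metric: later iterations
# count more, so early wins can't hide late regressions.
# the linear weights are an assumption, not SWE-CI's published formula.

def evo_score(iteration_pass_rates):
    """Weighted mean of per-iteration pass rates, with the weight
    growing linearly with the iteration index (1, 2, 3, ...)."""
    weights = [i + 1 for i in range(len(iteration_pass_rates))]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, iteration_pass_rates)) / total

# an agent that starts strong but regresses scores worse than a steady one:
declining = evo_score([1.0, 0.9, 0.5, 0.2])  # early wins, late breakage
steady = evo_score([0.7, 0.7, 0.7, 0.7])     # consistent maintenance
```

under this weighting, the declining agent lands well below the steady one even though its unweighted average is similar, which is exactly the "quick wins get punished" effect described above.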