@dair_ai
New research on evaluating coding agents via continuous integration.

Coding agents are moving beyond isolated bug fixes. If they're going to own CI pipelines, we need benchmarks that reflect the actual complexity of codebase maintenance.

Most coding agent benchmarks today test whether an agent can fix a single issue. But real software engineering involves maintaining entire codebases over time.

SWE-CI evaluates agent capabilities through continuous integration workflows: running test suites, catching regressions, and maintaining code quality across multiple changes.

Paper: https://t.co/p8bOTJ9QPX

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c