@dair_ai
NEW research: Multi-agents for automating reliability engineering.

Cloud infrastructure fails constantly. Hundreds of machine failures. Thousands of disk failures. Software bugs. Misconfigurations. The scaling aspect is relentless and challenging.

The current approach to handling these failures relies heavily on human Site Reliability Engineers. But what if AI agents could handle this autonomously?

This new research introduces STRATUS, an LLM-based multi-agent system for autonomous reliability engineering. Multiple specialized agents handle failure detection, diagnosis, and mitigation without human intervention.

The key architectural insight in this paper: organize agents through state machines. This enables system-level safety reasoning that single-agent approaches lack. Each agent specializes in one aspect of the reliability pipeline while the state machine coordinates their actions.

What prevents agents from making things worse? The authors introduce Transactional No-Regression (TNR), a formal specification ensuring mitigation attempts never introduce regressions. Agents can explore solutions iteratively without compromising system stability.

Results on AIOpsLab and ITBench benchmarks: STRATUS outperforms existing SRE agents by at least 1.5x on success rate metrics, with consistency across different underlying models.

Autonomous reliability engineering isn't just about speed. It's about scale. Human SREs will always be bottlenecked by attention and availability. Multi-agent systems with formal safety guarantees can operate continuously across an infrastructure that no human team could monitor comprehensively.

Paper: https://t.co/2BaN1mjaQw

Learn to build effective AI Agents in our academy: https://t.co/zQXQt0PMbG
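
To make the state-machine coordination and the no-regression idea above a bit more concrete, here is a minimal Python sketch. It is not the STRATUS implementation and not the paper's formal TNR specification. The state names, the `severity` score, the dictionary-based system model, and the snapshot-and-rollback helper `try_mitigation` are all assumptions made for illustration. The only idea it tries to capture is that each mitigation runs as a transaction that is undone if it leaves the system worse off.

```python
from copy import deepcopy
from enum import Enum, auto


class State(Enum):
    # Hypothetical pipeline stages; the paper's actual states may differ.
    DETECT = auto()
    DIAGNOSE = auto()
    MITIGATE = auto()
    DONE = auto()


def severity(system):
    """Assumed health metric: number of services not reporting "ok" (lower is better)."""
    return sum(1 for status in system.values() if status != "ok")


def try_mitigation(system, action):
    """Apply `action` as a transaction; roll back if severity increases."""
    snapshot = deepcopy(system)
    before = severity(system)
    action(system)
    if severity(system) > before:
        # Regression detected: restore the pre-action state in place.
        system.clear()
        system.update(snapshot)
        return False
    return True


def run_pipeline(system, candidate_actions):
    """Drive the stages with a small state machine and return the visited states."""
    state = State.DETECT
    pending = list(candidate_actions)
    trace = []
    while state is not State.DONE:
        trace.append(state)
        if state is State.DETECT:
            state = State.DIAGNOSE if severity(system) > 0 else State.DONE
        elif state is State.DIAGNOSE:
            # A real diagnosis agent would rank hypotheses; here we take them in order.
            state = State.MITIGATE if pending else State.DONE
        elif state is State.MITIGATE:
            try_mitigation(system, pending.pop(0))
            state = State.DETECT  # re-check health after every attempt
    return trace


def bad_fix(system):
    system["api"] = "down"  # makes things worse, so it gets rolled back


def good_fix(system):
    system["db"] = "ok"


system = {"api": "ok", "db": "down"}
trace = run_pipeline(system, [bad_fix, good_fix])
print(system)
print([s.name for s in trace])
```

In this toy run the first candidate fix is rolled back because it raises the severity score, and the second one resolves the failure. The point of routing everything through `try_mitigation` is that the agents can keep trying candidates without the system ever ending a step in a worse state than it started.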