@dair_ai
Are LLMs any good for web API integrations? While we see a lot of fancy demos, the reality is that LLMs still largely struggle with them. The default assumption is that code models can handle API calls reliably; after all, they excel at general code completion. But web APIs have unique challenges that break this assumption.

This new research introduces WAPIIBench, a benchmark for evaluating LLM-generated web API invocation code across four real-world APIs: Asana, Google Calendar, Google Sheets, and Slack. None of the evaluated open-source models solved more than 40% of tasks. Even when given the correct endpoint, models still generated illegal arguments 6-31% of the time, and URLs were hallucinated 14-39% of the time.

Why is this so hard? Web API invocations differ from regular function calls in critical ways. Operations are identified by an HTTP method plus a long URL string, not a simple function name. Multiple argument lists exist across body, header, and query locations. Parameters have complex nested data types. And API specifications are documented externally, limiting what models can memorize.

The researchers propose a solution: constrained decoding. They automatically translate OpenAPI specifications into regex-based constraints that filter token predictions during generation. The constraints enforce compliance with the API spec without requiring model modifications or prompt adjustments.

Constrained decoding improves correctness by 90% on average for full completion and by 135% for argument completion. Illegal URLs, methods, and arguments drop to zero, and models that previously generated no executable code now reach rates similar to the other models.

Great read for AI devs.

Paper: https://t.co/OXFKJRMmJc

Learn to build effective AI agents in our academy: https://t.co/zQXQt0PMbG
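To make the "multiple argument locations" point concrete, here is a minimal sketch of a single web API invocation in Python using only the standard library. The endpoint, fields, and token are hypothetical placeholders, not taken from the paper or from any real Asana/Slack route:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical task-creation endpoint (illustrative only): the operation
# is identified by method + URL, and arguments are split across the
# query, header, and body locations the post describes.
base = "https://api.example.com/v1/tasks"

query = {"workspace": "123", "limit": "10"}        # query arguments
headers = {
    "Authorization": "Bearer <TOKEN>",             # header argument
    "Content-Type": "application/json",
}
body = {"data": {"name": "Write report",           # nested body argument
                 "assignee": {"gid": "456"}}}      # with complex types

# Build (but don't send) the request: every piece an LLM must get right.
req = Request(
    url=f"{base}?{urlencode(query)}",
    data=json.dumps(body).encode("utf-8"),
    headers=headers,
    method="POST",
)
```

A model has to produce the method, the exact URL path, and three separate argument lists correctly at once, which is exactly where the benchmark finds 6-31% illegal arguments and 14-39% hallucinated URLs.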
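The constrained-decoding idea can be sketched as a prefix filter over the model's token candidates. This toy version is an assumption-laden simplification: it replaces the paper's regex machinery (derived from OpenAPI specs) with a finite set of allowed invocations, and the "model" is a mock ranker rather than real LLM logits:

```python
# Allowed invocations, as they might be derived from an OpenAPI spec
# (hypothetical routes, not the paper's four APIs).
ALLOWED = ["GET /users/{id}", "POST /users", "DELETE /users/{id}"]

def valid_prefix(text):
    """Can `text` still be extended into some allowed invocation?"""
    return any(a.startswith(text) for a in ALLOWED)

def constrained_decode(rank_tokens, max_steps=10):
    """Greedy decoding that keeps only tokens preserving legality."""
    out = ""
    for _ in range(max_steps):
        # Filter the ranked candidates down to those whose addition
        # leaves the output a prefix of some legal invocation.
        legal = [t for t in rank_tokens(out) if valid_prefix(out + t)]
        if not legal:
            break
        out += legal[0]          # take the best-ranked legal token
        if out in ALLOWED:       # a complete invocation was produced
            return out
    return out

# Mock "model": prefers an illegal method and an illegal path first.
def mock_rank(prefix):
    return ["PUT", "/accounts", "POST", " /users", "/{id}", "GET", "DELETE"]
```

Calling `constrained_decode(mock_rank)` returns `"POST /users"`: the ranker's preferred but illegal tokens (`"PUT"`, `"/accounts"`) are filtered out at each step, mirroring how the paper's constraints drive illegal URLs, methods, and arguments to zero without touching the model itself.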