@omarsar0
Great read for AI devs. (bookmark it)

LLM agents are slow. The bottleneck in complex agentic systems today is planning: plan generation alone can take 25+ seconds per task request, and that compounds fast at scale.

Analysis of a real-world dataset shows that about 30% of requests received by LLM-driven agents are semantically identical or similar.

This new paper introduces AgentReuse, a plan reuse mechanism that caches and retrieves previously generated plans based on semantic similarity.

Two requests like "Book a ticket from Hefei to Beijing for tomorrow" and "Book a ticket from Changsha to Shanghai for Friday" differ in parameters but share an identical task structure. The plan steps are the same; only the key values change.

Building on this insight, AgentReuse separates intent from parameters. It extracts the key parameters (time, origin, destination), classifies the intent, and performs similarity matching on the parameter-stripped request. When a match exists, it injects the new parameters into the cached plan and executes directly.

On a real-world dataset of 2,664 task requests, AgentReuse achieves a 93% effective plan reuse rate, an F1 score of 0.9718, and an accuracy of 0.9459. Latency drops by 93.12% compared to no caching and by 60.61% compared to GPTCache.

The overhead is minimal: ~100MB of additional VRAM, less than 1MB of memory per request, and under 10ms of processing latency per request.

Plan generation that previously took 25-30 seconds becomes a cache lookup. Agents don't need to regenerate plans for structurally similar tasks. Semantic caching at the plan level, not the response level, unlocks a massive latency reduction while preserving accuracy for dynamic, real-time information.

I am sure this can inspire more general patterns that speed up coding agents. Remains to be seen, but it seems like a cool idea to apply in that domain.

Paper: https://t.co/oIF1o44Zrl

Learn to build effective AI agents in our academy: https://t.co/JBU5beIoD0
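The separate-intent-from-parameters idea above can be sketched in a few lines. Everything here (the regex slot extractors, the dict cache keyed on the stripped template, the placeholder syntax) is my own illustrative assumption, not the paper's actual implementation:

```python
# Hedged sketch of plan-level caching in the spirit of AgentReuse.
# Regexes, cache structure, and placeholder syntax are illustrative
# assumptions, not the paper's implementation.
import re

# Illustrative parameter slots for the travel-booking examples in the post.
PATTERNS = {
    "origin": re.compile(r"from (\w+)"),
    "destination": re.compile(r"to (\w+)"),
    "date": re.compile(r"for (\w+)"),
}

def strip_params(request: str):
    """Split a request into a parameter-free template plus its key values."""
    params, template = {}, request
    for name, pat in PATTERNS.items():
        m = pat.search(template)
        if m:
            params[name] = m.group(1)
            template = template.replace(m.group(1), f"<{name}>", 1)
    return template, params

class PlanCache:
    """Cache plans keyed by the parameter-stripped request template."""
    def __init__(self, planner):
        self.planner = planner  # the slow LLM plan generator
        self.cache = {}         # template -> parameterized plan steps

    def get_plan(self, request: str):
        template, params = strip_params(request)
        if template not in self.cache:  # miss: pay the planning cost once
            self.cache[template] = self.planner(template)
        # hit: inject the new parameters into the cached plan steps
        return [step.format(**params) for step in self.cache[template]]
```

A real system would match templates by embedding similarity rather than exact string equality, but the miss/hit structure is the same: the expensive planner runs once per task structure, and structurally similar requests become cache lookups.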