Three AI coding-agent configurations were evaluated on their ability to fix real-world bugs in the Kubernetes repository: RAG Only, Hybrid (RAG + Local), and Local Only. Each agent was given an issue description and asked to produce a patch.
Overview
The agents were tested on a set of real, in-flight bug fixes from the Kubernetes repository, spanning components such as the kubelet, the scheduler, networking, storage, and apps. The results showed that while the agents can produce correct fixes, they often struggle to reason about the broader system and miss changes needed in dependent code elsewhere.
What it does
The RAG Only configuration used retrieval-augmented generation (RAG) to find relevant code snippets, the Local Only configuration relied solely on local file access, and the Hybrid configuration combined the two. RAG Only was consistently the fastest approach, averaging 1 minute 16 seconds of wall-clock time.
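The difference between the three configurations can be sketched as three context-gathering strategies over a toy in-memory "repo". This is an illustration only: the function names (`tokens`, `gather_rag`, `gather_local`, `gather_hybrid`), the sample files, and the token-overlap scoring are assumptions, not details from the evaluation.

```python
import re

# Toy stand-in for the repository; hypothetical paths and contents.
REPO = {
    "pkg/kubelet/eviction.go": "func evictPod() { /* ... */ }",
    "pkg/kubelet/eviction_test.go": "func TestEvictPod() { /* ... */ }",
    "pkg/scheduler/queue.go": "func Pop() { /* ... */ }",
}

def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a crude stand-in for a real retrieval model."""
    return set(re.findall(r"\w+", text.lower()))

def gather_rag(query: str, k: int = 1) -> list[str]:
    """RAG Only: a single ranked retrieval call, no repository walking."""
    ranked = sorted(REPO, key=lambda p: len(tokens(query) & tokens(REPO[p])),
                    reverse=True)
    return ranked[:k]

def gather_local(query: str) -> list[str]:
    """Local Only: an exploratory pass that greps every file for query terms."""
    return [p for p in REPO if tokens(query) & tokens(REPO[p])]

def gather_hybrid(query: str) -> list[str]:
    """Hybrid: seed with a RAG hit, then follow up with local reads of
    sibling files; each follow-up is another token-costing round-trip."""
    seeds = gather_rag(query, k=1)
    dirs = {p.rsplit("/", 1)[0] for p in seeds}
    return seeds + [p for p in REPO
                    if p.rsplit("/", 1)[0] in dirs and p not in seeds]
```

For a query like `gather_hybrid("evictPod flake")`, the hybrid strategy widens the single RAG hit on `eviction.go` to its sibling test file, which illustrates where its extra token cost comes from.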
Tradeoffs
The evaluation highlighted several tradeoffs between the approaches. The Hybrid approach consumed the most tokens, owing to repeated round-trips between RAG queries and local file reads. The RAG Only approach pulled in more new context via retrieval, while the Local Only approach made more exploratory calls instead.
The results also showed that agents tend to fix locally rather than systemically: they miss dependent changes elsewhere in the codebase and prefer adding new abstractions over reusing existing ones. Issue quality dominated everything else; well-specified issues flattened the differences between the approaches.
When to use it
The study suggests that AI agents are best used alongside human validation and review, and it underscores how much outcomes depend on well-specified bug reports. It also suggests that skills such as repo exploration strategies or architectural summarization could improve agent performance, though these would require continuous maintenance and updates to remain effective.
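One of the proposed skills, architectural summarization, could be as simple as handing the agent a file-count map of the repository's components before it starts exploring. A minimal sketch; the function name and output shape are assumptions, not part of the study:

```python
import os
from collections import Counter

def summarize_repo(root: str, depth: int = 2) -> dict[str, int]:
    """Hypothetical summarization skill: count files per component
    (directory prefix up to `depth`) so an agent sees the repo's
    shape before reading any individual file."""
    counts: Counter[str] = Counter()
    for dirpath, _dirs, files in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        parts = [] if rel == "." else rel.split(os.sep)
        key = "/".join(parts[:depth]) or "(root)"
        counts[key] += len(files)
    return dict(counts)
```

Run against a Kubernetes checkout, this would surface components like `pkg/kubelet` or `pkg/scheduler` by size; the catch the study notes is that any such summary goes stale as the tree evolves and needs ongoing upkeep.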
In conclusion, AI agents can be a useful bug-fixing tool, but their limitations and tradeoffs should be weighed carefully. By understanding those limits and pairing agents with human validation and review, developers can improve the efficiency and effectiveness of their bug-fixing workflows.