AIxiv is the column through which this site publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
The authors of this paper are all from Prof. Zhang Lingming's group at the University of Illinois Urbana-Champaign (UIUC): Steven Xia, a fourth-year PhD student whose research focuses on automatic code repair with large AI models; Yinlin Deng, a fourth-year PhD student whose research focuses on code generation with large AI models; and Soren Dunn, a research intern and current third-year undergraduate at UIUC. Prof. Zhang Lingming is an associate professor in the Department of Computer Science at UIUC, working on software engineering, machine learning, and large code models.
For more details, see Prof. Zhang's personal homepage: https://lingming.cs.illinois.edu/
Since the debut of Devin, billed as the first fully autonomous AI software engineer, agent design for AI-driven software engineering has become a research focus. A growing number of agent-based AI software engineers have been proposed, achieving strong performance on the SWE-bench dataset and automatically repairing many real GitHub issues.
However, complex agent systems bring extra overhead and uncertainty. Do we really need such complex agents to solve GitHub issues? Can an agentless solution come close to their performance?
Starting from these two questions, Prof. Zhang Lingming's team at the University of Illinois Urbana-Champaign (UIUC) proposed OpenAutoCoder-Agentless, a simple, efficient, and fully open-source agentless solution that can solve a real GitHub issue for only $0.34. Agentless attracted more than 300 GitHub stars within just a few days and made the top three of DAIR.AI's weekly list of hottest ML papers.
Paper: AGENTLESS: Demystifying LLM-based Software Engineering Agents
Paper address: https://huggingface.co/papers/2407.01489
Open source code: https://github.com/OpenAutoCoder/Agentless
AWS research scientist Leo Boytsov commented: "The Agentless framework outperformed all open-source agent solutions, nearly reaching the top of SWE-bench Lite (27%), and it beat all open-source solutions at a significantly lower cost. The framework uses a hierarchical querying approach (asking the LLM to find files, classes, functions, etc.) to determine patch locations, but does not let the LLM make planning decisions."

Agentless is an automated approach to software development problems that uses a simple two-phase process to locate and fix bugs in a codebase. In the localization phase, Agentless hierarchically narrows down to suspicious files, then classes/functions, then specific edit locations. In the repair phase, it generates multiple candidate patches in a simple diff format (borrowed from the open-source tool Aider), then filters and ranks them.
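To make the two-phase design concrete, here is a minimal sketch of what such a localize-then-repair pipeline might look like. Everything in it (the `call_llm` stub, the prompts, and the helper functions) is a hypothetical illustration of the idea, not the authors' actual implementation, which lives in the open-source repository.

```python
# A minimal sketch of an Agentless-style localize-then-repair pipeline.
# All helpers and prompts here are hypothetical illustrations, not the
# actual Agentless code (see the open-source repo for that).
import re
from collections import Counter

def call_llm(prompt: str) -> str:
    """Stub for an LLM API call; swap in a real client to run this."""
    raise NotImplementedError

def extract_skeleton(source: str) -> str:
    """Compress a file to its class/def signatures, dropping bodies."""
    return "\n".join(line for line in source.splitlines()
                     if re.match(r"\s*(class |def )", line))

def localize(issue: str, repo: dict[str, str]) -> str:
    """Phase 1: hierarchically narrow repo tree -> files -> edit locations."""
    tree = "\n".join(sorted(repo))
    files = call_llm(
        f"Issue:\n{issue}\n\nRepository structure:\n{tree}\n\n"
        "List the files most likely to need changes."
    ).splitlines()
    skeletons = {f: extract_skeleton(repo[f]) for f in files if f in repo}
    return call_llm(
        f"Issue:\n{issue}\n\nFile skeletons:\n{skeletons}\n\n"
        "Name the classes/functions and line ranges that need edits."
    )

def applies_cleanly(patch: str, code: str) -> bool:
    """Filter: keep patches whose SEARCH blocks actually occur in the code."""
    searches = re.findall(r"<<<<<<< SEARCH\n(.*?)\n=======", patch, re.S)
    return bool(searches) and all(s in code for s in searches)

def repair(issue: str, code_context: str, n_samples: int = 8) -> str:
    """Phase 2: sample diff-format patches, filter, then majority-vote rank."""
    votes = Counter()
    for _ in range(n_samples):
        patch = call_llm(
            f"Issue:\n{issue}\n\nCode:\n{code_context}\n\n"
            "Propose a fix as SEARCH/REPLACE edit blocks."
        )
        if applies_cleanly(patch, code_context):
            # Normalize whitespace so equivalent patches vote together.
            votes["\n".join(l.strip() for l in patch.splitlines())] += 1
    return votes.most_common(1)[0][0]  # most frequent valid patch wins
```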
The researchers compared Agentless with existing AI software agents, covering both state-of-the-art open-source and commercial/closed-source projects. Surprisingly, Agentless outperforms all existing open-source software agents at a lower cost: it solved 27.33% of the problems, the highest among open-source solutions, at an average cost of $0.29 per solved problem and about $0.34 averaged across all problems (solved and unsolved alike).
Beyond that, Agentless has room to grow: when all generated patches are considered, it can solve 41% of the problems, an upper bound that indicates significant headroom in the patch ranking and selection stages. Furthermore, Agentless solves some problems that even the best commercial tool (Alibaba Lingma Agent) cannot, suggesting it can serve as a complement to existing tools.
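The gap between the selected-patch score and the any-patch upper bound is easy to quantify. Below is a toy illustration with made-up per-issue results (the instance IDs and numbers are purely hypothetical), showing how selection quality caps the resolved rate:

```python
# Toy, entirely hypothetical per-issue results: which sampled patches pass
# the tests, and which single patch the ranking stage actually selected.
results = {
    "repo__issue-1": {"passing": {2, 5}, "selected": 5},   # selection succeeds
    "repo__issue-2": {"passing": {1},    "selected": 3},   # wrong patch picked
    "repo__issue-3": {"passing": set(),  "selected": 0},   # no patch works
}

# Resolved rate as actually reported: the selected patch must pass.
selected_rate = sum(r["selected"] in r["passing"] for r in results.values()) / len(results)
# Oracle upper bound: any passing generated patch would count.
upper_bound = sum(bool(r["passing"]) for r in results.values()) / len(results)

print(f"with ranking/selection: {selected_rate:.0%}")  # 33%
print(f"oracle upper bound:     {upper_bound:.0%}")    # 67%
```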
Analysis of the SWE-bench Lite dataset
The researchers also manually inspected and analyzed the SWE-bench Lite dataset in detail.
The study found that 4.3% of the problems in SWE-bench Lite include the complete answer, i.e., the correct fix patch, directly in the issue description, and another 10% describe the exact steps to the correct solution. This suggests that some problems in SWE-bench Lite may be easier than they appear.
In addition, the team observed that 4.3% of issues include user-proposed solutions or steps in the issue description that are inconsistent with the developers' actual patches. This reveals another potential problem with the benchmark: such misleading solutions can cause an AI tool to generate an incorrect fix simply by following the issue description.
Regarding problem description quality, the researchers observed that although most SWE-bench Lite tasks contain sufficient information, and many also provide failing examples to reproduce the error, 9.3% of the problems still lack required information. For example, a task may require implementing a new function or adding an error message, but the issue description never gives the exact function name or error-message string. In such cases, even a correct implementation of the underlying functionality will fail the tests if the function name or error message does not match exactly.
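As a hypothetical illustration of this failure mode (the function, message, and test below are invented, not taken from the benchmark), a functionally correct fix can still fail when the hidden test pins an exact string:

```python
# Hypothetical example: the issue says "raise an error for negative input"
# but never specifies the exact message the hidden test expects.
import pytest

def sqrt_checked(x: float) -> float:
    if x < 0:
        # A perfectly reasonable message, just not the one the test pins.
        raise ValueError("input must be non-negative")
    return x ** 0.5

def test_negative_input():
    # The benchmark test matches the exact string, so this correct
    # implementation still fails: only the wording differs.
    with pytest.raises(ValueError, match="math domain error"):
        sqrt_checked(-1.0)
```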
Ofir Press, a researcher at Princeton University and one of the authors of SWE-bench, confirmed these findings: "Agentless did a good manual analysis of SWE-bench Lite. They believe the theoretical maximum score on Lite is probably 90.7%. I think the actual ceiling may be lower (around 80%). Some problems have insufficient information, and others are tested too strictly."
SWE-bench Lite-S: a stricter, filtered problem subset
To address these problems, the researchers proposed SWE-bench Lite-S, a stricter subset containing 252 problems. Specifically, they excluded from SWE-bench Lite (300 issues) any issue whose description contains the exact patch, a misleading solution, or insufficient information. This removes unreasonable problems and standardizes the benchmark's difficulty. Compared with the original SWE-bench Lite, the filtered benchmark more accurately reflects the true capabilities of automated software development tools.
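As a rough sketch of what such a filtering pass could look like in code (the label names and schema below are hypothetical; in the paper, the classification itself came from manual inspection):

```python
# Hypothetical sketch of deriving SWE-bench Lite-S from SWE-bench Lite.
# The exclusion criteria come from the paper's manual inspection; the
# label names and data schema here are illustrative only.
EXCLUDE_LABELS = {"exact_patch_in_issue", "misleading_solution", "insufficient_info"}

def build_lite_s(lite_issues: list[dict], labels: dict[str, set[str]]) -> list[dict]:
    """Keep only issues tagged with none of the exclusion labels."""
    return [issue for issue in lite_issues
            if not (labels.get(issue["instance_id"], set()) & EXCLUDE_LABELS)]

# In the paper: 300 Lite issues in, 252 Lite-S issues out.
```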
Conclusion
Although agent-based software development is very promising, the authors argue that it is time for the technical and research communities to pause and reconsider its key design and evaluation methods, rather than rushing to release more agents. The researchers hope Agentless can help reset the baseline and direction of agents in future software engineering.