


Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?
Building machines that can write their own code is a goal that pioneers in computer science and artificial intelligence have been pursuing. With the rapid development of GPT-type large models, such a goal is becoming closer than ever.
The emergence of large language models (Large Language Models) has attracted more and more attention from researchers to the programming capabilities of models. Under this situation, the APEX Laboratory of Shanghai Jiao Tong University launched CodeApex - a bilingual benchmark data set focused on assessing the programming understanding and code generation capabilities of LLMs.
To evaluate the programming understanding ability of large language models, CodeApex has designed three types of multiple-choice questions: conceptual understanding, common sense reasoning, and multi-hop reasoning. In addition, CodeApex also utilizes algorithmic questions and corresponding test cases to evaluate the code generation capabilities of LLMs. CodeApex evaluated a total of 14 large language models on coding tasks. Among them, GPT3.5-turbo shows the best programming ability, achieving approximately 50% and 56% accuracy on these two tasks respectively. It can be seen that large language models still have a lot of room for improvement in programming tasks. Building a machine that can write its own code is a very promising future.
- ## Website: https://apex.sjtu.edu.cn/codeapex/
- Code: https://github.com/APEXLAB/CodeApex.git
- Paper: https://apex.sjtu.edu.cn/codeapex/paper/
Introduction
Programming understanding and code generation are critical tasks in software engineering and play a key role in improving developer productivity, enhancing code quality, and automating the software development process. However, these tasks are still challenging for large models due to the complexity and semantic diversity of the code. Compared with ordinary natural language processing, using LLMs to generate code requires more emphasis on grammar, structure, detail processing and context understanding, and has extremely high requirements for the accuracy of the generated content. Traditional approaches include grammar rule-based models, template-based models, and rule-based models, which often rely on manually designed rules and heuristic algorithms that are limited in coverage and accuracy.
In recent years, with the emergence of large-scale pre-trained models such as CodeBERT and GPT3.5, researchers have begun to explore the application of these models in programming understanding and code generation tasks. These models integrate code generation tasks during training, allowing them to understand and generate code. However, a fair assessment of the progress of LLMs in code understanding and generation is difficult due to the lack of standard, publicly available, high-quality, and diverse benchmark datasets. Therefore, establishing a benchmark dataset that broadly covers code semantics and structure is crucial to promote research in programming understanding and code generation.
Existing code benchmark datasets have applicability and diversity issues when applied to LLMs. For example, some datasets are more suitable for evaluating Bert-type, bidirectional language modeling LLMs. However, existing multilingual code benchmark data sets (such as Human-Eval) contain relatively simple problems, lack diversity, and can only implement some basic functional codes.
In order to fill the above gaps, the APEX Data and Knowledge Management Laboratory of Shanghai Jiao Tong University built a new evaluation benchmark for large model code understanding and generation-CodeApex. As a groundbreaking bilingual (English, Chinese) benchmark dataset, CodeApex focuses on evaluating the programming understanding and code generation capabilities of LLMs.
The overall experimental scenario of CodeApex is shown in the picture above.
The first task, Programming Comprehension, includes 250 multiple-choice questions, divided into conceptual understanding, common sense reasoning and multi-hop reasoning. The questions used for testing are selected from the final exam questions of different courses (programming, data structures, algorithms) in colleges and universities, which greatly reduces the risk that the data is already in the LLMs training corpus. CodeApex tested the code understanding ability of LLMs in three scenarios: 0-shot, 2-shot, and 5-shot, and also tested the impact of Answer-Only and Chain-of-Thought modes on the ability of LLMs.
The second task code generation includes 476 C-based algorithm problems, covering common algorithm knowledge points, such as binary search, depth-first search, etc. CodeApex gives a description of the problem and a function prototype that implements the problem, and requires LLMs to complete the main part of the function. CodeApex also provides two scenarios: function-only and function-with-context. The difference between them is that the former only has a description of the target function, while the latter, in addition to the description of the target function, is also provided with the calling code and time of the target function. Space constraints, input and output description.
Experimental results show that different models perform differently in code-related tasks, and GPT3.5-turbo shows excellent competitiveness and obvious advantages. Furthermore, CodeApex compared the performance of LLMs in bilingual scenarios, revealing different results. Overall, there is still considerable room for improvement in the accuracy of LLMs in the CodeApex rankings, indicating that the potential of LLMs in code-related tasks has not yet been fully exploited.
Code Understanding
To fully integrate large language models into actual code production scenarios, programming understanding is essential. Programming understanding requires the ability to understand the code from all aspects, such as mastering the syntax, understanding the code execution flow, and understanding the execution algorithm.
CodeApex extracted 250 multiple-choice questions from college final exam questions as test data. These test data are divided into three categories: conceptual understanding, common sense reasoning, and multi-hop reasoning.
Test mode includes two categories: Answer-Only and Chain-of-Thought.
Experimental results and conclusions
The Chinese and English evaluation results of CodeApex on the code understanding task are as follows shown in the two tables. (The best performing model is shown in bold; the next best performing model is underlined.)
## The following conclusions can be drawn from it:
- Comparison of bilingual abilities. The Chinese version scored higher than the English version. There are two main reasons: (1) The source question descriptions come from the final exams of Chinese universities, so the test questions were originally presented in Chinese. Even if translated into English, they still contain some language habits unique to Chinese people. Therefore, when these biased English questions are input into LLMs, some noise may be introduced into the model's encoding results. (2) Most of the evaluated models are mainly trained on Chinese data, which leads to poor results.
- Comparison of abilities of different question types. Across these three problem categories, approximately half of the models performed best on conceptual understanding, suggesting that they likely contained knowledge of programming concepts while being trained. Most models score higher on commonsense reasoning compared to multi-hop reasoning, indicating that the power of LLMs decreases significantly with increasing inference steps.
- The role of CoT thinking chain model. The accuracy of most models in CoT mode is close to or lower than Answer-Only mode. There are two reasons for this phenomenon: (1) The evaluated model size does not reach the model size with CoT emergence capability. Previous research believed that the emergence of CoT requires LLMs to have at least 60B parameters. When the number of parameters is insufficient, the CoT setup may introduce additional noise and the response generated by LLMs is unstable. GPT3.5-turbo has reached the point of emergence of emergent capabilities and can achieve higher accuracy in CoT settings. (2) When answering conceptual understanding and common sense reasoning questions, multi-step reasoning is less necessary. Therefore, the CoT capabilities of LLMs cannot help with this type of problem. However, for multi-hop inference problems, some models (such as ChatGLM2, educhat, and GPT3.5-turbo) have significantly improved accuracy in CoT scenarios. (CodeApex excludes CodeT5 from the CoT setup due to its inability to generate responses via thought chains.)
Code Generation
Training large language models to generate accurate and executable code is a challenging task. CodeApex primarily evaluates the ability of LLMs to generate algorithms based on a given description and automatically evaluates the correctness of the generated code through unit tests.
CodeApex’s code generation tasks include 476 C-based algorithm problems, covering common algorithm knowledge points, such as binary search and graph algorithms. CodeApex gives a description of the problem and a function prototype that implements the problem, and requires LLMs to complete the main part of the function.
CodeApex provides two scenarios: Function-only and Function-with-context. The Function-only scenario only provides a description of the target function, while the Function-with-context scenario not only provides a description of the target function, but also provides the calling code, time and space constraints, and input and output description of the target function.
Experimental results and conclusions
Each language version uses two Prompt strategies (Function -Only and Function-with-Context). To align with human code testing scenarios, evaluation metrics include AC@1, AC@all and AC rate.
#The code generation task results of each model are shown in the following two tables. (Best performance: bold; second best performance: underline.)
The following conclusions can be drawn:
- GPT3.5-turbo performs better than the other 11 LLMs with an average score More than 50%.
- WizardCoder and StarCoder ranked second and third, highlighting significant improvements in code generation capabilities through code-based fine-tuning.
- In the code generation task, there is no obvious performance difference between the currently tested models on Chinese and English question types.
Additionally, CodeApex provides the proportion of compileable code in each scenario. After connecting the generated function to the main function, the compiled code is checked through test cases.
You can see:
- Most models are able to generate more than 50% of the Compile the code, which demonstrates LLMs' ability to understand function prototypes.
- Often, providing contextual information about a function can help LLMs generate compilable code.
Conclusion
CodeApex serves as a bilingual benchmark focusing on LLMs’ programming abilities, evaluating programming understanding and code generation of large language models. ability. In terms of programming understanding, CodeApex assessed the abilities of different models in three categories of multiple-choice questions. In terms of code generation, CodeApex uses the pass rate of test code cases to evaluate the model's capabilities. For these two tasks, CodeApex carefully designed Prompt strategies and compared them in different scenarios. CodeApex is experimentally evaluated on 14 LLMs, including general LLMs and specialized LLMs models based on code fine-tuning.
Currently, GPT3.5 has reached a relatively good level in terms of programming capabilities, achieving approximately 50% and 56% accuracy in programming understanding and code generation tasks respectively. CodeApex shows that the potential of large language models for programming tasks has not yet been fully exploited. We expect that leveraging large language models to generate code will revolutionize the field of software development in the near future. As natural language processing and machine learning advance, these models will become more powerful and adept at understanding and generating code snippets. Developers will find they have an unprecedented ally in their coding efforts, as they can rely on these models to automate tedious tasks, increase their productivity, and improve software quality.
In the future, CodeApex will release more tests (such as code correction) for testing the code capabilities of large language models. CodeApex’s test data will also continue to be updated, adding more diverse Code issues. At the same time, human experiments will also be added to the CodeApex list to compare the coding capabilities of large language models with human levels. CodeApex provides a benchmark and reference for the research on large language model programming capabilities, and will promote the development and prosperity of large language models in the code field.
Introduction to APEX Laboratory
Shanghai Jiao Tong University APEX Data and Knowledge Management Laboratory was established in 1996. Its founder is Tou Yu, the head teacher of the ACM class Professor Yong. The laboratory is committed to exploring artificial intelligence technology that effectively mines and manages data and summarizes knowledge. It has published more than 500 international academic papers and pursues practical applications in practical scenarios. Over the past 27 years, APEX Laboratory has become a global pioneer in many world technology waves. The laboratory began to study the core technology of the Semantic Web (now known as the Knowledge Graph) in 2000, and began to study personalized search engines and recommendations in 2003. System technology, began to study transfer learning theory and algorithm in 2006, began to explore deep learning technology in 2009 and developed neural network training library based on GPU. While producing fruitful scientific research and implementation results, APEX Lab has also developed a solid data science and machine learning research team, including Xue Guirong, Zhang Lei, Lin Chenxi, Liu Guangcan, Wang Haofen, Li Lei, Dai Wenyuan, Li Zhenhui, Chen Tianqi, Zhang Weinan, Yang Diyi and other outstanding alumni in the field of artificial intelligence.
The above is the detailed content of Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics



0.What does this article do? We propose DepthFM: a versatile and fast state-of-the-art generative monocular depth estimation model. In addition to traditional depth estimation tasks, DepthFM also demonstrates state-of-the-art capabilities in downstream tasks such as depth inpainting. DepthFM is efficient and can synthesize depth maps within a few inference steps. Let’s read about this work together ~ 1. Paper information title: DepthFM: FastMonocularDepthEstimationwithFlowMatching Author: MingGui, JohannesS.Fischer, UlrichPrestel, PingchuanMa, Dmytr

Imagine an artificial intelligence model that not only has the ability to surpass traditional computing, but also achieves more efficient performance at a lower cost. This is not science fiction, DeepSeek-V2[1], the world’s most powerful open source MoE model is here. DeepSeek-V2 is a powerful mixture of experts (MoE) language model with the characteristics of economical training and efficient inference. It consists of 236B parameters, 21B of which are used to activate each marker. Compared with DeepSeek67B, DeepSeek-V2 has stronger performance, while saving 42.5% of training costs, reducing KV cache by 93.3%, and increasing the maximum generation throughput to 5.76 times. DeepSeek is a company exploring general artificial intelligence

AI is indeed changing mathematics. Recently, Tao Zhexuan, who has been paying close attention to this issue, forwarded the latest issue of "Bulletin of the American Mathematical Society" (Bulletin of the American Mathematical Society). Focusing on the topic "Will machines change mathematics?", many mathematicians expressed their opinions. The whole process was full of sparks, hardcore and exciting. The author has a strong lineup, including Fields Medal winner Akshay Venkatesh, Chinese mathematician Zheng Lejun, NYU computer scientist Ernest Davis and many other well-known scholars in the industry. The world of AI has changed dramatically. You know, many of these articles were submitted a year ago.

Facing lag, slow mobile data connection on iPhone? Typically, the strength of cellular internet on your phone depends on several factors such as region, cellular network type, roaming type, etc. There are some things you can do to get a faster, more reliable cellular Internet connection. Fix 1 – Force Restart iPhone Sometimes, force restarting your device just resets a lot of things, including the cellular connection. Step 1 – Just press the volume up key once and release. Next, press the Volume Down key and release it again. Step 2 – The next part of the process is to hold the button on the right side. Let the iPhone finish restarting. Enable cellular data and check network speed. Check again Fix 2 – Change data mode While 5G offers better network speeds, it works better when the signal is weaker

Boston Dynamics Atlas officially enters the era of electric robots! Yesterday, the hydraulic Atlas just "tearfully" withdrew from the stage of history. Today, Boston Dynamics announced that the electric Atlas is on the job. It seems that in the field of commercial humanoid robots, Boston Dynamics is determined to compete with Tesla. After the new video was released, it had already been viewed by more than one million people in just ten hours. The old people leave and new roles appear. This is a historical necessity. There is no doubt that this year is the explosive year of humanoid robots. Netizens commented: The advancement of robots has made this year's opening ceremony look like a human, and the degree of freedom is far greater than that of humans. But is this really not a horror movie? At the beginning of the video, Atlas is lying calmly on the ground, seemingly on his back. What follows is jaw-dropping

Earlier this month, researchers from MIT and other institutions proposed a very promising alternative to MLP - KAN. KAN outperforms MLP in terms of accuracy and interpretability. And it can outperform MLP running with a larger number of parameters with a very small number of parameters. For example, the authors stated that they used KAN to reproduce DeepMind's results with a smaller network and a higher degree of automation. Specifically, DeepMind's MLP has about 300,000 parameters, while KAN only has about 200 parameters. KAN has a strong mathematical foundation like MLP. MLP is based on the universal approximation theorem, while KAN is based on the Kolmogorov-Arnold representation theorem. As shown in the figure below, KAN has

I cry to death. The world is madly building big models. The data on the Internet is not enough. It is not enough at all. The training model looks like "The Hunger Games", and AI researchers around the world are worrying about how to feed these data voracious eaters. This problem is particularly prominent in multi-modal tasks. At a time when nothing could be done, a start-up team from the Department of Renmin University of China used its own new model to become the first in China to make "model-generated data feed itself" a reality. Moreover, it is a two-pronged approach on the understanding side and the generation side. Both sides can generate high-quality, multi-modal new data and provide data feedback to the model itself. What is a model? Awaker 1.0, a large multi-modal model that just appeared on the Zhongguancun Forum. Who is the team? Sophon engine. Founded by Gao Yizhao, a doctoral student at Renmin University’s Hillhouse School of Artificial Intelligence.

The latest video of Tesla's robot Optimus is released, and it can already work in the factory. At normal speed, it sorts batteries (Tesla's 4680 batteries) like this: The official also released what it looks like at 20x speed - on a small "workstation", picking and picking and picking: This time it is released One of the highlights of the video is that Optimus completes this work in the factory, completely autonomously, without human intervention throughout the process. And from the perspective of Optimus, it can also pick up and place the crooked battery, focusing on automatic error correction: Regarding Optimus's hand, NVIDIA scientist Jim Fan gave a high evaluation: Optimus's hand is the world's five-fingered robot. One of the most dexterous. Its hands are not only tactile
