Table of Contents
Are we looking for AI gold in the wrong places?
Probability vs. Accuracy
Applying Reinforcement Learning to Software

Are large language models wrong for coding?

Jun 05, 2023, 12:34 PM

Reinforcement learning models beat generative AI when the goal is accuracy, consistency, game mastery, or finding one correct answer.

Large language models such as GPT-4 are impressive because they can generate high-quality, fluent, natural text that is extremely convincing. Unfortunately, so is the hype: Microsoft researchers breathlessly describe the Microsoft-funded OpenAI GPT-4 model as demonstrating "sparks of artificial general intelligence."

Unless Microsoft is referring to a tendency to hallucinate, that is, to confidently generate text that is simply wrong, the claim does not hold up. GPT is not good at playing games such as chess and Go, it is weak at mathematics, and the code it writes may contain errors and subtle flaws.

This does not mean that large language models are all hype. We need new ways to discuss generative AI (GenAI) without exaggerating its differences from other technologies.

As detailed in an IEEE Spectrum article, some experts, such as OpenAI's Ilya Sutskever, believe that adding reinforcement learning with human feedback can eliminate LLM hallucinations. Others, like Meta's Yann LeCun and Geoff Hinton (recently retired from Google), think a more fundamental flaw in large language models is at work. Both argue that large language models lack the non-linguistic knowledge that is crucial for understanding the underlying reality that language describes.

Diffblue CEO Mathew Lodge argued in an interview that there is a better option. "Small, fast, and cheap-to-run reinforcement learning models handily beat massive large language models with hundreds of billions of parameters at all kinds of tasks, from playing games to writing code," he said.

Are we looking for AI gold in the wrong places?

Lodge's point is that generative AI certainly has its uses, but perhaps we are trying to force it into areas for which it is not well suited, areas where reinforcement learning excels. Take games, for example.

Levy Rozman, a chess International Master, posted a video of himself playing against ChatGPT. The model made a series of ridiculous and illegal moves, including capturing its own pieces. The best open-source chess software (Stockfish, which does not use neural networks at all) beat ChatGPT in fewer than 10 moves, because the large language model could not find legal moves. Large language models fall far short of the claims of artificial general intelligence, and this is not an isolated example.

Thanks to its reinforcement learning algorithms, Google's AlphaGo is currently the best-performing Go AI. Reinforcement learning works by generating candidate solutions to a problem, trying them, using the results to improve the next candidate, and then repeating the process thousands of times to home in on the best result.

In AlphaGo's case, the AI tries different moves and predicts whether each is a good move and whether it is likely to win the game from that position. It uses the feedback to follow promising sequences of moves and to generate other possible moves. The effect is a search over possible moves.

The process is called probabilistic search. You cannot try every move, because there are far too many, but you can search patiently in the regions where the best moves are likely to be found. This works brilliantly for games: AlphaGo has beaten Go world champions. AlphaGo is not infallible, but it performs better at Go today than the best large language models do.
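The generate, try, improve, repeat loop described above can be sketched in a few lines. Everything here is illustrative: the reward function is a made-up stand-in for a learned value estimate like AlphaGo's, and a "move" is just a number.

```python
import random

def reward(move: float) -> float:
    """Toy stand-in for a value network: scores a candidate move.
    Peaks at move == 3.0, so the search should converge near there."""
    return -(move - 3.0) ** 2

def probabilistic_search(iterations: int = 2000, seed: int = 0) -> float:
    """Repeatedly propose variations of the best move found so far,
    keep whichever scores higher, and gradually narrow the search."""
    rng = random.Random(seed)
    best = rng.uniform(-10, 10)
    radius = 5.0
    for _ in range(iterations):
        candidate = best + rng.uniform(-radius, radius)
        if reward(candidate) > reward(best):
            best = candidate                    # feedback: follow the promising line
        radius = max(radius * 0.995, 0.01)      # focus on promising regions
    return best

print(round(probabilistic_search(), 2))  # converges near 3.0
```

The key property is the one Lodge highlights: the loop is goal-seeking, using each result to improve the next proposal rather than emitting a single answer and stopping.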

Probability vs. Accuracy

Proponents respond that even if large language models demonstrably lag behind other kinds of AI at such tasks, they keep getting better. Lodge counters that if we are to accept this argument, we need to understand why they would get better at this kind of task. The difficulty, he continued, is that no one can predict exactly how GPT-4 will respond to a specific prompt; the model is beyond human explanation. This, he believes, is why "prompt engineering is not really a thing." He stresses that AI researchers also struggle to prove that "emergent properties" of large language models exist at all, let alone predict them.

The best argument available is an inductive one: GPT-4 is better than GPT-3 at some language tasks because it is larger, therefore an even larger model will be better still.

Lodge's view is that GPT-4 still struggles with the same challenges that GPT-3 did, so there is a problem with that argument. One of those challenges is math: while GPT-4 is better than GPT-3 at addition, it still stumbles over multiplication and other mathematical operations.

Making language models bigger does not magically solve these problems, and even according to OpenAI, larger models are not the answer. The reason comes down to the fundamental nature of large language models, as an OpenAI forum post explains: "Large language models are probabilistic in nature and operate by generating likely outputs based on patterns they have observed in the training data. In the case of mathematics and physics problems, the likelihood of finding a single correct answer is slim."
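One way to see why a single correct answer is unlikely: the probability of emitting one exact token sequence is the product of the per-token probabilities, which shrinks as the answer gets longer. The 90% figure below is invented purely for illustration.

```python
# Probability of generating one exact token sequence is the product of the
# per-token probabilities, so even a confident model drifts away from a
# unique correct answer as it gets longer. The 0.9 figure is invented.
per_token_prob = 0.9  # assume a 90% chance of the right token at each step
for length in (1, 5, 10, 20):
    print(length, round(per_token_prob ** length, 3))
```

Even under this generous assumption, a 20-token answer is produced exactly right well under a quarter of the time, which is fatal for problems with one correct answer.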

Reinforcement-learning-driven approaches, by contrast, can pursue accuracy because the process is goal-seeking: reinforcement learning iterates toward the answer that comes closest to the goal. Lodge points out that large language models "are not designed to iterate or to seek goals. They are designed to give a 'good enough' answer in one or a few shots."

A "one-shot" answer is the first answer the model produces, obtained by predicting a sequence of words from the prompt. "Few-shot learning" involves giving the model additional examples or hints to help it generate a better prediction. Large language models also usually add some randomness (that is, they are stochastic) to improve the odds of a better answer, so they will give different answers to the same question.
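The stochastic sampling just described can be illustrated with a toy next-token sampler. The vocabulary and scores are invented for illustration; real models rank tens of thousands of tokens.

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float,
                      rng: random.Random) -> str:
    """Softmax sampling with temperature: higher temperature flattens the
    distribution, making less likely tokens easier to draw."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())                     # subtract peak for stability
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    r = rng.random() * total
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # numerical edge case: fall back to the last token

# Invented scores for completions of "Wayne Gretzky likes ice ..."
logits = {"hockey": 4.0, "skating": 3.0, "cream": 2.0, "fishing": 1.0}
rng = random.Random(42)
print([sample_next_token(logits, temperature=1.0, rng=rng) for _ in range(5)])
```

At a temperature near zero the sampler becomes effectively greedy and always picks "hockey"; raising the temperature is what makes the model give different answers to the same question.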

It is not that the large language model world ignores reinforcement learning. GPT-4 incorporates "reinforcement learning from human feedback" (RLHF): the core model is tuned by human operators to favor certain answers, but that does not fundamentally change the answers the model generates in the first place. For example, Lodge noted, a large language model might offer the following options to complete the sentence "Wayne Gretzky likes ice...":

1. Wayne Gretzky likes ice cream.

2. Wayne Gretzky likes ice hockey.

3. Wayne Gretzky likes ice fishing.

4. Wayne Gretzky likes skating.

5. Wayne Gretzky likes ice wine.

Human operators rank the answers, and will probably decide that the legendary Canadian hockey player prefers ice hockey and skating, despite the broad appeal of ice cream. The human rankings, along with additional human-written responses, are used to train the model. Note that GPT-4 does not pretend to know Wayne Gretzky's preferences accurately, only to produce the best possible completion when prompted.
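A minimal sketch of how ranked answers can become training signal: a toy "reward model" nudges per-answer scores so that higher-ranked completions end up scoring above lower-ranked ones, using a logistic pairwise update in the style of a Bradley-Terry model. Real RLHF trains a neural reward model and then fine-tunes the LLM against it; none of this is OpenAI's actual code.

```python
import math

def train_reward_scores(ranking: list[str], epochs: int = 200,
                        lr: float = 0.1) -> dict[str, float]:
    """Learn a score per answer from a human ranking (best first), using a
    logistic pairwise-preference update (Bradley-Terry style)."""
    scores = {answer: 0.0 for answer in ranking}
    for _ in range(epochs):
        for i, preferred in enumerate(ranking):
            for rejected in ranking[i + 1:]:
                # p = current belief that `preferred` beats `rejected`
                p = 1 / (1 + math.exp(scores[rejected] - scores[preferred]))
                scores[preferred] += lr * (1 - p)  # push the winner up
                scores[rejected] -= lr * (1 - p)   # push the loser down
    return scores

human_ranking = ["ice hockey", "skating", "ice cream", "ice fishing", "ice wine"]
scores = train_reward_scores(human_ranking)
print(scores)
```

After training, the learned scores reproduce the human ordering, which is exactly the signal RLHF uses to bias the base model toward preferred answers without changing how it generates them.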

Finally, large language models are not designed to be highly accurate or consistent; they trade accuracy and deterministic behavior for generality. For Lodge, all of this means that reinforcement learning beats generative AI when the job is to apply AI accurately at scale.

Applying Reinforcement Learning to Software

What about software development? As I write, GenAI already offers developers the chance to boost their productivity with tools such as GitHub Copilot and Amazon CodeWhisperer. This is not speculation; it is already happening. These tools predict the code that is likely to come next, based on the code before and after the insertion point in an integrated development environment.

Indeed, as David Ramel of Visual Studio Magazine reports, the latest version of Copilot already generates 61% of the Java code written with it. For those worried that this will shrink the work of software developers, remember that these tools require diligent human supervision to check the completions and edit them so that the code compiles and runs correctly. Autocompletion has been a staple of IDEs since their earliest days, and Copilot and other code generators make it far more useful. Autonomous coding at scale is something different: a 61% completion rate still leaves a great deal of code for humans to write.

Reinforcement learning, however, enables accurate autonomous coding at scale, Lodge says. Of course, he has a vested interest in saying so: in 2019 his company Diffblue released Cover, a commercial unit-test-writing tool based on reinforcement learning. Cover writes complete unit test suites without human intervention, making it possible to automate complex, error-prone tasks at scale.

Is Lodge biased? Absolutely. But he also has plenty of experience to back his belief that reinforcement learning outperforms GenAI in software development. Today, Diffblue uses reinforcement learning to search the space of all possible test methods, automatically writes test code for each method, and selects the best of the tests it has written. Its reinforcement learning reward function is based on a variety of criteria, including test coverage and aesthetics, one element of which is conformance to human-written coding style. The tool creates tests for each method in about one second, on average.
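Diffblue's actual system is proprietary, but the search-and-score idea behind it can be sketched: generate candidate test inputs, score each candidate with a reward based on coverage, and keep the best. The function under test, the candidate generator, and the reward weights below are all invented for illustration.

```python
import random

def branches_covered(inputs: list[int]) -> set[str]:
    """Toy 'function under test' with three branches; returns which
    branches a given set of test inputs exercises."""
    covered = set()
    for n in inputs:
        if n < 0:
            covered.add("negative")
        elif n == 0:
            covered.add("zero")
        else:
            covered.add("positive")
    return covered

def search_for_test_inputs(rounds: int = 200, seed: int = 0) -> list[int]:
    """Reinforcement-style search: propose candidate test inputs,
    reward = branch coverage minus a small size penalty, keep the best."""
    rng = random.Random(seed)
    best, best_reward = [], -1.0
    for _ in range(rounds):
        candidate = [rng.randint(-5, 5) for _ in range(3)]
        score = len(branches_covered(candidate)) - 0.01 * len(candidate)
        if score > best_reward:
            best, best_reward = candidate, score
    return best

inputs = search_for_test_inputs()
print(inputs, branches_covered(inputs))
```

The reward function is the crucial design choice: because it is explicit and measurable (coverage, style), the search can iterate toward it without any human in the loop, which is what makes this approach scale.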

Lodge believes that if the goal is to automatically write 10,000 unit tests for a program that no single person understands, reinforcement learning is the only real option. "Large language models cannot compete; humans have no way to effectively supervise them and correct their code at that scale, and making the models larger and more complex does not solve the problem."

Conclusion: the greatest strength of large language models is that they are general-purpose language processors. They can perform language tasks they were not explicitly trained for, which means they can do a great job at content generation (copywriting) and many other things. "But that does not make large language models a substitute for AI models, often based on reinforcement learning, that are more accurate, more consistent, and usable at scale," Lodge emphasized.
