


Study shows large language models have problems with logical reasoning
Translator | Li Rui
Reviewer | Sun Shujuan
Before claims of sentient chatbots became a hot topic, large language models (LLMs) had already been generating both excitement and worry. In recent years, LLMs, deep learning models trained on very large amounts of text, have performed impressively on several benchmarks that measure language understanding.
Large language models such as GPT-3 and LaMDA manage to stay coherent across long stretches of text. They appear knowledgeable about many topics and remain consistent throughout lengthy conversations. They have become so convincing that some people attribute personality, and even higher forms of intelligence, to them.
But can LLMs perform logical reasoning like humans? According to a research paper published by UCLA scientists, Transformers, the deep learning architecture used in LLMs, do not learn to emulate reasoning functions. Instead, they find clever ways to learn the statistical features inherent in reasoning problems.
The researchers tested BERT, a popular Transformer architecture, on a confined problem space. Their results show that BERT can accurately answer inference problems on examples drawn from the distribution of its training data, but cannot generalize to examples drawn from other distributions over the same problem space.
These tests also highlight some of the shortcomings of deep neural networks, and of the benchmarks used to evaluate them.
1. How do we measure logical reasoning in artificial intelligence?
There are several benchmarks that target natural language processing and understanding in AI systems, including GLUE, SuperGLUE, SNLI, and SQuAD. As Transformers have grown larger and been trained on ever-larger datasets, they have steadily improved on these benchmarks.
It's worth noting that performance on these benchmarks is often compared to human intelligence, and human performance on them is closely tied to common sense and logical reasoning. But it remains unclear whether large language models improve because they acquire logical reasoning capabilities or simply because they have been exposed to enormous amounts of text.
To test this, the UCLA researchers developed SimpleLogic, a class of logical reasoning problems based on propositional logic. To make sure the model's reasoning capabilities were tested rigorously, they eliminated linguistic variance by using templated language structures. A SimpleLogic problem consists of a set of facts, rules, a query, and a label. Facts are predicates that are known to be true. Rules are conditional statements: if all of a rule's premises hold, its conclusion holds. The query is the question the machine learning model must answer, and the label is the answer to the query, either "true" or "false". SimpleLogic problems are compiled into contiguous text strings that contain the tokens and delimiters a language model expects during training and inference.
(Figure: questions asked in the SimpleLogic format)

One characteristic of SimpleLogic is that its problems are self-contained and require no prior knowledge. This is especially important because, as many scientists point out, humans omit shared knowledge when they speak, which is why language models often stumble over questions about basic world knowledge that everyone takes for granted. SimpleLogic, by contrast, provides everything needed to solve the problem. Anyone looking at a problem posed in the SimpleLogic format should be able to infer its rules and handle new examples, regardless of their background knowledge.
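To make the format concrete, here is a minimal sketch of what a SimpleLogic-style problem could look like in code. The predicate names, delimiters, and string encoding are illustrative assumptions, not the paper's literal format.

```python
# A hypothetical SimpleLogic-style problem. Predicate names and
# delimiters below are illustrative, not the paper's exact encoding.
facts = ["A", "B"]                      # predicates known to be true
rules = [(["A"], "C"),                  # if A then C
         (["B", "C"], "D")]             # if B and C then D
query = "D"                             # the question the model must answer
label = True                            # ground-truth answer to the query

def encode(facts, rules, query):
    """Compile a problem into one contiguous text string, so that a
    language model can consume it as ordinary input."""
    rule_strs = ["if " + " and ".join(body) + " then " + head
                 for body, head in rules]
    return ("facts : " + " , ".join(facts) +
            " . rules : " + " ; ".join(rule_strs) +
            " . query : " + query)

print(encode(facts, rules, query))
# facts : A , B . rules : if A then C ; if B and C then D . query : D
```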
2. Statistical features and logical inference
The researchers proved that the problem space of SimpleLogic can be represented by an inference function. They further showed that BERT has enough capacity to represent that function: they manually constructed a set of parameters for the model that correctly solves every problem in SimpleLogic.
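For this problem class, the ground-truth inference function is ordinary forward chaining over propositional rules. A minimal sketch, reusing the hypothetical problem representation from the example above:

```python
def forward_chain(facts, rules, query):
    """Ground-truth reasoning function for SimpleLogic-style problems:
    repeatedly fire any rule whose premises are all proven, until nothing
    new can be derived, then check whether the query was proven."""
    proven = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in proven and all(p in proven for p in body):
                proven.add(head)
                changed = True
    return query in proven

# A model that truly solves SimpleLogic must compute exactly this function.
assert forward_chain(["A", "B"], [(["A"], "C"), (["B", "C"], "D")], "D")
```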
However, when they trained BERT on datasets of SimpleLogic examples, the model was unable to learn the inference function on its own. It achieved near-perfect accuracy on one data distribution, yet failed to generalize to other distributions within the same problem space. This happened even though the training dataset covered the entire problem space and all of the distributions were produced by the same inference function.
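The setup itself is standard binary sequence classification. Below is a minimal sketch using the Hugging Face transformers library; the checkpoint, hyperparameters, and training loop are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: fine-tune BERT to classify encoded SimpleLogic strings
# as true/false. Hyperparameters here are illustrative assumptions.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
model.train()  # from_pretrained returns the model in eval mode
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(texts, labels):
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# Train on examples sampled from one distribution over the problem space,
# then evaluate on a differently sampled distribution: in-distribution
# accuracy approaches 100%, while cross-distribution accuracy drops sharply.
```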
(Figure: the capacity of the BERT Transformer model is sufficient to represent SimpleLogic's inference function)
(Note: this is different from the out-of-distribution generalization challenge, which applies to open problem spaces. When a model cannot generalize to OOD data, its performance drops significantly on data that falls outside the distribution of its training set.)
The researchers wrote: "After further investigation, we provide an explanation for this paradox: a model that achieves high accuracy only on distributed test examples has not learned to reason. In fact, the model has learned to reason on logical reasoning problems use statistical features to make predictions, rather than simulating correct inference functions."
This finding highlights an important challenge in applying deep learning to language tasks. Neural networks are very good at discovering and fitting statistical features. In some applications this is very useful; in sentiment analysis, for example, there are strong correlations between certain words and sentiment categories.
For logical reasoning tasks, however, the model should find and learn the underlying reasoning function, even when statistical shortcuts are present in the data.
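As an illustration of what such a shortcut can look like: suppose that, under the sampled training distribution, problems with more rules happen to be labeled "true" more often (the paper identifies spurious correlations of this kind). A trivial classifier that performs no inference at all can then beat chance. The specific statistic and threshold below are assumptions for the sketch:

```python
# Illustrative shortcut: predict the label from a surface statistic alone.
# The statistic (rule count) and the threshold are assumed for this sketch.
def shortcut_predict(problem):
    return len(problem["rules"]) > 5    # no reasoning performed at all

# On a distribution where rule count correlates with the label, this
# "model" scores well above 50%, yet it collapses on any distribution
# where the correlation is weaker, absent, or reversed.
```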
The researchers wrote: "We should be cautious when attempting to train neural models end-to-end to solve NLP tasks that involve both logical reasoning and prior knowledge, and that present linguistic variance." They emphasized that the challenges posed by SimpleLogic only get harder in the real world, where the large amounts of information that LLMs require are often simply not contained in the data.
The researchers observed that when they removed a statistical feature from the training data set, the performance of the language model improved on other distributions of the same problem space. The problem, however, is that discovering and removing multiple statistical features is easier said than done. As the researchers point out in their paper, "Such statistical features can be numerous and extremely complex, making them difficult to remove from training data."
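One way to remove a single known statistical feature is to rebalance the training set so that the feature no longer predicts the label. A minimal sketch follows, where both the feature and the balancing scheme are assumptions:

```python
import random
from collections import defaultdict

def rebalance(dataset, feature):
    """Downsample so that every value of `feature` is paired with a 50/50
    label split; the feature then carries no information about the label."""
    buckets = defaultdict(list)
    for ex in dataset:
        buckets[(feature(ex), ex["label"])].append(ex)
    balanced = []
    for value in {feature(ex) for ex in dataset}:
        pos, neg = buckets[(value, True)], buckets[(value, False)]
        n = min(len(pos), len(neg))
        balanced += random.sample(pos, n) + random.sample(neg, n)
    return balanced

# e.g., neutralize the rule-count shortcut from the previous sketch:
# balanced = rebalance(train_set, feature=lambda ex: len(ex["rules"]))
```

As the lead-in to the researchers' point suggests, this only works one feature at a time, and only for features you have already identified.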
3. Inference in Deep Learning
Unfortunately, the logical reasoning problem does not disappear as language models grow larger; it merely gets hidden inside enormous architectures and very large training corpora. LLMs can describe facts and stitch sentences together very well, but when it comes to logical reasoning, they still rely on statistical features, and that is not a solid foundation. Moreover, there is no indication that adding more layers, parameters, and attention heads to Transformers will close the logical reasoning gap.
This paper is consistent with other work showing the limitations of neural networks in learning logical rules, such as the Game of Life or abstract reasoning over visual data. It highlights one of the main challenges facing current language models. As the UCLA researchers put it: "On the one hand, when a model is trained to learn a task from data, it always tends to learn statistical patterns, which inherently exist in reasoning examples; on the other hand, logical rules never rely on statistical patterns to conduct reasoning. Since it is difficult to construct a logical reasoning dataset that contains no statistical features, learning to reason from data is difficult."
Original link: https://bdtechtalks.com/2022/06/27/large-language-models-logical-reasoning/