A New Apple Study Shows AI Reasoning Has Critical Flaws
It’s no surprise that AI doesn’t always get things right. Occasionally, it even hallucinates. However, a recent study by Apple researchers points to more significant flaws in how AI models handle formal mathematical reasoning.
As part of the study, Apple scientists asked a large language model (LLM) the same question multiple times, in slightly varying ways, and were astounded to find that the LLM’s answers varied unexpectedly. These variations were most prominent when numbers were involved.
Apple's Study Suggests Big Problems With AI's Reliability

The research, published on arxiv.org, concluded there was “significant performance variability across different instantiations of the same question, challenging the reliability of current GSM8K results that rely on single point accuracy metrics.” GSM8K is a dataset that includes over 8000 diverse grade-school math questions and answers.
Apple researchers found that performance could vary by as much as 10%, and that even slight variations in prompts can seriously undermine the reliability of an LLM’s answers.
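To make the concern about single-point accuracy concrete, here is a minimal Python sketch of how accuracy can be measured across several instantiations of the same question rather than once. Everything in it is an assumption for illustration: the question template, the names, the number ranges, and the two toy "solvers" are hypothetical stand-ins, not the researchers' benchmark or a real LLM.

```python
import random
import re
from statistics import mean

# Hypothetical question template; names and numbers are swapped per variant.
TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "How many apples does {name} have?")

def make_variant(rng):
    """Return (question, correct_answer) for one instantiation of the template."""
    name = rng.choice(["Ava", "Ben", "Chloe", "Dev"])
    a, b = rng.randint(2, 60), rng.randint(2, 60)
    return TEMPLATE.format(name=name, a=a, b=b), a + b

def accuracy(solver, variants):
    """Fraction of variants the solver answers correctly."""
    return mean(1.0 if solver(q) == ans else 0.0 for q, ans in variants)

def accuracy_spread(solver, n_sets=5, set_size=20, seed=0):
    """Score the solver on several independently generated variant sets
    and return the (lowest, highest) accuracy observed."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_sets):
        variants = [make_variant(rng) for _ in range(set_size)]
        scores.append(accuracy(solver, variants))
    return min(scores), max(scores)

def exact_solver(question):
    """Toy stand-in that actually adds the numbers in the question."""
    return sum(int(n) for n in re.findall(r"\d+", question))

def brittle_solver(question):
    """Toy stand-in that only 'knows' small numbers (a made-up failure mode)."""
    numbers = [int(n) for n in re.findall(r"\d+", question)]
    return sum(numbers) if all(n < 40 for n in numbers) else -1

if __name__ == "__main__":
    print("exact solver spread:  ", accuracy_spread(exact_solver))
    print("brittle solver spread:", accuracy_spread(brittle_solver))
```

A solver that genuinely computes the answer scores the same on every set, while one that depends on surface details can score differently from set to set, which is why a single accuracy number can be misleading.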
In other words, you might want to fact-check your answers anytime you use something like ChatGPT. That's because, while it may sometimes look like AI is using logic to give you answers to your inquiries, logic isn’t what’s being used.
AI, instead, relies on pattern recognition to provide responses to prompts. However, the Apple study shows how changing even a few unimportant words can alter that pattern recognition.
One example of this variance involved a word problem about collecting kiwis over several days. Apple researchers first ran a control version of the problem, then added some inconsequential information about kiwi size.
Both Meta and OpenAI Models Showed Issues

Meta’s Llama and OpenAI’s o1 then gave answers that differed from the control, even though the kiwi size data had no bearing on the problem’s outcome. OpenAI’s GPT-4o also performed worse when tiny variations were introduced in the data given to the LLM.
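The arithmetic behind this kind of problem is simple, which is what makes the failure notable. The short Python sketch below uses made-up numbers (they are not taken from the article or the study) to show why a remark about kiwi size should not change the total. The "distracted" calculation is only one plausible way the extra detail could be misused; the article does not say how the models' answers changed.

```python
# Illustrative numbers only; they are NOT taken from the article or the study.
picked_per_day = {"Friday": 40, "Saturday": 55, "Sunday": 80}

# Distractor detail: some kiwis were smaller than average.
# Size describes the fruit, not how many were picked.
smaller_than_average = 5

def correct_total(picked):
    """The answer depends only on how many kiwis were picked each day."""
    return sum(picked.values())

def distracted_total(picked, distractor):
    """Hypothetical error: treating the size remark as kiwis to subtract."""
    return sum(picked.values()) - distractor

if __name__ == "__main__":
    print("correct:", correct_total(picked_per_day))
    print("distracted:", distracted_total(picked_per_day, smaller_than_average))
```

With these example values the correct total is 175, and it stays 175 whether or not the size remark is included in the question.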
Since LLMs are becoming more prominent in our culture, this news raises serious concern about whether we can trust AI to provide accurate answers to our inquiries, especially on matters like financial advice. It also reinforces the need to verify the information you receive when using large language models.
That means you'll want to do some critical thinking and due diligence instead of blindly relying on AI. Then again, if you're someone who uses AI regularly, you probably already knew that.