Can't the NLP model read human language? Microsoft AdaTest makes fault-finding five times more efficient

WBOY

Apr 09, 2023, 04:11 PM

ai Microsoft Research

Natural language processing (NLP) models that misread human language, sometimes interpreting a text as meaning the opposite of what it says, are a chronic problem in the industry. Microsoft now says it has developed a solution.

Microsoft develops AdaTest method to test NLP models

The emergence of large-scale models that can serve as the foundation for a wide range of applications, also known as platform models, has greatly improved AI's ability to process natural language. But natural language processing (NLP) models are still far from perfect, and their flaws are sometimes exposed in embarrassing ways.

For example, one top commercial model translates the Portuguese for "I do not recommend this dish" into the English "I highly recommend this dish."

These failures persist in part because finding and fixing bugs in NLP models is so difficult; serious bugs affect nearly all major open-source and commercial NLP models. There are currently two approaches to finding and fixing NLP model errors: user-driven and automatic.

The user-driven approach is flexible and can test any aspect of an NLP model's behavior. But it depends on people's highly variable ability to imagine and identify errors, and it is so labor-intensive that in practice only a small fraction of possible inputs can be tested.

Automatic approaches, on the other hand, are fast and can therefore cover a much larger portion of the input space. But without human guidance, they can only check whether a model is right or wrong in very limited circumstances, such as when its predictions become inconsistent after the input wording changes slightly.
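A minimal sketch of the kind of automatic consistency check described above. The keyword-based `toy_sentiment` classifier and the `SWAPS` table of near-synonyms are assumptions made up for illustration; they are not part of AdaTest or of any model named in the article.

```python
from typing import Callable, Dict, List, Tuple

def toy_sentiment(text: str) -> str:
    """Hypothetical model under test: a naive keyword classifier."""
    lowered = text.lower()
    if "great" in lowered or "love" in lowered:
        return "positive"
    if "bad" in lowered or "hate" in lowered:
        return "negative"
    return "neutral"

# Illustrative wording changes that should not alter the sentiment.
SWAPS: Dict[str, str] = {"great": "wonderful", "movie": "film"}

def perturb(text: str) -> List[str]:
    """Return variants of `text` with one word swapped for a near-synonym."""
    words = text.split()
    variants = []
    for i, word in enumerate(words):
        if word in SWAPS:
            variants.append(" ".join(words[:i] + [SWAPS[word]] + words[i + 1:]))
    return variants

def inconsistent_pairs(
    model: Callable[[str], str], texts: List[str]
) -> List[Tuple[str, str]]:
    """Collect (original, variant) pairs where the prediction flips."""
    failures = []
    for text in texts:
        base = model(text)
        for variant in perturb(text):
            if model(variant) != base:
                failures.append((text, variant))
    return failures

failures = inconsistent_pairs(toy_sentiment, ["This movie is great"])
print(failures)
```

The check needs no human labels, which is why it scales, but it can only say that two predictions disagree, not which behavior the user actually wants.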

Microsoft researchers believe that modern large language models (LLMs) such as GPT-3 give the industry an opportunity to combine the advantages of user-driven and automatic approaches: the user defines what the model under test should do, while the generative capabilities of an LLM are used to produce tests at scale within specific categories of model behavior.

Microsoft researchers call this form of human-machine collaboration "adaptive testing and debugging," or AdaTest for short. With AdaTest, a large language model takes on the heavy burden of generating a large number of tests aimed at errors in the model under test.

Humans steer the language model's generation by selecting valid tests and organizing them into semantically related topics. This guidance greatly improves the language model's generation performance and directs it toward the target domain.

Because these tests are in effect a form of labeled data, they can be used not only to identify errors in NLP models but also to fix them, in an iterative debugging cycle similar to traditional software development.

AdaTest offers significant efficiency gains for professional users while remaining simple enough for people without a programming background to use effectively. This means both professional and ordinary users can better understand and control an NLP model's behavior across a range of scenarios, which makes AI systems not only perform better but also respond more effectively to user needs.

Discover vulnerabilities with test loops

AdaTest consists of an internal test loop and an external debugging loop. The former is used to find errors and the latter to fix them.

Consider sentiment analysis, the task of classifying a piece of text as positive, neutral, or negative. Although this task seems simple, even state-of-the-art (SOTA) models on the market often make mistakes. For example, some SOTA models classify the double-negative sentence "I don't think I have had a better time in my life" as negative, or classify the sentence "I am a minority" as negative.

Both of these are mistakes that commercial models on the market have actually made. To show that AdaTest can find and fix bugs, Microsoft's research team demonstrated how to test for and repair text fairness failures in NLP models.

A text fairness error occurs when a neutral reference to a specific attribute group causes the model's sentiment analysis to go wrong and mistakenly lower the sentiment score of the text. In other words, the model may treat descriptions of certain groups more negatively.
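A fairness check of this kind can be sketched as a small unit test. The `toy_sentiment` classifier below is a deliberately biased, hypothetical stand-in (it has wrongly learned that the word "minority" is negative), and the "I am ..." template is an assumption chosen to mirror the example sentence quoted earlier.

```python
from typing import Callable, List

def toy_sentiment(text: str) -> str:
    """Hypothetical biased classifier used only for illustration."""
    lowered = text.lower()
    if "terrible" in lowered or "minority" in lowered:
        return "negative"
    return "neutral"

def fairness_failures(model: Callable[[str], str], groups: List[str]) -> List[str]:
    """Return neutral identity statements that the model labels as negative."""
    statements = [f"I am {group}" for group in groups]
    return [s for s in statements if model(s) == "negative"]

flagged = fairness_failures(
    toy_sentiment, ["an immigrant", "a minority", "a student"]
)
print(flagged)
```

Every statement produced by the template is neutral by construction, so any "negative" prediction counts as a failure.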

In the test loop, Microsoft researchers started with a set of text unit tests covering various identities and grouped them under a topic labeled "sensitive." These initial examples did not reveal any errors in the model.

However, AdaTest then uses GPT-3 to generate a large number of suggested tests similar to those already in the topic, in order to surface hidden bugs in the model under test.

Although hundreds of tests are generated, the person in the loop only needs to review the top few, which are the ones that fail or come close to failing. The reviewer ignores suggestions that are not real failures, adds the remaining valid tests to the current topic, and occasionally organizes them into subtopics. These manually filtered tests are included in the language model prompt for the next round, pushing the next set of suggestions toward the intersection of what the user cares about and where the model makes errors.

Repeating this internal test loop lets the tester start from a set of tests that reveal no errors and gradually expose more and more serious errors and bugs. So even users who cannot find faults in the model themselves can start from a small set of passing tests and then iterate quickly with AdaTest to produce a large batch of tests that reveal errors in the model under test.
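The test loop described above can be sketched roughly as follows. This is not AdaTest's actual code or API: the fixed `POOL` stands in for GPT-3 suggestions, the `HUMAN_LABELS` lookup stands in for the human reviewer, and `toy_model`, `Case`, and `run_test_loop` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass(frozen=True)
class Case:
    text: str
    expected: str  # label the human reviewer says the model should output

def toy_model(text: str) -> str:
    """Hypothetical model under test; it handles one memorized negation pattern."""
    words = text.lower().split()
    if "never" in words and "happier" in words:
        return "positive"
    if any(w in words for w in ("not", "never", "don't")):
        return "negative"
    return "positive"

# Fixed pool standing in for the language model's suggestions.
POOL = [
    "What a lovely day",
    "This is not good",
    "I don't think I have ever seen a better city",
    "better never city the",
    "I have never had a better meal",
]

def toy_generator(topic: List[Case]) -> List[str]:
    """Stand-in for the LLM; a real one would be prompted with the topic's tests."""
    seen = {case.text for case in topic}
    return [s for s in POOL if s not in seen]

# Lookup table standing in for the reviewer; texts without an entry are rejected.
HUMAN_LABELS = {
    "What a lovely day": "positive",
    "This is not good": "negative",
    "I don't think I have ever seen a better city": "positive",
    "I have never had a better meal": "positive",
}

def run_test_loop(
    seed: List[Case],
    model: Callable[[str], str],
    generate: Callable[[List[Case]], List[str]],
    review: Callable[[str], Optional[str]],
    rounds: int = 2,
    top_k: int = 2,
) -> Tuple[List[Case], List[Case]]:
    topic, failures = list(seed), []
    for _ in range(rounds):
        expected = topic[0].expected
        # Suspected failures first: False (prediction differs) sorts before True.
        ranked = sorted(generate(topic), key=lambda s: model(s) == expected)
        for text in ranked[:top_k]:   # the reviewer only looks at the top few
            label = review(text)
            if label is None:         # not a valid test, so it is ignored
                continue
            case = Case(text, label)
            topic.append(case)        # accepted tests shape the next round
            if model(text) != label:
                failures.append(case)
    return topic, failures

seed = [Case("I have never been happier", "positive")]
topic, failures = run_test_loop(seed, toy_model, toy_generator, HUMAN_LABELS.get)
print([case.text for case in failures])
```

The seed test passes, yet two rounds of suggest, rank, and review are enough to turn up two failing tests, while the passing suggestion "What a lovely day" is never shown to the reviewer.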

Internal test loop example

If the tester moves from the "sensitive" topic to a different one, such as the handling of negation and double negation, different faults turn up.

For example, a simple statement such as "I have never been happier than I am now" is correctly classified as positive by the commercial model. With AdaTest, however, it quickly becomes clear that more complex statements such as "I don't think I have ever seen a better city" are incorrectly labeled negative.

Once a tester sees these errors, they are obvious and egregious, but they are hard for humans to find directly because they occur only with very specific wordings.

Microsoft's research team conducted a user study to evaluate quantitatively whether AdaTest helps professional and non-professional users write better tests and find errors in NLP models. The researchers asked professional users to test topic-specific behavior in two models: a commercial text sentiment classifier and GPT-2 used for next-word autocompletion.

Next-word autocompletion is used in applications such as predicting the next word of an email as it is typed. For each topic and model, participants were randomly assigned to use either CheckList (representing the state of the art in user-driven testing) or AdaTest. The researchers observed a fivefold improvement with AdaTest across the different models and professional participants.

For non-professional users, the task was to test an NLP model's moderation of toxic content. Participants had to find non-toxic content that the model judged to be toxic, that is, content they personally felt was appropriate. Participants used either an improved version of the Dynabench crowdsourcing interface for model testing or AdaTest. Here AdaTest provided up to a tenfold improvement.

Test results for participants with different viewpoints

Use debugging loop to fix bugs

Once enough errors have been found, the tester runs the external debugging loop (as shown below): fix the errors found in the test loop, then retest the model. The "retest" step of the debugging loop (that is, running the test loop again) is crucial, because once tests have been used to fix the model they are no longer test data but training data. Fixing bugs often overcompensates, introducing shortcuts or new bugs in the first few rounds of the debugging loop that can only be discovered with a set of tests adapted to the newly "fixed" model.

Consider the debugging loop run on an open-source RoBERTa-Large sentiment model. The researchers started with the tests in the "/sensitive/immigration" topic from Figure 2, which the RoBERTa model incorrectly labeled as negative. The model was fine-tuned on these tests (mixed with the original training data to maintain task performance), yielding a new model that no longer failed them. However, when the test loop was re-run, almost all immigration statements were now marked "neutral," even when they were truly negative based on the application and test scenario.

Fine-tuning again on these new tests yields a model that fixes the original error without adding the "every immigration statement is neutral" shortcut. This does not guarantee that no other shortcut exists in the model, but in the researchers' experience, after several debugging rounds the number of unexpected errors introduced while fixing the original ones drops sharply.
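The test, fix, and retest cycle, including the overcorrection it can cause, can be illustrated with a toy model. Everything here is an assumption made for the sketch: a word-voting classifier stands in for fine-tuning RoBERTa-Large, `BASE_DATA` is invented training data, and the scripted `TEST_ROUNDS` stand in for re-running the test loop against each newly "fixed" model.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)

def train(examples: List[Example]) -> Dict[str, Counter]:
    """Toy 'fine-tuning': every word votes for the labels it appeared with."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for text, label in examples:
        for word in text.lower().split():
            votes[word][label] += 1
    return votes

def predict(votes: Dict[str, Counter], text: str) -> str:
    """Sum the votes of all known words; default to neutral if none are known."""
    tally: Counter = Counter()
    for word in text.lower().split():
        tally.update(votes.get(word, Counter()))
    return tally.most_common(1)[0][0] if tally else "neutral"

def debug_loop(base_data: List[Example], test_rounds: List[List[Example]]):
    suite: List[Example] = []   # every test accepted so far
    fixes: List[Example] = []   # failing tests promoted to training data
    history: List[int] = []     # number of failing tests seen in each round
    model = train(base_data)
    for new_tests in test_rounds:  # each entry stands in for one test-loop run
        suite.extend(new_tests)
        failing = [t for t in suite if predict(model, t[0]) != t[1]]
        history.append(len(failing))
        fixes.extend(t for t in failing if t not in fixes)
        # Retrain on the original data mixed with the tests, as in the article.
        model = train(base_data + fixes)
    return model, suite, history

BASE_DATA = [
    ("great food", "positive"),
    ("awful food", "negative"),
    ("immigrants ruined everything", "negative"),  # biased training example
]
TEST_ROUNDS = [
    [("immigrants live here", "neutral"),
     ("immigrants work nearby", "neutral"),
     ("immigrants arrived today", "neutral")],
    [("awful immigrants", "negative")],  # exposes the "always neutral" shortcut
]

model, suite, history = debug_loop(BASE_DATA, TEST_ROUNDS)
print(history)
```

After the first fix the toy model labels "awful immigrants" as neutral, the same kind of shortcut the article describes; only the second round of tests, written against the fixed model, catches and repairs it.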

Testers do not need to identify every possible error in advance; AdaTest adaptively surfaces and fixes the errors introduced in the next round of testing and debugging.

In this way, the debugging loop keeps pushing the boundary of the current bug-testing specification until a satisfactory model is produced. In effect, AdaTest applies the test-fix-retest cycle of software engineering to NLP.

Shortcuts added during iterations of the debugging loop are discovered and fixed by later iterations

To evaluate the effectiveness of the debugging loop, the researchers fine-tuned RoBERTa-Large on the Quora Question Pairs (QQP) dataset to detect whether two questions are duplicates, and on the Stanford Sentiment Treebank (SST) dataset for positive/neutral/negative sentiment analysis.

The baseline models failed on 22 of the 53 QQP topics and 11 of the 39 sentiment topics. The researchers then created data to fix each topic, either by taking 50 examples from the topic's CheckList data or by running the debugging loop with AdaTest, which produced an average of 41.6 tests for QQP and 55.8 tests for sentiment.

The results show that in the vast majority of cases, AdaTest fixes the topics used for training, along with some unseen held-out topics, without breaking any topics, whereas the original CheckList data often introduces new errors and thereby breaks other test topics.

The researchers also evaluated AdaTest in a standard development setting. After three months of development, CheckList testing, and ad hoc GPT-3-based data augmentation, a team's model reached an F1 score of 0.66 (out of 1.00) on unseen data collected in the wild.

The same team, using AdaTest, achieved an F1 score of 0.77 on the same unseen dataset after running the debugging loop themselves for four hours. These scores were later replicated on a second unseen dataset, showing that AdaTest can fix bugs and achieve better results where traditional methods fall short.
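For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes the binary version on made-up labels; the article does not say whether the reported 0.66 and 0.77 are binary or averaged over classes, so the binary form here is an assumption.

```python
from typing import Sequence

def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
score = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(round(score, 3))
```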

People supply the problem specification that language models lack, while language models supply high-quality tests at much greater scale and scope. Connecting model testing and debugging in this way fixes errors effectively and brings model development a step closer to the iterative nature of traditional software development.

Collaboration between humans and AI represents a future direction for machine learning. The hope is that this collaboration will keep improving as the capabilities of large language models continue to grow.
