Table of Contents
1. Data design for AI
2. Improve data: filter, clean, label, enhance
Data screening
Data cleaning
Data annotation
Data augmentation
3. Data used to evaluate and monitor AI models
4. The future of data

New research by Stanford Li Feifei's team found that designing, improving, and evaluating data are the keys to achieving trustworthy artificial intelligence

Apr 24, 2023, 10:04 AM
Tags: AI, Stanford University, Li Feifei

As AI development shifts from a model-centric to a data-centric paradigm, the quality of data has become especially important.

In previous AI development processes, data sets were usually fixed, and development efforts focused on iterating the model architecture or training process to improve baseline performance. Now, with data iteration taking center stage, we need more systematic ways to evaluate, filter, clean, and annotate the data used to train and test AI models.

Recently, Weixin Liang, Li Feifei, and colleagues from the Department of Computer Science at Stanford University published an article titled "Advances, challenges and opportunities in creating data for trustworthy AI" in Nature Machine Intelligence, discussing the key factors and methods for ensuring data quality at every stage of the AI data pipeline.

Paper address: https://www.nature.com/articles/s42256-022-00516-1.epdf?sharing_token=VPzI-KWAm8tLG_BiXJnV9tRgN0jAjWel9jnR3ZoTv0MRS1pu9dXg73FQ0NTrwhu7Hi_VBEr6peszIAFc6XO1tdlvV1lLJQtOvUFnSXpvW6_nu0Knc_dRekx6lyZNc6PcM1nslocIcut_qNW9OUg1IsbCfuL058R4MsYFqyzlb2E=

The main steps in the AI data pipeline are data design (data collection and recording), data improvement (filtering, cleaning, annotation, and augmentation), and data strategies for evaluating and monitoring AI models. Each of these steps affects the trustworthiness of the final AI model.

Figure 1: Roadmap for the development of a data-centric approach from data design to evaluation.

1. Data design for AI

After identifying an artificial intelligence application, the first step in developing an AI model is to design the data (i.e., identify and record the source of the data).

Design should be an iterative process—using experimental data to develop an initial AI model, then collecting additional data to patch the model’s limitations. Key criteria for design are ensuring that the data is suitable for the task and covers enough scope to represent the different users and scenarios the model may encounter.

Yet the datasets currently used to develop AI often have limited coverage or are biased. In medical AI, for example, the patient data collected to develop algorithms is distributed disproportionately across geographic regions, which can limit the applicability of AI models to different populations.

One way to improve data coverage is to involve the broader community in creating the data. This is exemplified by the Common Voice project, the largest publicly available multilingual speech dataset, which contains 11,192 hours of transcribed speech in 76 languages from more than 166,000 participants.

When representative data is difficult to obtain, synthetic data can fill the coverage gaps. For example, collecting real faces often raises privacy issues and sampling bias, so synthetic faces created by deep generative models are now used to mitigate data imbalance and bias. In healthcare, synthetic medical records can be shared to facilitate knowledge discovery without disclosing actual patient information. In robotics, where the real world is the ultimate test bed, high-fidelity simulation environments let agents learn faster and more safely on complex, long-horizon tasks.

But synthetic data has its own problems. There is always a gap between synthetic and real-world data, so performance often degrades when an AI model trained on synthetic data is transferred to the real world. Synthetic data can also exacerbate disparities if simulators are not designed with minority groups in mind. Because the performance of AI models depends heavily on the context of their training and evaluation data, it is important to document the context of data design in standardized, transparent reports.

Now, researchers have created a variety of "data nutrition labels" to capture metadata about the data design and annotation process. Useful metadata includes statistics on the sex, gender, race, and geographic location of participants in the dataset, which can help discover if there are underrepresented subgroups that are not being covered. Data provenance is also a type of metadata that tracks the source and time of data as well as the processes and methods that produced the data.

Metadata can be stored in a dedicated data design document, which is important for observing the life cycle and socio-technical context of the data. Documents can be uploaded to stable and centralized data repositories such as Zenodo.
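Concretely, a "data nutrition label" can be as simple as a structured metadata record attached to the dataset. The sketch below is hypothetical — the field names, numbers, and the 30% threshold are illustrative, not a formal standard — but it shows how such a record supports a basic check for underrepresented subgroups:

```python
import json

# A minimal, hypothetical "data nutrition label" for a speech dataset,
# recording design metadata and provenance as the article recommends.
datasheet = {
    "name": "example-speech-corpus",
    "collection": {
        "method": "crowdsourced recordings",
        "period": "2021-01 to 2022-06",
    },
    "participants": {
        "languages": 76,
        "sex": {"female": 0.24, "male": 0.45, "unreported": 0.31},
        "geography": ["EU", "NA", "SA", "AF", "AS", "OC"],
    },
    "provenance": {
        "source": "volunteer submissions",
        "processing": ["silence trimming", "loudness normalization"],
    },
}

def underrepresented(groups: dict, threshold: float = 0.3) -> list:
    """Flag demographic groups whose share falls below a threshold."""
    return sorted(k for k, v in groups.items() if v < threshold)

flags = underrepresented(datasheet["participants"]["sex"])
print(json.dumps(flags))  # flags the group below the 30% threshold
```

Keeping such a record machine-readable means coverage checks like this can run automatically whenever the dataset is updated.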

2. Improve data: filter, clean, label, enhance

After the initial dataset is collected, we need to further improve the data to make it more effective for AI development. This is a key difference between model-centric and data-centric approaches to AI, as shown in Figure 2a. Model-centric research usually takes the data as given and focuses on improving the model architecture or optimization procedure. Data-centric research, in contrast, focuses on scalable methods to systematically improve the data through processes such as cleaning, filtering, annotation, and augmentation, and can use one-stop model-development platforms.

Figure 2a: Comparison of model-centric and data-centric approaches to AI. MNIST, COCO, and ImageNet are commonly used datasets in AI research.

Data screening

If a dataset is very noisy, carefully screening the data before training can significantly improve the reliability and generalization of the model. The airplane image in Figure 2a is an example of a noisy data point that should be removed from a bird dataset.

In Figure 2b, four state-of-the-art models trained on a previously used large dermatology dataset all perform poorly — especially on dark-skin images — because of bias in the training data, while Model 1, trained on smaller but higher-quality data, is comparatively reliable on both dark and light skin tones.

Figure 2b: Dermatology diagnostic test performance on light skin and dark skin images.

Figure 2c shows that ResNet, DenseNet, and VGG, three popular deep-learning architectures for image classification, all perform poorly when trained on a noisy image dataset. After filtering with data Shapley values to remove poor-quality data, a ResNet model trained on the cleaner data subset performs significantly better.

Figure 2c: Comparison of object-recognition test performance of different models before and after data filtering. Numbers in parentheses are the number of training points remaining after filtering out noisy data; results are aggregated over five random seeds, and the shaded area represents the 95% confidence interval.

This is the purpose of data valuation: to quantify the importance of different data points and filter out those that may harm model performance through poor quality or bias.
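A minimal sketch of this idea, using the simple leave-one-out value (exact data Shapley values require averaging removals over many subsets) with a toy nearest-centroid classifier on 1-D data. The dataset, the model, and the deliberately mislabeled point are all illustrative:

```python
def accuracy(train, test):
    """Nearest-centroid classifier on (x, label) pairs."""
    centroids = {}
    for label in set(l for _, l in train):
        xs = [x for x, l in train if l == label]
        centroids[label] = sum(xs) / len(xs)
    hits = sum(
        1 for x, y in test
        if min(centroids, key=lambda c: abs(centroids[c] - x)) == y
    )
    return hits / len(test)

train = [(0.0, 0), (0.2, 0), (-0.2, 0), (2.0, 0),  # last point mislabeled
         (2.0, 1), (2.2, 1), (1.8, 1)]
test = [(0.1, 0), (-0.1, 0), (0.3, 0), (1.9, 1), (2.1, 1), (1.1, 1)]

base = accuracy(train, test)
# Leave-one-out value: how much does accuracy change when one point
# is removed from training? Harmful points get positive values here.
values = [accuracy(train[:i] + train[i + 1:], test) - base
          for i in range(len(train))]
most_harmful = max(range(len(train)), key=values.__getitem__)
print(most_harmful, round(values[most_harmful], 3))
```

In this toy setup only the mislabeled point (index 3) improves test accuracy when removed; filtering on such values is the cleaning step the figure describes.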

Data cleaning

In the article, the authors introduce two data valuation methods that help clean the data.

The first measures the change in AI model performance when individual data points are removed from training. This change can be approximated efficiently using data Shapley values or influence functions, as shown in Figure 3a, which makes the evaluation computationally feasible for large AI models.

Figure 3a: Data valuation. The data Shapley value measures the change in performance of models trained on different subsets of the data when a specific point (the faded, crossed-out star in the figure) is removed from training, thereby quantifying the value of each data point (star symbols). Colors represent class labels.

The second approach uses prediction uncertainty to detect poor-quality data points. Human annotations can systematically deviate from AI model predictions, and confident-learning algorithms can detect these deviations; on common benchmarks such as ImageNet, more than 3% of the test data has been found to be mislabeled. Filtering out these errors can greatly improve model performance.
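A stripped-down sketch of the intuition behind confident learning: flag points whose given label disagrees with a confident out-of-sample prediction. The full algorithm additionally estimates per-class confidence thresholds and a joint label-error distribution; the probabilities below are hypothetical stand-ins for a model's outputs:

```python
# Out-of-sample predicted class probabilities and the observed labels.
probs = [
    [0.95, 0.05],  # labeled 0 -- consistent
    [0.90, 0.10],  # labeled 0 -- consistent
    [0.10, 0.90],  # labeled 0 -- likely mislabeled
    [0.20, 0.80],  # labeled 1 -- consistent
    [0.15, 0.85],  # labeled 1 -- consistent
]
labels = [0, 0, 0, 1, 1]

def flag_label_errors(probs, labels, threshold=0.5):
    """Flag indices where the model confidently predicts a class
    different from the observed label."""
    flagged = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        pred = max(range(len(p)), key=p.__getitem__)
        if pred != y and p[pred] >= threshold:
            flagged.append(i)
    return flagged

print(flag_label_errors(probs, labels))
```

Points flagged this way are candidates for relabeling or removal before retraining.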

Data annotation

Data annotation is also a major source of data bias. Although AI models can tolerate some random label noise, biased labeling errors produce biased models. Today we rely mainly on manual annotation, which is expensive: annotating a single LIDAR scan can cost more than US$30, because the data is three-dimensional and the annotator must draw 3D bounding boxes — a more demanding task than typical annotation.

The authors therefore argue that annotation tools on crowdsourcing platforms such as MTurk must be carefully calibrated to provide consistent annotation rules. In medical settings, it is also important to consider that annotators may need specialized expertise, or that the data may be too sensitive to crowdsource.

One way to reduce annotation cost is data programming. In data programming, AI developers no longer label data points by hand; instead, they write programmatic labeling functions that automatically label the training set. As shown in Figure 3b, after user-defined labeling functions generate multiple, potentially noisy labels for each input, additional algorithms aggregate those labels to reduce the noise.

Figure 3b: Data programming.
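The workflow in Figure 3b can be sketched as follows. The labeling functions here target a hypothetical spam task, and the majority-vote aggregation is deliberately simplistic — production data-programming systems model labeling-function accuracies and correlations rather than taking a raw vote:

```python
import re

# Each labeling function returns 1 (spam), 0 (not spam), or None (abstain).
def lf_keyword(text):
    return 1 if re.search(r"free|winner|prize", text, re.I) else None

def lf_shouting(text):
    return 1 if text.isupper() else None

def lf_greeting(text):
    return 0 if text.lower().startswith(("hi", "dear")) else None

def weak_label(text, lfs):
    """Aggregate labeling-function votes by simple majority."""
    votes = [v for lf in lfs if (v := lf(text)) is not None]
    if not votes:
        return None  # every function abstained
    return max(set(votes), key=votes.count)

lfs = [lf_keyword, lf_shouting, lf_greeting]
print(weak_label("FREE PRIZE INSIDE", lfs))        # two spam votes -> 1
print(weak_label("Hi team, notes attached", lfs))  # greeting vote -> 0
```

The resulting weak labels are noisy, which is exactly why the aggregation step — and the data valuation methods above — matter.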

Another "human-in-the-loop" way to reduce labeling cost is to prioritize the most valuable data for labeling through active learning. Active learning draws on ideas from optimal experimental design: the algorithm selects the most informative points from a pool of unlabeled data — for example, points with high information gain or points on which the model is uncertain — and only those points are manually annotated. The benefit is that far less labeled data is needed than in standard supervised learning.
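A minimal sketch of uncertainty sampling, the simplest active-learning criterion: rank unlabeled points by the entropy of the model's predicted class probabilities and annotate the top ones. The pool and its probabilities are hypothetical stand-ins for a real model's predictions:

```python
import math

def entropy(p):
    """Shannon entropy of a probability distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Unlabeled pool: item id -> model's predicted class probabilities.
unlabeled = {
    "a": [0.98, 0.02],  # model is confident
    "b": [0.55, 0.45],  # model is uncertain -> most informative
    "c": [0.80, 0.20],
}

def select_for_annotation(pool, k=1):
    """Return the k highest-entropy items to send to human annotators."""
    return sorted(pool, key=lambda x: entropy(pool[x]), reverse=True)[:k]

print(select_for_annotation(unlabeled))
```

In a real loop, the selected points are labeled, the model is retrained, and the pool is re-scored before the next round of selection.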

Data augmentation

Finally, when the existing data is still very limited, data augmentation is an effective method to expand the data set and improve the reliability of the model.

Computer vision data can be augmented with image rotations, flips, and other digital transformations, and text data with automated writing-style transformations. A more sophisticated recent technique is Mixup, which creates new training data by interpolating pairs of training samples, as shown in Figure 3c.

Beyond manual data augmentation, automated data-augmentation pipelines are now a popular solution. In addition, when unlabeled data is available, labels can be augmented by using an initial model to make predictions on it (these predictions are called pseudo-labels) and then training a larger model on the combination of real labels and high-confidence pseudo-labels.

Figure 3c: Mixup augments a dataset by creating synthetic data that interpolates existing data. Blue points represent existing data points in the training set, and red points represent synthetic data points created by interpolating two existing data points.
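Mixup itself is only a few lines. This sketch follows the standard formulation — a convex combination of two examples weighted by a Beta-distributed coefficient — on plain Python lists; real implementations operate on batched tensors:

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Create one synthetic example by convexly combining two examples.
    y1, y2 are one-hot label lists; lam ~ Beta(alpha, alpha)."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

random.seed(1)
x, y = mixup([1.0, 0.0], [1, 0], [0.0, 1.0], [0, 1])
print(x, y)
```

Note that the interpolated label is soft: the synthetic point lies between the two classes, exactly as the red points in Figure 3c lie between existing blue points.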

3. Data used to evaluate and monitor AI models

After a model is trained, the goal of AI evaluation is to assess its generalizability and trustworthiness.

To achieve this, evaluation data should be carefully designed to reflect the real-world settings in which the model will be deployed, and it needs to be sufficiently different from the model's training data.

For example, medical AI models are usually trained on data from a small number of hospitals. When such a model is deployed at a new hospital, its accuracy drops because of differences in data collection and processing. Evaluating a model's generalization therefore requires collecting evaluation data from different hospitals and different data-processing pipelines. In other applications, evaluation data should likewise come from different sources, and ideally be labeled by different annotators than the training data. High-quality human labels remain the most important basis for evaluation.

An important role of AI evaluation is to determine whether an AI model relies on spurious correlations in the training data — "shortcuts" that do not generalize. For example, in medical imaging, the way data is processed (such as cropping or image compression) can create spurious correlations that the model picks up. These shortcuts may appear helpful, but they can fail catastrophically when the model is deployed in a slightly different environment.

Systematic data ablation is a good way to examine potential model shortcuts. In data ablation, AI models are trained and tested on ablated inputs from which the surface signals carrying the suspected spurious correlations have been removed.

Figure 4: Data ablation

One example of using data ablation to detect shortcuts is a study of common natural language inference datasets: an AI model trained on only the first half of each text input achieved high accuracy in inferring the logical relationship between the first and second halves — close to human-level performance and far above random guessing. This suggests that AI models exploit spurious correlations as a shortcut for the task. The research team found that models exploit specific linguistic phenomena, such as negation in the text being highly correlated with certain labels.

Data ablation is widely used across fields. In medicine, for example, the biologically relevant parts of an image can be masked out to assess whether the AI is learning from spurious background features or image-quality artifacts.
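The shortcut described above can be illustrated with a toy ablation: a "model" that reads only half of each input yet scores perfectly, because a surface cue — the word "not" — predicts the label in this synthetic inference data:

```python
# Synthetic (premise, hypothesis, label) pairs. In this toy data the
# word "not" in the hypothesis perfectly predicts "contradict".
data = [
    ("a dog runs", "an animal moves", "entail"),
    ("a dog runs", "the dog is not moving", "contradict"),
    ("a cat sleeps", "a pet rests", "entail"),
    ("a cat sleeps", "the cat is not asleep", "contradict"),
]

def half_input_model(hypothesis):
    # A "model" that never reads the premise (the ablated half).
    return "contradict" if "not" in hypothesis.split() else "entail"

ablated_acc = sum(half_input_model(h) == y for _, h, y in data) / len(data)
print(ablated_acc)
```

High accuracy on ablated inputs is the warning sign: the task should not be solvable from half of the input, so the signal being used must be a shortcut.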

AI evaluation is often limited to comparing overall performance metrics across an entire test data set. But even if an AI model works well at the overall data level, it may still show systematic errors on specific subgroups of the data, and characterizing clusters of these errors can provide a better understanding of the model's limitations.

When metadata is available, fine-grained evaluation should, wherever possible, slice the evaluation data by the sex, gender, race, and geographic location of the participants — for example, "older Asian men" or "Native American women" — and quantify the model's performance on each subgroup. Multiaccuracy auditing is an algorithm that automatically searches for data subgroups on which an AI model performs poorly: an auditing algorithm is trained to predict and cluster the original model's errors from the metadata, and then provides interpretable answers to questions such as what mistakes the AI model makes and why.

When metadata is not available, methods such as Domino automatically identify data clusters on which the evaluated model is error-prone and use text generation to create natural-language explanations of these errors.
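When metadata is available, the basic slicing step is straightforward to sketch. The groups and per-example results below are hypothetical, echoing the dermatology example in Figure 2b:

```python
from collections import defaultdict

# Hypothetical evaluation records: (metadata subgroup, prediction correct?).
results = [
    ("light_skin", True), ("light_skin", True), ("light_skin", True),
    ("light_skin", False),
    ("dark_skin", True), ("dark_skin", False), ("dark_skin", False),
    ("dark_skin", False),
]

def sliced_accuracy(results):
    """Accuracy per metadata subgroup rather than one overall number."""
    tally = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for group, ok in results:
        tally[group][0] += ok
        tally[group][1] += 1
    return {g: c / n for g, (c, n) in tally.items()}

acc = sliced_accuracy(results)
print(acc)  # the 0.5 overall accuracy hides a large subgroup gap
```

Reporting only the overall number (0.5 here) would mask the disparity that the per-subgroup view exposes.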

4. The future of data

Currently most AI research projects develop datasets only once, but real-world AI users often need to continuously update datasets and models. Continuous data development will bring the following challenges:

First, both the data and the AI task can change over time: a new vehicle model may appear on the road (i.e., domain shift), or the AI developer may want to recognize a new object category (for example, a school bus as distinct from a regular bus), which changes the label taxonomy. It would be wasteful to discard millions of hours of old labeled data, so updating it is imperative. Training and evaluation metrics should also be designed carefully to weight new data appropriately and to use suitable data for each subtask.

Second, to obtain and use data continuously, users will need to automate most of the data-centric AI process. Such automation uses algorithms to decide which data to send to annotators and how to retrain the model with it, alerting the model developer only when something goes wrong (for example, when accuracy metrics drop). As part of the "MLOps" (machine learning operations) trend, companies are beginning to adopt tools that automate the machine-learning life cycle.
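A toy version of such an automated check — the thresholds, window size, and accuracy history are all illustrative:

```python
def should_alert(history, baseline, window=5, tolerance=0.05):
    """True if mean accuracy over the last `window` checks drops more
    than `tolerance` below the baseline, signaling the developer that
    re-labeling or retraining may be needed."""
    recent = history[-window:]
    return sum(recent) / len(recent) < baseline - tolerance

# Rolling accuracy measurements from a deployed model (hypothetical).
history = [0.91, 0.90, 0.92, 0.84, 0.82, 0.80, 0.79, 0.78]
print(should_alert(history, baseline=0.90))
```

Checks like this are the "only alert when something goes wrong" piece of the loop; the data-selection and retraining steps would run unattended between alerts.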
