
Simple and general: 3× lossless training acceleration for visual backbones, Tsinghua's EfficientTrain++ accepted by TPAMI 2024

WBOY
Release: 2024-06-10 14:54:28

Wang Yulin, the author of the paper discussed here, is a direct-entry Ph.D. student (class of 2019) in the Department of Automation, Tsinghua University, advised by Academician Wu Cheng and Associate Professor Huang Gao. His main research interests are efficient deep learning and computer vision. He has published first-author papers in journals and conferences including TPAMI, NeurIPS, ICLR, ICCV, CVPR, and ECCV, and has received the Baidu Scholarship, Microsoft Scholar, CCF-CV Academic Emerging Award, ByteDance Scholarship, and other honors. Personal homepage: wyl.cool.

This article introduces a paper recently accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI): EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training.


  • Paper link: https://arxiv.org/pdf/2405.08768
  • The code and pre-trained models are open source: https://github.com/LeapLabTHU/EfficientTrain
  • Conference version of the paper (ICCV 2023): https://arxiv.org/pdf/2211.09703

In recent years, "scaling" has been one of the central themes of computer vision research. With growing model sizes and training data scales, advances in learning algorithms, and the widespread use of regularization and data augmentation techniques, visual backbones obtained through large-scale training (such as Vision Transformer trained on ImageNet-1K/22K, MAE, DINOv2, etc.) have achieved impressive performance on many important visual tasks such as visual recognition, object detection, and semantic segmentation.

However, "scaling" often brings a prohibitively high model training cost, which significantly hinders the further development and industrial application of visual foundation models.

To address this problem, a research team at Tsinghua University proposed a generalized curriculum learning algorithm: EfficientTrain++. The core idea is to extend the traditional curriculum learning paradigm of "selecting data from easy to difficult and training the model progressively" to "performing no selection along the sample dimension and always using all training data, while gradually revealing the features or patterns of each data sample, from easy to difficult, over the course of training."

EfficientTrain++ has several important highlights:

  1. Plug-and-play 1.5-3.0× lossless training acceleration for visual backbones. Neither upstream nor downstream model performance is lost, and the measured speedup is consistent with the theoretical results.
  2. General across different training data scales (such as ImageNet-1K/22K; the effect on 22K is even more pronounced), across supervised and self-supervised learning (such as MAE), and across different training budgets (e.g., from 0 to 300 or more epochs).
  3. General across ViT, ConvNet, and other network architectures (more than twenty models of different sizes and types were tested in the paper, with consistent gains).
  4. For smaller models, in addition to training acceleration, it can also significantly improve performance (for example, without extra information or extra training cost, the authors obtained an 81.3% DeiT-S on ImageNet-1K, comparable to the original Swin-Tiny).
  5. Specialized practical efficiency optimizations were developed for two challenging but common practical situations: 1) the CPU or hard disk is not powerful enough, so data preprocessing cannot keep up with the GPU; 2) large-scale parallel training, such as training large models on ImageNet-22K with 64 or more GPUs.

Next, let's take a look at the details of the study.

1. Research Motivation

In recent years, the rapid development of large-scale foundation models has driven progress in artificial intelligence and deep learning. In computer vision, representative works such as Vision Transformer (ViT), CLIP, SAM, and DINOv2 have shown that scaling up neural network size and training data can significantly push the performance boundaries of important visual tasks such as recognition, detection, and segmentation.

However, large foundation models often come with a high training cost. Figure 1 gives two typical examples. With 8 NVIDIA V100 or higher-performance GPUs, a single training run of GPT-3 or ViT-G would take years or even decades. Such training costs are hard to afford for both academia and industry, and often only a few top institutions can spend the resources needed to push deep learning forward. An urgent question is therefore: how can the training efficiency of large-scale deep learning models be effectively improved?

Figure 1 Example: High training overhead of large deep learning foundation models

For computer vision models, a classic idea is curriculum learning, shown in Figure 2, which imitates the progressive, highly structured way humans learn: during training, the model starts with the "simplest" training data, and harder data is introduced gradually, from easy to difficult.

Figure 2 Classic curriculum learning paradigm (image source: "A Survey on Curriculum Learning", TPAMI'22)

However, although the motivation is natural, curriculum learning has not been widely adopted as a general method for training visual backbones. The main reason is two key bottlenecks, as shown in Figure 3. First, designing an effective training curriculum is not easy. Distinguishing "easy" from "difficult" samples often requires additional pre-trained models, more complex AutoML algorithms, reinforcement learning, and so on, which generalizes poorly. Second, the way curriculum learning models the problem is itself somewhat unreasonable. Naturally distributed visual data is highly diverse. The lower part of Figure 3 gives an example (parrot images randomly selected from ImageNet): the training data contains parrots in many different poses, at different distances from the camera, from different viewpoints and against different backgrounds, as well as diverse interactions between parrots and people or objects. Sorting such diverse data by a single "easy" versus "difficult" indicator is a rather coarse and contrived way of modeling it.

Figure 3 Two key bottlenecks that hinder the large-scale application of curriculum learning to training visual backbones

2. Method Introduction

Motivated by the above challenges, this paper proposes a generalized curriculum learning paradigm. The core idea is to extend the traditional curriculum learning paradigm of "selecting data from easy to difficult and training the model progressively" to "performing no selection along the sample dimension and always using all training data, while gradually revealing the features or patterns of each data sample, from easy to difficult, over the course of training". This effectively avoids the limitations and sub-optimal designs caused by the data selection paradigm, as shown in Figure 4.

Figure 4 Traditional curriculum learning (sample dimension) vs. generalized curriculum learning (feature dimension)

This paradigm is mainly based on an interesting phenomenon: when a visual model is trained naturally, although it has access to all the information in the data at all times, it always learns to recognize the simpler discriminative features (patterns) in the data first, and then gradually learns to recognize more difficult ones on that basis. Moreover, this regularity is fairly universal: "relatively simple" discriminative features can be readily identified in both the frequency domain and the spatial domain. The paper designs a series of experiments to demonstrate these findings, as described below.

From a frequency-domain perspective, "low-frequency features" are "relatively simple" for the model. In Figure 5, the authors train a DeiT-S model on the standard ImageNet-1K training data, filter the validation set with low-pass filters of different bandwidths so that only the low-frequency components of the validation images are kept, and report the accuracy of DeiT-S on this low-pass filtered validation data over the course of training. The resulting accuracy curves are shown on the right side of Figure 5.

An interesting phenomenon emerges: in the early stages of training, using only low-pass filtered validation data does not significantly reduce accuracy, and the point at which each curve separates from the accuracy on the normal validation set moves gradually to the right as the filter bandwidth increases. This shows that although the model always has access to both the low- and high-frequency parts of the training data, its learning process naturally starts by focusing only on low-frequency information, and the ability to recognize higher-frequency features is acquired gradually later in training (see the original paper for more evidence of this phenomenon).
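The low-pass filtering used in this kind of probe can be sketched as follows. This is a minimal NumPy illustration, not the authors' evaluation code: the function name `low_pass_filter`, the square (rather than circular) pass band, and the bandwidth value in the usage lines are all assumptions made for the example.

```python
import numpy as np


def low_pass_filter(image: np.ndarray, bandwidth: int) -> np.ndarray:
    """Zero out every frequency outside a centered bandwidth x bandwidth window.

    `image` is H x W or H x W x C. The output has the same shape as the input.
    The square pass band is an assumption of this sketch.
    """
    h, w = image.shape[:2]
    # Move the zero-frequency (DC) component to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))

    # Boolean mask that keeps only the central low-frequency window.
    mask = np.zeros((h, w), dtype=bool)
    top, left = h // 2 - bandwidth // 2, w // 2 - bandwidth // 2
    mask[top:top + bandwidth, left:left + bandwidth] = True
    if image.ndim == 3:
        mask = mask[..., None]  # broadcast over channels

    filtered = np.fft.ifft2(
        np.fft.ifftshift(spectrum * mask, axes=(0, 1)), axes=(0, 1)
    )
    # The window may be slightly asymmetric for even sizes, so keep the real part.
    return filtered.real


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
filtered = low_pass_filter(image, bandwidth=64)
print(filtered.shape)  # same spatial size as the input
```

Evaluating a checkpoint on `low_pass_filter(x, b)` for several values of `b` at different points in training would give curves of the kind the text describes.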

Figure 5 From a frequency-domain perspective, the model naturally tends to learn to recognize low-frequency features first

This finding leads to an interesting question: can we design a training curriculum that starts by providing the model with only the low-frequency information of the visual input, and then gradually introduces the high-frequency information?

Figure 6 investigates this idea: low-pass filtering is applied to the training data only during an early training phase of a given length, and the rest of the training process is left unchanged. The results show that although the final performance gain is limited, the final accuracy of the model is largely preserved even when only low-frequency components are provided for a considerable portion of early training. This agrees with the observation in Figure 5 that "the model mainly focuses on learning to recognize low-frequency features in the early stages of training".

This finding prompted the authors to think about training efficiency: since the model only needs the low-frequency components of the data in the early stages of training, and those components contain less information than the original data, can the model learn from the low-frequency components alone at a lower computational cost than processing the original input?

Figure 6 Providing only low-frequency components to the model for a long period of early training does not significantly affect the final performance

In fact, this idea is entirely feasible. As shown on the left side of Figure 7, the authors introduce a cropping operation on the Fourier spectrum of the image: the low-frequency part is cropped out and mapped back to pixel space. This low-frequency cropping exactly preserves all low-frequency information while reducing the size of the input image, so the computational cost of learning from the input can be substantially reduced.

Applying low-frequency cropping to the model input in the early stages of training significantly reduces the overall training cost, and because the information the model needs for learning is retained as far as possible, the final model is still obtained with almost no loss in performance. The experimental results are shown in the lower right corner of Figure 7.
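The low-frequency cropping described above can be illustrated with a short NumPy sketch. It is written under stated assumptions and is not taken from the released code: the name `low_frequency_crop`, the rescaling factor that keeps the mean intensity unchanged, and the 224 and 160 sizes in the usage lines are all choices made for this example.

```python
import numpy as np


def low_frequency_crop(image: np.ndarray, bandwidth: int) -> np.ndarray:
    """Crop the central bandwidth x bandwidth window of the spectrum and
    map it back to pixel space, giving a smaller bandwidth x bandwidth image.

    `image` is H x W or H x W x C.
    """
    h, w = image.shape[:2]
    # Centered spectrum: the DC component sits at index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))

    # Keep only the low-frequency window; the DC term lands at
    # (bandwidth // 2, bandwidth // 2), which is where ifftshift expects it.
    top, left = h // 2 - bandwidth // 2, w // 2 - bandwidth // 2
    cropped = spectrum[top:top + bandwidth, left:left + bandwidth]

    small = np.fft.ifft2(np.fft.ifftshift(cropped, axes=(0, 1)), axes=(0, 1))
    # The inverse FFT divides by bandwidth**2 instead of h * w, so rescale
    # to keep the mean intensity of the original image (assumed convention).
    return small.real * (bandwidth * bandwidth) / (h * w)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
small = low_frequency_crop(image, bandwidth=160)
print(small.shape)               # spatial size is now 160 x 160
print(small.size / image.size)   # fraction of pixels the model has to process
```

Unlike the low-pass filter used for analysis, the output here is spatially smaller, which is where the compute saving comes from: the model sees fewer pixels (or tokens) per image while the retained frequencies are unchanged.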

Figure 7 Low-frequency cropping: letting the model learn efficiently from low-frequency information only

Beyond frequency-domain operations, "relatively simple" features can also be found from the perspective of spatial-domain transformations. For example, the natural image information in raw visual input that has not undergone strong data augmentation or distortion is often "simpler" and easier for the model to learn, because it comes from the real-world distribution, whereas the additional information, invariances, and so on introduced by preprocessing techniques such as data augmentation are often harder to learn (a typical example is given on the left side of Figure 8).

In fact, existing research has also observed that data augmentation mainly takes effect in the later stages of training (e.g., "Improving Auto-Augment via Augmentation-Wise Weight Sharing", NeurIPS'20).

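Along this dimension, the generalized curriculum learning paradigm can be realized simply by varying the strength of data augmentation, so that in the early stages of training the model is given only the natural image information in the training data that is easier to learn. The right side of Figure 8 validates this idea using RandAugment as a representative example. RandAugment comprises a series of common spatial-domain data augmentation transforms (e.g., random rotation, sharpness adjustment, affine transformation, exposure adjustment).

It can be observed that training the model starting from weaker data augmentation effectively improves its final performance, and that this technique is compatible with low-frequency cropping.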
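One simple way to "start from weaker augmentation" is to ramp the augmentation magnitude with training progress. The sketch below assumes a linear ramp from 0 to a maximum magnitude of 9; the function name, the linear shape, and the default value are illustrative assumptions, and the article does not state the exact schedule used in Figure 8.

```python
def augmentation_magnitude(progress: float, max_magnitude: float = 9.0) -> float:
    """Return the augmentation magnitude for a given training progress.

    `progress` is the fraction of training completed, in [0, 1].
    The linear ramp and the default maximum are assumptions of this sketch.
    """
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return max_magnitude * progress


# Print the magnitude at a few points of a hypothetical 100-epoch run.
for epoch in range(0, 101, 25):
    print(epoch, augmentation_magnitude(epoch / 100))
```

The returned value would be passed as the magnitude of whatever RandAugment implementation the training pipeline uses, rebuilt or updated once per epoch.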

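Figure 8 Finding features that are "easier to learn" for the model from a spatial-domain perspective: a data augmentation view

Up to this point, the paper has proposed the core framework and hypothesis of generalized curriculum learning, and has demonstrated its soundness and effectiveness by revealing two key phenomena in the frequency and spatial domains. On this basis, the paper goes on to complete a series of systematic contributions, listed below. Due to space constraints, please refer to the original paper for more details.

  1. Combining the two core findings in the frequency and spatial domains, it proposes and refines a specially designed optimization algorithm, establishing a unified, integrated EfficientTrain++ generalized curriculum learning scheme.
  2. It discusses how to implement low-frequency cropping efficiently on real hardware, and compares, both theoretically and experimentally, the differences and connections between two feasible ways of extracting low-frequency information: low-frequency cropping and image downsampling.
  3. It develops specialized practical efficiency optimizations for two challenging but common practical situations: 1) the CPU or hard disk is not powerful enough, so data preprocessing cannot keep up with the GPU; 2) large-scale parallel training, such as training large models on ImageNet-22K with 64 or more GPUs.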

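The resulting EfficientTrain++ generalized curriculum learning scheme is shown in Figure 9. EfficientTrain++ dynamically adjusts the bandwidth of frequency-domain low-frequency cropping and the strength of spatial-domain data augmentation according to the percentage of the total training compute budget that has been consumed.

It is worth noting that, as a plug-and-play method, EfficientTrain++ can be applied directly to a variety of visual backbones and diverse training scenarios without further hyperparameter tuning or search, and its gains are stable and significant.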
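To make "keyed on the percentage of compute consumed" concrete, here is a small pure-Python sketch. Everything numeric in it is hypothetical: the stage boundaries and bandwidths in `STAGES`, the maximum magnitude, the linear magnitude ramp, and the assumption that the cost of one step scales with the square of the input side length. The actual schedule is the one given in Figure 9 of the paper, not this table.

```python
from typing import Tuple

FULL_RES = 224        # assumed final input size
MAX_MAGNITUDE = 9.0   # assumed maximum augmentation magnitude

# (upper bound on consumed-compute fraction, crop bandwidth); hypothetical values
STAGES = [(0.4, 128), (0.7, 160), (1.0, 224)]


def curriculum(consumed: float) -> Tuple[int, float]:
    """Map the fraction of the compute budget consumed to
    (low-frequency crop bandwidth, augmentation magnitude)."""
    consumed = min(max(consumed, 0.0), 1.0)
    for upper, bandwidth in STAGES:
        if consumed <= upper:
            break
    return bandwidth, MAX_MAGNITUDE * consumed


def simulate(total_budget: float) -> int:
    """Count how many training steps fit into `total_budget`, measured in
    units of one full-resolution step. Assumes cost grows with bandwidth**2."""
    spent, steps = 0.0, 0
    while spent < total_budget:
        bandwidth, _ = curriculum(spent / total_budget)
        spent += (bandwidth / FULL_RES) ** 2  # cost of this step
        steps += 1
    return steps


print(curriculum(0.0), curriculum(0.5), curriculum(1.0))
print(simulate(100.0))  # more than 100 steps fit into a 100-step budget
```

Because early steps run at a reduced bandwidth, the same budget covers more steps than full-resolution training would, which is the source of the speedup when the step count is held fixed instead.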

Figure 9 The unified, integrated generalized curriculum learning scheme: EfficientTrain++

3. Experimental Results

As a plug-and-play method, EfficientTrain++ reduces the actual training cost of a variety of visual backbones on ImageNet-1K by about 1.5×, with essentially no loss in performance or even an improvement.

Figure 10 ImageNet-1K results: EfficientTrain++ on a variety of visual backbones

The gains of EfficientTrain++ hold across different training budgets. At strictly matched performance, the training speedup for DeiT/Swin on ImageNet-1K is about 2-3×.

Figure 11 ImageNet-1K results: EfficientTrain++ under different training budgets

On ImageNet-22K, EfficientTrain++ achieves a 2-3× pre-training speedup with no loss in performance.

Figure 12 ImageNet-22K results: EfficientTrain++ on larger-scale training data

For smaller models, EfficientTrain++ can significantly raise the achievable upper bound of performance.

Figure 13 ImageNet-1K results: EfficientTrain++ can significantly raise the performance upper bound of smaller models

EfficientTrain++ is also effective for self-supervised learning algorithms (such as MAE).

Figure 14 EfficientTrain++ can be applied to self-supervised learning (such as MAE)

Models trained with EfficientTrain++ also lose no performance on downstream tasks such as object detection, instance segmentation, and semantic segmentation.

Figure 15 Results on COCO object detection, COCO instance segmentation, and ADE20K semantic segmentation

source:51cto.com