#Computer Vision Research Institute Column
This article introduces a paper recently accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI): EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training.
In recent years, "scaling" has been one of the central themes of computer vision research. With growing model sizes and training datasets, advances in learning algorithms, and the widespread use of regularization and data augmentation, visual backbone networks obtained through large-scale training (such as Vision Transformer and MAE trained on ImageNet-1K/22K, DINOv2, etc.) have achieved impressive performance on many important vision tasks, including visual recognition, object detection, and semantic segmentation.
However, "scaling" often brings prohibitively high training costs, which significantly hinders the further development and industrial application of vision foundation models.
To address this problem, a research team at Tsinghua University proposed a generalized curriculum learning algorithm: EfficientTrain++. The core idea is to extend the traditional curriculum learning paradigm of "selecting data from easy to difficult and training the model progressively" to one that "performs no selection along the data dimension and always uses all training data, but gradually reveals the features or patterns of each data sample, from easy to difficult, during training."
EfficientTrain++ has several important highlights:
Next, let’s take a look at the details of the study.
In recent years, the rapid development of large-scale foundation models has greatly advanced artificial intelligence and deep learning. In computer vision, representative works such as Vision Transformer (ViT), CLIP, SAM, and DINOv2 have shown that scaling up neural networks and training data can significantly push the performance boundaries of important vision tasks such as recognition, detection, and segmentation.
However, large foundation models often have high training costs; Figure 1 gives two typical examples. With 8 NVIDIA V100 or higher-performance GPUs, a single training run of GPT-3 or ViT-G would take years or even decades. Such training costs are difficult to afford for both academia and industry, and often only a few top institutions can advance deep learning by consuming large amounts of resources. An urgent question is therefore: how can the training efficiency of large-scale deep learning models be effectively improved?
Figure 1 Example: High training overhead of large deep learning foundation models
For computer vision models, a classic idea is curriculum learning, shown in Figure 2, which imitates the progressive and highly structured way humans learn: model training starts from the "simplest" training data and gradually introduces data from easy to difficult.
Figure 2 Classic Curriculum Learning Paradigm (Picture Source: "A Survey on Curriculum Learning", TPAMI'22)
However, despite its natural motivation, curriculum learning has not been widely adopted as a general method for training visual backbone models. The main reason is two key bottlenecks, shown in Figure 3. First, designing an effective training curriculum is not easy. Distinguishing "simple" from "difficult" samples often requires additional pre-trained models, more complex AutoML algorithms, reinforcement learning, and so on, and generalizes poorly. Second, the formulation of curriculum learning itself is somewhat unreasonable. Naturally distributed visual data is highly diverse. The lower part of Figure 3 gives an example with parrot images randomly selected from ImageNet: the training data contains parrots in different poses, at different distances from the camera, from different viewpoints and against different backgrounds, as well as diverse interactions between parrots and people or objects. Sorting such diverse data by a single "simple" versus "difficult" indicator is a rather coarse and contrived way of modeling it.
Figure 3 Two key bottlenecks that hinder large-scale application of curriculum learning in training visual backbone models
Motivated by the above challenges, this paper proposes a generalized curriculum learning paradigm. The core idea is to extend the traditional curriculum learning paradigm of "selecting data from easy to difficult and training the model progressively" to one that "performs no selection along the data dimension and always uses all training data, but gradually reveals the features or patterns of each data sample, from easy to difficult, during training", thus avoiding the limitations and sub-optimal designs caused by the data selection paradigm, as shown in Figure 4.
Figure 4 Traditional curriculum learning (sample dimension) vs. generalized curriculum learning (feature dimension)
This paradigm is mainly based on an interesting phenomenon: when a visual model is trained in the usual way, although it has access to all the information in the data at all times, it naturally learns to recognize some relatively simple discriminative features (patterns) in the data first, and then gradually learns to identify more difficult discriminative features on that basis. Moreover, this behavior is fairly general: "relatively simple" discriminative features can be readily identified in both the frequency domain and the spatial domain. The paper designs a series of experiments to demonstrate these findings, as described below.
From a frequency domain perspective, "low-frequency features" are "relatively simple" for the model. In Figure 5, the authors trained a DeiT-S model on the standard ImageNet-1K training data, applied low-pass filters with different bandwidths to the validation set so that only the low-frequency components of the validation images were retained, and reported the accuracy of DeiT-S on the low-pass filtered validation data throughout training. The resulting accuracy curves over the course of training are shown on the right side of Figure 5.
An interesting phenomenon emerges: in the early stages of training, evaluating on low-pass filtered validation data does not significantly reduce accuracy, and the point at which each curve separates from the accuracy on the normal validation set moves gradually to the right as the filter bandwidth increases. This shows that although the model always has access to both the low- and high-frequency parts of the training data, its learning process naturally starts by focusing only on low-frequency information, and the ability to identify higher-frequency features is acquired later in training (see the original paper for more evidence).
Figure 5 From a frequency domain perspective, the model naturally tends to learn to identify low-frequency features first
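The low-pass filtering used in this kind of evaluation can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `low_pass_filter`, the square mask shape, and the treatment of images as `(H, W)` or `(H, W, C)` float arrays are all assumptions made for the example.

```python
import numpy as np

def low_pass_filter(image: np.ndarray, bandwidth: int) -> np.ndarray:
    """Keep only the centred bandwidth x bandwidth block of low frequencies.

    image: float array of shape (H, W) or (H, W, C).
    The output has the same shape as the input.
    """
    h, w = image.shape[:2]
    # 2-D FFT over the spatial axes, zero frequency shifted to the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    # Square mask around the centre (an assumption; a circular mask also works).
    mask = np.zeros((h, w), dtype=bool)
    top, left = h // 2 - bandwidth // 2, w // 2 - bandwidth // 2
    mask[top:top + bandwidth, left:left + bandwidth] = True
    if image.ndim == 3:
        mask = mask[:, :, None]  # broadcast the mask over channels
    filtered = np.where(mask, spectrum, 0)
    out = np.fft.ifft2(np.fft.ifftshift(filtered, axes=(0, 1)), axes=(0, 1))
    # Drop the (numerically small) imaginary part left by the inverse FFT.
    return out.real

# Toy image: a low-frequency plus a high-frequency cosine along the width.
x = np.arange(32)
low = np.cos(2 * np.pi * 2 * x / 32)
high = np.cos(2 * np.pi * 12 * x / 32)
toy = np.tile(low + high, (32, 1))
toy_filtered = low_pass_filter(toy, bandwidth=8)  # only `low` should survive
```

Evaluating a model on images processed this way, at several bandwidths and at several points during training, is the kind of measurement Figure 5 reports.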
This finding leads to an interesting question: can we design a training curriculum that initially provides the model with only the low-frequency information of the visual input, and then gradually introduces high-frequency information?
Figure 6 investigates this idea: low-pass filtering is applied to the training data only during an early training phase of a specified length, and the rest of the training process is left unchanged. The results show that although the final performance improvement is limited, the final accuracy of the model is largely preserved even when only low-frequency components are provided for a considerable portion of early training. This agrees with the observation in Figure 5 that "the model mainly focuses on learning to identify low-frequency features in the early stages of training".
This discovery prompted the authors to think about training efficiency: since the model only needs the low-frequency components of the data in the early stages of training, and those components contain less information than the original data, can the model learn from the low-frequency components alone at a lower computational cost than processing the original input?
Figure 6 Providing only low-frequency components to the model for a long period of early training does not significantly affect the final performance
In fact, this idea is entirely feasible. As shown on the left side of Figure 7, the authors introduce a cropping operation in the Fourier spectrum of the image that crops out the low-frequency part and maps it back to pixel space. This low-frequency cropping operation exactly preserves the low-frequency information inside the cropped window while reducing the size of the image input, so the computational cost of learning from the input is reduced substantially.
Applying this low-frequency cropping to the model input in the early stages of training significantly reduces the overall training cost, and because the information the model needs for learning is retained as much as possible, the final model shows almost no loss in performance. The experimental results are shown in the lower right corner of Figure 7.
Figure 7 Low-frequency cropping: letting the model learn efficiently from low-frequency information only
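As a rough illustration of low-frequency cropping, the sketch below crops a centred window of the shifted Fourier spectrum and inverts it at the smaller size. It is a minimal NumPy example written for this article, not the paper's implementation: the name `low_frequency_crop`, the square window, and the intensity rescaling factor are assumptions.

```python
import numpy as np

def low_frequency_crop(image: np.ndarray, bandwidth: int) -> np.ndarray:
    """Crop the centred bandwidth x bandwidth window of the spectrum and map
    it back to pixel space as a smaller (bandwidth x bandwidth) image.

    image: float array of shape (H, W) or (H, W, C).
    """
    h, w = image.shape[:2]
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    top, left = h // 2 - bandwidth // 2, w // 2 - bandwidth // 2
    # The zero frequency stays at the centre of the cropped window.
    cropped = spectrum[top:top + bandwidth, left:left + bandwidth]
    small = np.fft.ifft2(np.fft.ifftshift(cropped, axes=(0, 1)), axes=(0, 1))
    # Assumed rescaling: the inverse FFT now divides by bandwidth**2 rather
    # than h * w, so compensate to keep pixel intensities in their old range.
    return small.real * (bandwidth * bandwidth) / (h * w)

# Toy check: a low-frequency cosine survives, only sampled on a coarser grid.
wave = np.tile(np.cos(2 * np.pi * 2 * np.arange(32) / 32), (32, 1))
wave_small = low_frequency_crop(wave, bandwidth=16)  # shape (16, 16)
```

Unlike a same-size low-pass filter, the output here has fewer pixels. For a model whose cost grows roughly with the number of input pixels or tokens, a 160 x 160 input in place of a 224 x 224 one would cost on the order of (160/224)^2, or about 0.51, of the original per step; this is a back-of-the-envelope estimate, not a figure from the paper.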
Beyond frequency domain operations, features that are "relatively simple" for the model can also be found from the perspective of spatial domain transformations. For example, the natural image information in raw visual input that has not undergone strong data augmentation or distortion is often "simpler" and easier for the model to learn, because it comes from the real-world distribution, whereas the additional information and invariances introduced by preprocessing techniques such as data augmentation are often harder to learn (a typical example is given on the left side of Figure 8).
In fact, existing research has also observed that data augmentation mainly plays a role in the later stages of training (such as "Improving Auto-Augment via Augmentation-Wise Weight Sharing", NeurIPS'20).
Along this dimension, generalized curriculum learning can be realized simply by varying the intensity of data augmentation: in the early stages of training, the model is given mainly the easier-to-learn natural image information in the training data. The right side of Figure 8 uses RandAugment as a representative example to verify this idea. RandAugment comprises a series of common spatial data augmentation transformations (such as random rotation, sharpness adjustment, affine transformation, and exposure adjustment).
It can be observed that training the model starting from weaker data augmentation can effectively improve the final performance of the model, and this technique is compatible with low-frequency cropping.
Figure 8 Finding features that are "easier to learn" for the model from the spatial domain perspective: a data augmentation view
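The weak-to-strong augmentation idea can be sketched as a simple schedule that maps training progress to an augmentation magnitude. The linear shape, the function name `augmentation_magnitude`, and the end value of 9 (a commonly used RandAugment magnitude) are assumptions chosen for illustration, not settings confirmed by this article.

```python
def augmentation_magnitude(progress: float,
                           start: float = 0.0,
                           end: float = 9.0) -> float:
    """Linearly ramp the augmentation magnitude with training progress.

    progress: fraction of training completed, clamped to [0, 1].
    start / end: magnitudes at the beginning and end of training
                 (hypothetical defaults).
    """
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * progress

# Magnitude to use at the start of each of 5 epochs.
per_epoch = [augmentation_magnitude(epoch / 5) for epoch in range(5)]
```

In practice the returned value would be passed to whatever augmentation pipeline is in use at the start of each epoch. Some implementations, such as torchvision's `RandAugment`, take an integer `magnitude`, so the value would need to be rounded first.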
So far, the paper has proposed the core framework and hypotheses of generalized curriculum learning, and has demonstrated its soundness and effectiveness by revealing two key phenomena in the frequency domain and the spatial domain. On this basis, the paper further carries out a series of systematic studies. Due to space limitations, please refer to the original paper for more details.
The EfficientTrain++ generalized curriculum learning schedule finally obtained in the paper is shown in Figure 9. EfficientTrain++ dynamically adjusts the bandwidth of the frequency domain low-frequency cropping and the intensity of the spatial domain data augmentation according to the percentage of the total training compute budget that has been consumed.
Notably, as a plug-and-play approach, EfficientTrain++ can be applied directly to a variety of visual backbone networks and diverse model training scenarios, with relatively stable and significant gains.
Figure 9 Unified and integrated generalized curriculum learning schedule: EfficientTrain++
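A schedule driven by consumed compute, rather than by epoch count, might look like the sketch below. Everything concrete in it is hypothetical: the stage table, the 224-pixel full resolution, the maximum magnitude of 9, and the assumption that per-step cost scales with the square of the input side length. The actual EfficientTrain++ schedule is the one given in the paper's Figure 9.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    budget_fraction: float  # share of total training compute spent in this stage
    bandwidth: int          # side length of the low-frequency crop

FULL_SIZE = 224
# Hypothetical stage table for illustration, NOT the schedule from the paper.
STAGES = [Stage(0.4, 128), Stage(0.3, 160), Stage(0.3, 224)]

def curriculum(consumed: float, stages=STAGES, max_magnitude: float = 9.0):
    """Return (crop bandwidth, augmentation magnitude) for a given fraction
    of the total compute budget already consumed."""
    consumed = min(max(consumed, 0.0), 1.0)
    magnitude = max_magnitude * consumed  # weak-to-strong, assumed linear
    boundary = 0.0
    for stage in stages:
        boundary += stage.budget_fraction
        if consumed < boundary:
            return stage.bandwidth, magnitude
    return stages[-1].bandwidth, magnitude

def steps_per_stage(total_cost: float, stages=STAGES, full_size: int = FULL_SIZE):
    """Convert each stage's compute share into a number of training steps.

    total_cost is measured in units of one full-resolution step; a step at a
    smaller bandwidth is assumed to cost (bandwidth / full_size) ** 2 units.
    """
    return [round(total_cost * s.budget_fraction * (full_size / s.bandwidth) ** 2)
            for s in stages]

halfway = curriculum(0.5)        # settings once half the budget is spent
steps = steps_per_stage(1000.0)  # steps affordable in each stage
```

Tying the schedule to consumed compute means the same table can be reused when the total budget changes: low-resolution stages simply fit more steps into their share of the budget than full-resolution ones.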
As a plug-and-play method, EfficientTrain++ reduces the actual training overhead of a variety of visual backbone networks on ImageNet-1K by about 1.5 times, with performance essentially maintained or improved.
Figure 10 ImageNet-1K experimental results: EfficientTrain++ performance on various visual backbone networks
The gains of EfficientTrain++ hold across different training cost budgets. At strictly equal performance, the training speedup for DeiT/Swin on ImageNet-1K is about 2-3 times.
Figure 11 ImageNet-1K experimental results: Performance of EfficientTrain++ under different training overhead budgets
EfficientTrain++ can achieve a 2-3 times pre-training speedup on ImageNet-22K without loss of performance.
Figure 12 ImageNet-22K experimental results: Performance of EfficientTrain++ on larger-scale training data
For smaller models, EfficientTrain++ can significantly raise the performance upper bound.
Figure 13 ImageNet-1K experimental results: EfficientTrain++ can significantly improve the performance upper bound of smaller models
EfficientTrain++ is equally effective for self-supervised learning algorithms such as MAE.
Figure 14 EfficientTrain++ can be applied to self-supervised learning (such as MAE)
Models trained with EfficientTrain++ also show no loss of performance on downstream tasks such as object detection, instance segmentation, and semantic segmentation.
Figure 15 COCO object detection, COCO instance segmentation, and ADE20K semantic segmentation experimental results