Have you ever had trouble with image retrieval?
Either it is difficult to pinpoint the image you need among a massive collection, or text-based retrieval returns unsatisfying results. To address this problem, researchers from Microsoft Research Asia and the Microsoft Cloud Computing and Artificial Intelligence Division conducted in-depth research on lightweight visual models and proposed a series of design and compression methods for visual pre-training models to meet the lightweight deployment requirements of visual Transformers.
Currently, these methods and models have been successfully applied to Microsoft’s Bing search engine, achieving accurate and fast inference and retrieval over tens of billions of images. This article explains the development, key technologies, applications, and potential of lightweight visual pre-training models, as well as future opportunities and challenges. We hope it helps readers better understand the field of lightweight visual pre-training and jointly advance the related technologies.
Recently, Transformer-based visual pre-training models have achieved superior performance on many computer vision tasks and have received widespread attention. However, visual Transformer pre-training models usually have large parameter counts and high computational complexity, which restricts their deployment in practical applications, especially on resource-constrained devices or in scenarios with strict real-time requirements. Therefore, research on “lightweighting” large visual pre-training models has become a new hot topic in academia and industry.
In this regard, researchers from Microsoft Research Asia and the Microsoft Cloud Computing and Artificial Intelligence Division conducted in-depth exploration of the structural design, training, and inference of large visual models, and also applied these models innovatively to lightweight, real-time, and cloud deployment. This article starts from the development of lightweight visual pre-training models, explores the key technologies in model lightweighting research and the application and potential of lightweight visual Transformer models in actual products, and finally looks ahead to the future opportunities and challenges of lightweight visual models.
In recent years, the progress of deep learning on the ImageNet image classification task has been mainly due to the substantial expansion of visual model capacity. As shown in Figure 1, in just a few years, the capacity of visual pre-training models has expanded more than 300 times, from the ResNet-101 model with 44.5 million parameters to the V-MoE model with 15 billion parameters. These large-scale visual pre-training models have made great strides in tasks such as image understanding and visual content generation.
Figure 1: Trend in the parameter counts of visual pre-training models
Whether it is Microsoft's 3-billion-parameter Swin-V2 model or the 1.8-billion-parameter ViT-G/14 model released by Google, large visual models have demonstrated superior performance on many tasks. In particular, their strong few-shot and even zero-shot generalization ability is critical to achieving general intelligence.
However, in many practical scenarios, limits on storage and computing resources make large models difficult to deploy directly or unable to meet real-time requirements. Research on lightweight visual pre-training models has therefore become increasingly important and has strong practical value. Although some existing work explores lightweight models, most of these methods are designed for specific tasks and specific structures. The versatility of the model is not considered during design and training, so these models are limited in their ability to generalize across data domains and tasks.
To achieve lightweight visual pre-training models, Microsoft researchers identified two key technical questions: 1) How can a more versatile lightweight model structure be designed? 2) Given the limited capacity of lightweight visual pre-training models, how can efficient pre-training methods be designed so that small models learn effective information from large-scale data? The researchers have achieved some initial results on both questions.
The core of improving the versatility of lightweight pre-training models lies in strengthening the learning ability of the model under limited resources (parameter count, latency, and so on), so that it can better learn general features from large-scale data. The researchers have therefore conducted in-depth exploration from the following three perspectives:
Lightweight, low-latency modules are an important component of lightweight models. In convolutional neural networks, representative lightweight modules include MobileNet's Inverted Residual Block and ShuffleNet's channel shuffle unit (Shuffle Unit). In the visual Transformer structure, because the attention computed between image patches does not take relative position information into account well, the researchers designed iRPE [1], a plug-and-play lightweight relative position encoding method for two-dimensional images. It can improve model performance without modifying any training hyperparameters. In addition, to address parameter redundancy in visual Transformers, the researchers designed the Weight Multiplexing module [2]. As shown in Figure 2, this method reduces the redundancy of model parameters through multi-layer weight reuse and introduces unshared linear transformations to increase parameter diversity.
Figure 2: Weight multiplexing module in Transformer
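The idea of reusing one set of weights across layers while keeping a small unshared transformation per layer can be illustrated with a toy sketch. This is not the Weight Multiplexing module from [2]: the class `SharedWeightStack`, the per-layer diagonal scaling, and the residual update are all assumptions chosen only to show how sharing shrinks the parameter count.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SharedWeightStack:
    """Hypothetical toy stack: every layer reuses one shared weight matrix,
    and each layer adds its own small (unshared) linear transformation."""

    def __init__(self, dim, num_layers, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One dim x dim matrix reused by all layers (the shared weights).
        self.shared = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                       for _ in range(dim)]
        # Assumption: the unshared transformation is a per-layer diagonal
        # scaling, which costs only `dim` parameters per layer.
        self.layer_scales = [[1.0 + rng.uniform(-0.1, 0.1) for _ in range(dim)]
                             for _ in range(num_layers)]

    def forward(self, x):
        for scales in self.layer_scales:
            h = [s * v for s, v in zip(scales, x)]   # unshared, layer-specific
            h = matvec(self.shared, h)               # shared across layers
            x = [a + b for a, b in zip(x, h)]        # residual connection
        return x

    def num_parameters(self):
        return self.dim * self.dim + len(self.layer_scales) * self.dim


def unshared_parameters(dim, num_layers):
    """Parameter count if every layer owned a separate dim x dim matrix."""
    return num_layers * dim * dim


stack = SharedWeightStack(dim=8, num_layers=4)
output = stack.forward([1.0] * 8)
print(stack.num_parameters(), unshared_parameters(8, 4))
```

In this sketch the shared matrix is paid for once, so adding layers only adds the cheap per-layer scaling parameters rather than a full new matrix.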
Neural Architecture Search (NAS) can automatically find lighter and better-performing model structures within a model design space [3]. In convolutional neural networks, representative works include NASNet and EfficientNet. For visual Transformer structure search, the researchers successively proposed AutoFormer [4] and S3 [5], which cover multiple dimensions of the visual model such as channel width, network depth, and number of heads, enabling dynamically scalable training and structure search of visual models. At the same accuracy, the models obtained through search have fewer parameters and require less computation. Notably, in S3 the researchers used E-T Error [5] and a weight-sharing supernet to guide and improve the search space. While obtaining a more efficient model structure, they also analyzed how the search space evolved, as shown in Figure 3. The structure search process in turn provides effective design experience and reference for designing lightweight models.
Figure 3: Lightweight model search space evolution process
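A weight-sharing supernet lets candidate structures inherit their weights from one large network, so candidates can be compared without training each from scratch. The sketch below is a simplified illustration under stated assumptions, not the AutoFormer or S3 procedure: the names `build_supernet`, `subnet_weights`, and `search` are hypothetical, and `proxy_score` is a placeholder for evaluating a subnet on validation data.

```python
import itertools
import random

MAX_DEPTH, MAX_WIDTH = 4, 8


def build_supernet(seed=0):
    """Create MAX_DEPTH layers, each a MAX_WIDTH x MAX_WIDTH weight matrix."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.1, 0.1) for _ in range(MAX_WIDTH)]
             for _ in range(MAX_WIDTH)]
            for _ in range(MAX_DEPTH)]


def subnet_weights(supernet, depth, width):
    """Inherit weights by slicing: the first `depth` layers and the
    top-left width x width block of each layer."""
    return [[row[:width] for row in layer[:width]]
            for layer in supernet[:depth]]


def count_parameters(depth, width):
    return depth * width * width


def proxy_score(weights):
    """Placeholder score (assumption). A real search would measure the
    accuracy of the subnet with its inherited weights on validation data."""
    return sum(abs(w) for layer in weights for row in layer for w in row)


def search(supernet, depths, widths, max_params):
    """Return (score, depth, width) of the best candidate within the
    parameter budget, or None if no candidate fits."""
    best = None
    for depth, width in itertools.product(depths, widths):
        if count_parameters(depth, width) > max_params:
            continue  # candidate exceeds the resource constraint
        score = proxy_score(subnet_weights(supernet, depth, width))
        if best is None or score > best[0]:
            best = (score, depth, width)
    return best


supernet = build_supernet()
print(search(supernet, depths=[2, 3, 4], widths=[4, 6, 8], max_params=150))
```

The exhaustive loop is only practical for a tiny space like this one; larger search spaces are usually explored with sampling or evolutionary strategies.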
This series of research results has not only been published in papers at top computer vision and machine learning conferences (CVPR, ICCV, ECCV, NeurIPS, etc.) [1-6], but has also, through cooperation with Microsoft Bing, been successfully applied to image search products, improving the understanding of image and video content in actual business.
Applications of lightweight visual pre-training models
Lightweight visual pre-training models have many practical uses, especially in scenarios with high real-time requirements or resource constraints, such as: real-time rendering and enhancement of cloud videos, end-to-end image testing, and video content understanding. Lightweight visual models have shown broad application prospects in smart retail, advanced manufacturing and other fields, and will play an important role in emerging industries such as the Metaverse and autonomous driving in the future. Taking image content search in Microsoft's Bing product as an example, the following will show you the practical application and deployment of lightweight visual models.
At present, content-based image search is relatively mature in understanding the category attributes of images, but there are still great challenges in understanding the content of complex scenes. Pictures of complex scenes usually have characteristics such as large depth of field, cluttered backgrounds, many characters, and complex object relationships, which significantly increase the difficulty of content understanding, thus placing higher requirements on the robustness and generalization of pre-training models.
For example, the search quality for anime pictures long resisted effective improvement. The main challenges include: lines and colors are more exaggerated than in real-scene pictures, the pictures contain more actions and scenes, and style and content vary greatly between comics. Figures 5 to 7 show three different cartoon characters and their behaviors, from "Slam Dunk", "Pikachu" and "Captain" respectively. Their comic styles and contents are very different. Understanding the content of comic pictures effectively places higher demands on visual pre-training models.
Figure 5: In the Microsoft Bing search engine, understanding of character actions in "Slam Dunk", including dunking, dribbling, stealing, and shooting
Figure 6: In the Microsoft Bing search engine, understanding of Pikachu's behaviors, such as eating apples, eating watermelon, and eating ice cream
Figure 7: Close-up of the young football player’s shooting action in Microsoft’s Bing search engine
The lightweight general-purpose visual models described above, together with a fast pre-training distillation algorithm, have been successfully deployed in Microsoft's Bing search engine. With the help of the vision-language multimodal pre-training model provided by Microsoft Research Asia, Microsoft's Bing image search has improved its understanding of comic content and can return images that better match user needs.
At the same time, the huge index of the Microsoft Bing search engine places very high demands on retrieval efficiency. The fast pre-training distillation method provided by Microsoft Research Asia effectively transfers the indexing capabilities of the pre-trained large model to a lightweight model, improving the recognition accuracy of the existing model by 14% while greatly improving the model's computational efficiency, achieving fast inference on tens of billions of images.
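The details of the fast pre-training distillation method are not given here, but the general idea of transferring knowledge from a large teacher model to a small student can be sketched with the standard temperature-scaled distillation loss. This is a generic illustration, not the method deployed in Bing; the function names and the default temperature are assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the softened
    student distribution, scaled by temperature squared as is conventional."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.0]
close_student = [3.5, 1.0, 0.0]   # nearly agrees with the teacher
far_student = [0.0, 1.0, 4.0]     # ranks the classes in reverse
print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

Minimizing this loss pushes the student's predicted distribution toward the teacher's, which is how a lightweight model can absorb what the large model has learned from large-scale data.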
Model lightweighting is central to the future application of artificial intelligence. As vision technology, algorithms, computing power, and data continue to improve, model complexity has increased dramatically, and the energy cost of neural network computation has become increasingly high. With their high computational efficiency and low deployment and application costs, lightweight visual models can offer major advantages in more real products in the future. In addition, lightweight pre-trained visual models running locally can better protect user data and privacy while supporting more services: user data no longer needs to leave the device, while functions such as model services can still be upgraded remotely.
Of course, researchers are also aware of the challenges facing lightweight pre-trained visual models. On the one hand, in terms of model structure design, how to achieve the best learning ability under constraints on parameter count and inference latency has always been a matter of close concern in academia and industry. Although many effective model structures have been accumulated and great progress has been made in areas such as the Universal Approximation Theorem (UAT) and Neural Architecture Search (NAS), there are still gaps between existing lightweight pre-trained visual models and large visual models, and further optimization is needed. On the other hand, in terms of training methods, academia and industry have proposed a variety of approaches for large visual models, such as self-supervision, image classification, and multimodal training, which have significantly improved the general capabilities of these models. How to design more effective training methods for lightweight models with limited capacity requires further research and exploration. Researchers at Microsoft Research Asia will continue to advance research on lightweight pre-trained visual models, and welcome more colleagues in the field to exchange ideas and explore related technologies.