How to perform image search efficiently and accurately? Take a look at the lightweight vision pre-trained model

Table of Contents

Large visual models keep emerging, but lightweight pre-trained models receive little attention

Research on key technologies of lightweight visual models

3. Visual large model compression and knowledge transfer

Another problem with lightweight pre-trained models is that their limited capacity makes it difficult to learn directly the rich information and knowledge contained in large-scale data. To solve this problem, researchers proposed a fast pre-training distillation scheme that transfers the knowledge of large models to lightweight small models [6]. As shown in Figure 4, unlike traditional single-stage knowledge distillation, fast pre-training distillation is divided into two stages: 1) compress and save the data augmentation information and prediction information used in the large model's training process; 2) load and restore the large model's prediction information and data augmentations, then use the large model as a teacher that guides the training of the lightweight student model through pre-training distillation.

On the compression side, and unlike pruning and quantization, the researchers' method builds on the weight multiplexing mentioned above [2], which is based on weight sharing. By introducing lightweight weight transformations and distillation, it compresses a large visual pre-training model into a more general and robust lightweight model. This method can compress the original large model by dozens of times without sacrificing performance.
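The two stages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation from [6]: the seeded `augment` function, the top-k sparse storage of teacher probabilities, the uniform spreading of the leftover probability mass, and the `toy_teacher` and `toy_student` functions are all hypothetical stand-ins.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def augment(image, seed):
    """Hypothetical augmentation: a seeded flip plus a brightness shift.

    Because it depends only on the seed, storing the seed is enough to
    reproduce the exact augmented input later.
    """
    rng = random.Random(seed)
    flipped = image[::-1] if rng.random() < 0.5 else list(image)
    shift = rng.uniform(-0.1, 0.1)
    return [p + shift for p in flipped]


def save_teacher_record(teacher, image, seed, k=2):
    """Stage 1: keep only the augmentation seed and the teacher's top-k probabilities."""
    probs = softmax(teacher(augment(image, seed)))
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return {"seed": seed, "topk": [(i, probs[i]) for i in top]}


def restore_soft_labels(record, num_classes):
    """Stage 2a: rebuild a dense distribution; leftover mass is spread uniformly (an assumption)."""
    kept = dict(record["topk"])
    remaining = num_classes - len(kept)
    rest = (1.0 - sum(kept.values())) / remaining if remaining else 0.0
    return [kept.get(i, rest) for i in range(num_classes)]


def distill_loss(student_logits, soft_labels):
    """Stage 2b: cross-entropy between the stored teacher labels and the student's predictions."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(soft_labels, student_probs))


# Toy stand-ins for a large teacher and a small student (three classes).
def toy_teacher(x):
    return [sum(x), x[0] - x[-1], -sum(x)]


def toy_student(x):
    return [0.5 * sum(x), 0.0, -0.5 * sum(x)]


image = [0.2, 0.5, 0.9]
record = save_teacher_record(toy_teacher, image, seed=7)        # done once, offline
soft = restore_soft_labels(record, num_classes=3)               # done at student training time
loss = distill_loss(toy_student(augment(image, record["seed"])), soft)
```

The point of the split is that the teacher runs only once per stored record; the student's training loop then needs just the saved seed and sparse labels, not a forward pass through the large model.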

Future opportunities and challenges

WBOY

Apr 08, 2023, 04:41 PM

Have you ever had trouble with image retrieval?

Either it is difficult to find the required image accurately among a massive number of images, or text-based retrieval gives unsatisfactory results. To address this problem, researchers from Microsoft Research Asia and the Microsoft Cloud Computing and Artificial Intelligence Division conducted in-depth research on lightweight visual models and proposed a series of design and compression methods for visual pre-training models, meeting the lightweight deployment requirements of visual Transformers.

Currently, these methods and models have been successfully applied in Microsoft's Bing search engine, achieving accurate and fast inference and retrieval over tens of billions of images. This article explains in depth the development, key technologies, applications, and potential of lightweight visual pre-training models, as well as future opportunities and challenges. We hope it helps readers better understand the field of lightweight visual pre-training and jointly promote the development of related technologies.

Recently, Transformer-based visual pre-training models have achieved superior performance on many computer vision tasks and have received widespread attention. However, visual Transformer pre-training models usually have a large number of parameters and high complexity, which restricts their deployment and use in practical applications, especially on resource-constrained devices or in scenarios with strict real-time requirements. Therefore, research on "lightweighting" large visual pre-training models has become a new hot topic in academia and industry.

In this regard, researchers from Microsoft Research Asia and the Microsoft Cloud Computing and Artificial Intelligence Division conducted in-depth exploration of the structural design, training, and inference of large visual models, and also applied these models innovatively in lightweight, real-time, and cloud deployments. This article starts from the development of lightweight visual pre-training models, explores the key technologies in model lightweighting research and the application and potential of lightweight visual Transformer models in actual products, and finally looks ahead to the future opportunities and challenges of lightweight visual models.

Large visual models keep emerging, but lightweight pre-trained models receive little attention

In recent years, the progress of deep learning on the ImageNet image classification task has mainly come from the substantial expansion of visual model capacity. As shown in Figure 1, in just a few years the capacity of visual pre-training models has expanded more than 300 times, from the ResNet-101 model with 44.5 million parameters to the V-MoE model with 15 billion parameters. These large-scale visual pre-training models have made great strides in tasks such as image understanding and visual content generation.
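The growth factor quoted above can be checked with a few lines of arithmetic. The sketch below also estimates how much storage the weights alone would need, assuming every parameter is stored as a 32-bit float (4 bytes); that storage format is an assumption for illustration and is not stated in the article.

```python
RESNET101_PARAMS = 44.5e6   # parameter count quoted in the text
VMOE_PARAMS = 15e9          # parameter count quoted in the text
BYTES_PER_FP32 = 4          # assumption: weights stored as 32-bit floats


def weight_gigabytes(num_params, bytes_per_param=BYTES_PER_FP32):
    """Storage for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


growth = VMOE_PARAMS / RESNET101_PARAMS
print(f"growth: about {growth:.0f}x")                                   # about 337x
print(f"ResNet-101 weights: {weight_gigabytes(RESNET101_PARAMS):.3f} GB")  # 0.178 GB
print(f"V-MoE weights: {weight_gigabytes(VMOE_PARAMS):.1f} GB")            # 60.0 GB
```

Under that assumption the larger model needs tens of gigabytes for its weights before any activations are counted, which illustrates why direct deployment on constrained devices is difficult.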

Figure 1: Trend in the parameter counts of visual pre-training models

Whether it is Microsoft's 3-billion-parameter Swin-V2 model or the 1.8-billion-parameter ViT-G/14 model released by Google, large visual models have demonstrated superior performance in many tasks. In particular, their strong few-shot and even zero-shot generalization ability is critical to achieving general intelligence.

However, in many practical scenarios, limited storage and computing resources make large models difficult to deploy directly or unable to meet real-time needs. Therefore, research on lightweight visual pre-training models has become increasingly important and has strong practical value. Although some existing work explores lightweight models, most of these methods are designed for specific tasks and specific structures. The versatility of the model is not considered during design and training, and their generalization across data domains and tasks is limited.

Research on key technologies of lightweight visual models

To make visual pre-training models lightweight, Microsoft researchers identified two key questions: 1) How can a more versatile lightweight model structure be designed? 2) Given the limited capacity of lightweight visual pre-training models, how can efficient pre-training methods be designed so that small models learn effective information from large-scale data? Working on these questions, the researchers have achieved some initial results.

The core of improving the versatility of lightweight pre-training models lies in strengthening the model's learning ability under limited resources (parameter count, latency, etc.), so that it can better learn general features from large-scale data. The researchers have therefore conducted in-depth exploration from the following three perspectives:

1. Lightweight module design

Lightweight, low-latency modules are an important part of lightweight models. In convolutional neural networks, representative lightweight modules include MobileNet's Inverted Residual Block and ShuffleNet's channel shuffle unit (Shuffle Unit). In the visual Transformer structure, because the attention computed between image patches does not take relative position information into account well, the researchers designed iRPE [1], a plug-and-play, lightweight relative position encoding method for two-dimensional images. It improves model performance without modifying any training hyperparameters. In addition, to address parameter redundancy in visual Transformers, the researchers designed the Weight Multiplexing module [2]. As shown in Figure 2, this method reduces the redundancy of model parameters by reusing weights across multiple layers, and introduces unshared linear transformations to increase parameter diversity.
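To make the idea of a two-dimensional relative position encoding concrete, here is a simplified sketch of a generic relative position bias for a grid of image patches. It is not the iRPE method from [1]; the function names and the flat `bias_table` layout are hypothetical, chosen only to show how a patch pair's 2D offset can select a learnable bias that is added to the attention scores.

```python
def relative_position_index(height, width):
    """Map every (query patch, key patch) pair to a bucket id for its 2D offset."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    index = []
    for qr, qc in coords:
        row = []
        for kr, kc in coords:
            dr = kr - qr + (height - 1)   # row offset shifted into 0 .. 2*height-2
            dc = kc - qc + (width - 1)    # column offset shifted into 0 .. 2*width-2
            row.append(dr * (2 * width - 1) + dc)
        index.append(row)
    return index


def add_relative_bias(scores, index, bias_table):
    """Add a per-offset bias to raw attention scores (applied before the softmax)."""
    return [[s + bias_table[i] for s, i in zip(score_row, index_row)]
            for score_row, index_row in zip(scores, index)]


# A 2x3 patch grid has 6 patches and (2*2-1) * (2*3-1) = 15 distinct offsets.
index = relative_position_index(2, 3)
bias_table = [0.01 * i for i in range(15)]       # would be learned in a real model
scores = [[0.0] * 6 for _ in range(6)]           # placeholder attention scores
biased = add_relative_bias(scores, index, bias_table)
```

The table grows with the number of distinct offsets rather than with the number of patch pairs, which is what keeps this kind of encoding lightweight.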

Figure 2: Weight multiplexing module in Transformer
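The weight reuse idea can be illustrated with a toy stack of layers that all share one weight matrix while each layer keeps a small private transformation. This is a minimal sketch under stated assumptions, not the module from [2]: the class name `SharedStack`, the per-layer scale vector standing in for the unshared linear transformation, and the residual ReLU layer are all hypothetical simplifications.

```python
import random


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SharedStack:
    """Several layers reuse one weight matrix; each layer owns a tiny scale vector."""

    def __init__(self, dim, num_layers, seed=0):
        rng = random.Random(seed)
        # Reused by every layer.
        self.shared = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
        # Unshared, one small vector per layer (a diagonal linear transform).
        self.private = [[1.0 + rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(num_layers)]

    def forward(self, x):
        for scale in self.private:
            transformed = [s * v for s, v in zip(scale, x)]    # unshared transform
            hidden = matvec(self.shared, transformed)          # shared weights
            x = [v + max(0.0, h) for v, h in zip(x, hidden)]   # residual + ReLU
        return x

    def num_parameters(self):
        dim = len(self.shared)
        return dim * dim + len(self.private) * dim


stack = SharedStack(dim=64, num_layers=12)
shared_count = stack.num_parameters()      # 64*64 + 12*64 = 4864
unshared_count = 12 * 64 * 64              # 49152 if every layer had its own matrix
output = stack.forward([1.0] * 64)
```

In this toy setting the shared stack has roughly a tenth of the parameters of twelve independent layers, while the private vectors still let each layer behave differently.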

2. Lightweight model search

Neural Architecture Search (NAS) can automatically find lighter and better-performing model structures in the model design space [3]. In convolutional neural networks, representative works include NASNet and EfficientNet. For visual Transformer structure search, researchers have successively proposed AutoFormer [4] and S3 [5], which cover multiple dimensions of the visual model such as channel width, network depth, and number of heads, and enable dynamically scalable training and structure search for visual models. At the same model accuracy, the models found through search have fewer parameters and require less computation. Notably, in S3 the researchers used E-T Error [5] and a weight-sharing supernet to guide and improve the search space. While obtaining a more efficient model structure, they also analyzed how the search space evolved, as shown in Figure 3. The structure search process also provides useful design experience and reference for designing lightweight models.

Figure 3: Lightweight model search space evolution process
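A structure search over depth, width, and number of heads can be sketched as a random search under a parameter budget. This is an illustration only, not AutoFormer [4] or S3 [5]: the search ranges, the rough parameter estimate, and especially `proxy_score` are hypothetical. In a weight-sharing setup, the score would instead come from evaluating each candidate with weights inherited from a trained supernet.

```python
import random

# Hypothetical search space covering the dimensions named in the text.
SEARCH_SPACE = {
    "depth": [6, 8, 10, 12],
    "embed_dim": [192, 256, 320, 384],
    "num_heads": [3, 4, 6],
}


def sample_architecture(rng):
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def estimate_params(arch, mlp_ratio=4):
    """Rough count per block: attention (4*d^2) plus MLP (2*mlp_ratio*d^2).

    Biases, normalization, and embeddings are ignored; in this simplified
    estimate the number of heads does not change the count.
    """
    d = arch["embed_dim"]
    return arch["depth"] * (4 * d * d + 2 * mlp_ratio * d * d)


def proxy_score(arch):
    """Placeholder for accuracy measured with weights inherited from a supernet."""
    return arch["depth"] * arch["embed_dim"] ** 0.5


def search(budget, trials=200, seed=0):
    """Keep the best-scoring sampled architecture that fits the parameter budget."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        arch = sample_architecture(rng)
        if estimate_params(arch) > budget:
            continue
        if best is None or proxy_score(arch) > proxy_score(best):
            best = arch
    return best


best = search(budget=6_000_000)
```

Swapping the budget or the scoring function changes which structure wins, which is the sense in which the search can trade parameter count against accuracy.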

Figure 4: Fast pre-training knowledge distillation

This series of research results has not only been published in many papers at top academic conferences in computer vision (CVPR, ICCV, ECCV, NeurIPS, etc.) [1-6], but has also, through cooperation with Microsoft Bing, been successfully applied to image search products, improving the ability to understand image and video content in actual business.

Application of lightweight visual pre-training models

Lightweight visual pre-training models have many practical uses, especially in scenarios with strict real-time requirements or resource constraints, such as real-time rendering and enhancement of cloud videos, end-to-end image testing, and video content understanding. Lightweight visual models have shown broad application prospects in smart retail, advanced manufacturing, and other fields, and will play an important role in emerging industries such as the Metaverse and autonomous driving in the future. Taking image content search in Microsoft's Bing product as an example, the following shows the practical application and deployment of lightweight visual models.

At present, content-based image search is relatively mature in understanding the category attributes of images, but there are still great challenges in understanding the content of complex scenes. Pictures of complex scenes usually have characteristics such as large depth of field, cluttered backgrounds, many characters, and complex object relationships, which significantly increase the difficulty of content understanding, thus placing higher requirements on the robustness and generalization of pre-training models.

For example, the search quality for anime pictures could not be improved effectively for a long time. The main challenges include: lines and colors are more exaggerated than in real-scene pictures, the pictures contain more actions and scenes, and style and content vary greatly between comics. Figures 5 to 7 show three different cartoon characters and their behaviors, from "Slam Dunk", "Pikachu", and "Captain". Their styles and contents differ considerably. Understanding the content of comic pictures effectively places higher demands on visual pre-training models.

Figure 5: In the Microsoft Bing search engine, understanding of the actions of Slam Dunk characters, including dunking, dribbling, stealing, and shooting

Figure 6: In the Microsoft Bing search engine, understanding of Pikachu's behaviors, such as eating apples, eating watermelon, and eating ice cream

Figure 7: Close-up of the young football player’s shooting action in Microsoft’s Bing search engine

The lightweight general-purpose visual model and the fast pre-training distillation algorithm mentioned above have been successfully used in Microsoft's Bing search engine. With the help of the vision-language multi-modal pre-training model provided by Microsoft Research Asia, Microsoft's Bing image search enhances its understanding of comic content and can return images that better match user needs.

At the same time, the huge index of the Microsoft Bing search engine places very high demands on retrieval efficiency. The fast pre-training distillation method provided by Microsoft Research Asia effectively transfers the indexing capabilities of the pre-trained large model to a lightweight model, improving the recognition accuracy of the existing model by 14% and greatly improving the model's computational efficiency, which enables fast inference over tens of billions of images.
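The article does not describe how the Bing index is built, so the following is only a generic sketch of embedding-based retrieval, included to show where a lightweight model's output would be used. It assumes each image has already been encoded into a vector; the image ids, the three-dimensional vectors, and the brute-force cosine-similarity scan are all hypothetical. A production system at the scale mentioned above would need an approximate nearest-neighbor index rather than a full scan.

```python
import heapq
import math


def normalize(vector):
    """Scale a non-zero vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def build_index(embeddings):
    """Store unit-length embeddings keyed by image id."""
    return {image_id: normalize(vec) for image_id, vec in embeddings.items()}


def search_index(index, query, k=2):
    """Return the ids of the k images whose embeddings are most similar to the query."""
    q = normalize(query)
    scored = ((sum(a * b for a, b in zip(q, vec)), image_id)
              for image_id, vec in index.items())
    return [image_id for _, image_id in heapq.nlargest(k, scored)]


# Hypothetical embeddings, as if produced by a lightweight image encoder.
index = build_index({
    "dunk": [1.0, 0.0, 0.1],
    "pikachu": [0.0, 1.0, 0.0],
    "football": [0.9, 0.1, 0.3],
})
results = search_index(index, [1.0, 0.0, 0.0], k=2)
print(results)   # ['dunk', 'football']
```

Because every query touches the encoder once and the index many times, a smaller and faster encoder directly reduces the per-query cost, which is the efficiency concern the paragraph above raises.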

Future opportunities and challenges

Model lightweighting is central to the future application of artificial intelligence. As vision technology, algorithms, computing power, and data continue to improve, model complexity has increased dramatically, and the energy cost of neural network computation has grown ever higher. With their high computational efficiency and low deployment and application costs, lightweight visual models can offer great advantages in more real products in the future. In addition, lightweight pre-trained visual models deployed locally can better protect user data and privacy while supporting more services: user data no longer needs to leave the device, while functions such as model services can still be upgraded remotely.

Of course, researchers are also aware of the challenges facing lightweight pre-trained visual models. On the one hand, in terms of model structure design, how to achieve the best learning ability under constraints on parameter count and inference latency has always been a matter of close concern in academia and industry. Although many effective model structures have been accumulated and great progress has been made in areas such as the Universal Approximation Theorem (UAT) and Neural Architecture Search (NAS), there are still gaps between existing lightweight pre-trained visual models and large visual models, and further optimization and improvement are needed. On the other hand, in terms of training methods, academia and industry have proposed a variety of training approaches for large visual models, such as self-supervision, image classification, and multi-modality, which have significantly improved the general capabilities of these models. How to design more effective training methods for lightweight models with limited capacity requires further research and exploration. Researchers at Microsoft Research Asia will continue to advance research on lightweight pre-trained visual models, and welcome more colleagues to discuss and explore related technologies in this field.
