1.3ms takes 1.3ms! Tsinghua's latest open source mobile neural network architecture RepViT-AI-php.cn

Table of Contents

Method

Alignment of training recipes

Optimization of Block Design

Optimization of macro-architectural elements

Deeper Downsampling Layers

Adjustment of micro-design

Network Architecture

Experiment

Image Classification

Home

Technology peripherals

1.3ms takes 1.3ms! Tsinghua's latest open source mobile neural network architecture RepViT

王林

Mar 11, 2024 pm 12:07 PM

iphone Architecture Neural Networks Open source overflow

1.3ms耗时！清华最新开源移动端神经网络架构 RepViT

Paper address: https://arxiv.org/abs/2307.09283

Code address: https:/ /github.com/THU-MIG/RepViT

1.3ms耗时！清华最新开源移动端神经网络架构 RepViT

RepViT performs well in the mobile ViT architecture and shows significant advantages. Next, we explore the contributions of this study.

It is mentioned in the article that lightweight ViTs generally perform better than lightweight CNNs on visual tasks, mainly due to their multi-head self-attention module (MSHA) allows the model to learn global representation. However, the architectural differences between lightweight ViTs and lightweight CNNs have not been fully studied.
In this study, the authors gradually improve the mobile-friendliness of standard lightweight CNNs (especially MobileNetV3) by integrating effective architectural choices of lightweight ViTs. This makes Derived the birth of a new pure lightweight CNN family, namely RepViT. It is worth noting that although RepViT has a MetaFormer structure, it is entirely composed of convolutions.
Experimental results It is shown that RepViT surpasses existing state-of-the-art lightweight ViTs and shows better performance and efficiency than existing state-of-the-art lightweight ViTs on various visual tasks, including ImageNet classification, Object detection and instance segmentation on COCO-2017, and semantic segmentation on ADE20k. In particular, on ImageNet, RepViT achieved the best performance on iPhone 12 With a latency of nearly 1ms and a Top-1 accuracy of over 80%, this is the first breakthrough for a lightweight model.

Okay, what everyone should be concerned about next is "How How can a model designed with such low latency but high accuracy come out?

Method

1.3ms耗时！清华最新开源移动端神经网络架构 RepViT

## again

ConvNeXt , the authors based on the ResNet50 architecture through rigorous theoretical and experimental analysis, finally designed a very excellent pure convolutional neural network architecture comparable to Swin-Transformer. Similarly, RepViT also mainly performs targeted transformations by gradually integrating the architectural design of lightweight ViTs into standard lightweight CNN, namely MobileNetV3-L ( Magic modification). In this process, the authors considered design elements at different levels of granularity and achieved the optimization goal through a series of steps.

1.3ms耗时！清华最新开源移动端神经网络架构 RepViT

Alignment of training recipes

In the paper, a new metric is introduced to measure latency on mobile devices and ensures that the training strategy is consistent with the currently popular lightweight Level ViTs remain consistent. The purpose of this initiative is to ensure the consistency of model training, which involves two key concepts of delay measurement and training strategy adjustment.

Latency Measurement Index

In order to more accurately measure the performance of the model on real mobile devices, the author chose to directly measure the actual delay of the model on the device. as a baseline measure. This metric differs from previous studies, which mainly optimize the model's inference speed through metrics such as FLOPs or model size, which do not always reflect the actual latency in mobile applications well.

Alignment of training strategy

Here, the training strategy of MobileNetV3-L is adjusted to align with other lightweight ViTs models. This includes using the AdamW optimizer [a must-have optimizer for ViTs models], 5 epochs of warm-up training, and 300 epochs of training using cosine annealing learning rate scheduling. Although this adjustment results in a slight decrease in model accuracy, fairness is guaranteed.

Optimization of Block Design

Next, the authors explored the optimal block design based on consistent training settings. Block design is an important component in CNN architecture, and optimizing block design can help improve the performance of the network.

1.3ms耗时！清华最新开源移动端神经网络架构 RepViT

Separate Token mixer and channel mixer

This is mainly for MobileNetV3-L The block structure was improved to separate token mixers and channel mixers. The original MobileNetV3 block structure consists of a 1x1 dilated convolution, followed by a depthwise convolution and a 1x1 projection layer, and then connects the input and output via residual connections. On this basis, RepViT advances the depth convolution so that the channel mixer and token mixer can be separated. To improve performance, structural reparameterization is also introduced to introduce multi-branch topology for deep filters during training. Finally, the authors succeeded in separating the token mixer and the channel mixer in the MobileNetV3 block and named such block the RepViT block.

Reduce the dilation ratio and increase the width

In the channel mixer, the original dilation ratio is 4, which means that the hidden dimension of the MLP block is four times the input dimension times, which consumes a large amount of computing resources and has a great impact on the inference time. To alleviate this problem, we can reduce the dilation ratio to 2, thereby reducing parameter redundancy and latency, bringing the latency of MobileNetV3-L down to 0.65ms. Subsequently, by increasing the width of the network, i.e. increasing the number of channels at each stage, the Top-1 accuracy increased to 73.5%, while the latency only increased to 0.89ms!

Optimization of macro-architectural elements

In this step, this article further optimizes the performance of MobileNetV3-L on mobile devices, mainly starting from the macro-architectural elements, including stem, downsampling layer, classifier as well as overall stage proportions. By optimizing these macro-architectural elements, the performance of the model can be significantly improved.

Shallow network using convolutional extractor

1.3ms耗时！清华最新开源移动端神经网络架构 RepViT Picture

ViTs typically use a "patchify" operation that splits the input image into non-overlapping patches as the stem. However, this approach has problems with training optimization and sensitivity to training recipes. Therefore, the authors adopted early convolution instead, an approach that has been adopted by many lightweight ViTs. In contrast, MobileNetV3-L uses a more complex stem for 4x downsampling. In this way, although the initial number of filters is increased to 24, the total latency is reduced to 0.86ms, while the top-1 accuracy increases to 73.9%.

Deeper Downsampling Layers

1.3ms耗时！清华最新开源移动端神经网络架构 RepViT

In ViTs, spatial downsampling is usually implemented through a separate patch merging layer. So here we can adopt a separate and deeper downsampling layer to increase the network depth and reduce the information loss due to resolution reduction. Specifically, the authors first used a 1x1 convolution to adjust the channel dimension, and then connected the input and output of two 1x1 convolutions through residuals to form a feedforward network. Additionally, they added a RepViT block in front to further deepen the downsampling layer, a step that improved the top-1 accuracy to 75.4% with a latency of 0.96ms.

Simpler classifier

1.3ms耗时！清华最新开源移动端神经网络架构 RepViT

In lightweight ViTs, the classifier is usually followed by a global average pooling layer consists of a linear layer. In contrast, MobileNetV3-L uses a more complex classifier. Because the final stage now has more channels, the authors replaced it with a simple classifier, a global average pooling layer and a linear layer. This step reduced the latency to 0.77ms while being top-1 accurate. The rate is 74.8%.

Overall stage proportion

The stage proportion represents the proportion of the number of blocks in different stages, thus indicating the distribution of calculations in each stage. The paper chooses a more optimal stage ratio of 1:1:7:1, and then increases the network depth to 2:2:14:2, thereby achieving a deeper layout. This step increases top-1 accuracy to 76.9% with a latency of 1.02 ms.

Adjustment of micro-design

Next, RepViT adjusts the lightweight CNN through layer-by-layer micro design, which includes selecting the appropriate convolution kernel size and optimizing squeeze-excitation (Squeeze- and-excitation, referred to as SE) layer location. Both methods significantly improve model performance.

Selection of convolution kernel size

It is well known that the performance and latency of CNNs are usually affected by the size of the convolution kernel. For example, to model long-range context dependencies like MHSA, ConvNeXt uses large convolutional kernels, resulting in significant performance improvements. However, large convolution kernels are not mobile-friendly due to its computational complexity and memory access cost. MobileNetV3-L mainly uses 3x3 convolutions, and 5x5 convolutions are used in some blocks. The authors replaced them with 3x3 convolutions, which resulted in latency being reduced to 1.00ms while maintaining a top-1 accuracy of 76.9%.

Position of SE layer

One advantage of self-attention modules over convolutions is the ability to adjust weights based on the input, which is called a data-driven property. As a channel attention module, the SE layer can make up for the limitations of convolution in the lack of data-driven properties, thereby leading to better performance. MobileNetV3-L adds SE layers in some blocks, mainly focusing on the last two stages. However, the lower-resolution stage gains smaller accuracy gains from the global average pooling operation provided by SE than the higher-resolution stage. The authors designed a strategy to use the SE layer in a cross-block manner at all stages to maximize the accuracy improvement with the smallest delay increment. This step improved the top-1 accuracy to 77.4% while delaying reduced to 0.87ms. [In fact, Baidu has already done experiments and comparisons on this point and reached this conclusion a long time ago. The SE layer is more effective when placed close to the deep layer]

Network Architecture

Finally, by integrating the above improvement strategies, we obtained the overall architecture of the model RepViT, which has multiple variants, such as RepViT-M1/ M2/M3. Likewise, the different variants are mainly distinguished by the number of channels and blocks per stage.

1.3ms耗时！清华最新开源移动端神经网络架构 RepViT

Experiment

Image Classification

1.3ms耗时！清华最新开源移动端神经网络架构 RepViT

##Detection and Segmentation

1.3ms耗时！清华最新开源移动端神经网络架构 RepViT

Summary

This paper re-examines the efficient design of lightweight CNNs by introducing the architectural choice of lightweight ViT. This led to the emergence of RepViT, a new family of lightweight CNNs designed for resource-constrained mobile devices. RepViT outperforms existing state-of-the-art lightweight ViTs and CNNs on various vision tasks, showing superior performance and latency. This highlights the potential of purely lightweight CNNs for mobile devices.

The above is the detailed content of 1.3ms takes 1.3ms! Tsinghua's latest open source mobile neural network architecture RepViT. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

Assassin's Creed Shadows: Seashell Riddle Solution

3 weeks ago By DDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

2 weeks ago By DDD

Where to find the Crane Control Keycard in Atomfall

3 weeks ago By DDD

Saving in R.E.P.O. Explained (And Save Files)

1 months ago By 尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows - How To Find The Blacksmith And Unlock Weapon And Armour Customisation

4 weeks ago By DDD

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Where is the login entrance for gmail email?

7569

CakePHP Tutorial

1386

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

107

Related knowledge

Anbi app official download v2.96.2 latest version installation Anbi official Android version Mar 04, 2025 pm 01:06 PM

Binance App official installation steps: Android needs to visit the official website to find the download link, choose the Android version to download and install; iOS search for "Binance" on the App Store. All should pay attention to the agreement through official channels.

How to solve the problem of 'Undefined array key 'sign'' error when calling Alipay EasySDK using PHP? Mar 31, 2025 pm 11:51 PM

Problem Description When calling Alipay EasySDK using PHP, after filling in the parameters according to the official code, an error message was reported during operation: "Undefined...

Download link of Ouyi iOS version installation package Feb 21, 2025 pm 07:42 PM

Ouyi is a world-leading cryptocurrency exchange with its official iOS app that provides users with a convenient and secure digital asset management experience. Users can download the Ouyi iOS version installation package for free through the download link provided in this article, and enjoy the following main functions: Convenient trading platform: Users can easily buy and sell hundreds of cryptocurrencies on the Ouyi iOS app, including Bitcoin and Ethereum. and Dogecoin. Safe and reliable storage: Ouyi adopts advanced security technology to provide users with safe and reliable digital asset storage. 2FA, biometric authentication and other security measures ensure that user assets are not infringed. Real-time market data: Ouyi iOS app provides real-time market data and charts, allowing users to grasp encryption at any time

The latest price of Bitcoin in 2018-2024 USD Feb 15, 2025 pm 07:12 PM

Real-time Bitcoin USD Price Factors that affect Bitcoin price Indicators for predicting future Bitcoin prices Here are some key information about the price of Bitcoin in 2018-2024:

How to install and register an app for buying virtual coins? Feb 21, 2025 pm 06:00 PM

Abstract: This article aims to guide users on how to install and register a virtual currency trading application on Apple devices. Apple has strict regulations on virtual currency applications, so users need to take special steps to complete the installation process. This article will elaborate on the steps required, including downloading the application, creating an account, and verifying your identity. Following this article's guide, users can easily set up a virtual currency trading app on their Apple devices and start trading.

Is H5 page production a front-end development? Apr 05, 2025 pm 11:42 PM

Yes, H5 page production is an important implementation method for front-end development, involving core technologies such as HTML, CSS and JavaScript. Developers build dynamic and powerful H5 pages by cleverly combining these technologies, such as using the <canvas> tag to draw graphics or using JavaScript to control interaction behavior.

How to open XML files with iPhone Apr 02, 2025 pm 11:00 PM

There is no built-in XML viewer on iPhone, and you can use third-party applications to open XML files, such as XML Viewer, JSON Viewer. Method: 1. Download and install the XML viewer in the App Store; 2. Find the XML file on the iPhone; 3. Press and hold the XML file to select "Share"; 4. Select the installed XML viewer app; 5. The XML file will open in the app. Note: 1. Make sure the XML viewer is compatible with the iPhone iOS version; 2. Be careful about case sensitivity when entering file paths; 3. Be careful with XML documents containing external entities

How to customize the resize symbol through CSS and make it uniform with the background color? Apr 05, 2025 pm 02:30 PM

The method of customizing resize symbols in CSS is unified with background colors. In daily development, we often encounter situations where we need to customize user interface details, such as adjusting...

See all articles