
How much potential do fixed-parameter models have? CUHK, Shanghai AI Lab and others propose EVL, an efficient video understanding framework


Visual foundation models have advanced remarkably over the past two years. On the one hand, pre-training on large-scale Internet data equips these models with a large number of semantic concepts, giving them strong generalization ability; on the other hand, the growth in model size needed to fully exploit such large-scale datasets makes these models inefficient to transfer to downstream tasks, especially for video understanding models that must process many frames.


  • Paper link: https://arxiv.org/abs/2208.03550
  • Code link: https://github.com/OpenGVLab/efficient-video-recognition

Motivated by these two observations, researchers from The Chinese University of Hong Kong, Shanghai Artificial Intelligence Laboratory, and other institutions proposed EVL, an efficient transfer learning framework for video understanding. By keeping the weights of the backbone foundation model fixed, it saves training computation and memory; at the same time, by exploiting multi-level, fine-grained intermediate features, it retains as much of the flexibility of traditional end-to-end fine-tuning as possible.

Figure 1 below shows the results of the EVL method on the Kinetics-400 video understanding dataset. The experiments show that the method reduces training cost while still fully exploiting the potential of visual foundation models on video understanding tasks.

Figure 1: Recognition accuracy comparison on Kinetics-400; the horizontal axis is the amount of inference computation, and the vertical axis is accuracy.

Method

The overall pipeline of the algorithm is shown in Figure 2(a). For a video sample, we take T frames, feed them into an image recognition network (CLIP is used as an example), and extract features. Unlike traditional methods, we extract multi-layer, unpooled features from the last several layers of the image recognition network to obtain richer and finer-grained image information, and the parameter weights of the image recognition network remain fixed throughout video learning. The multi-layer feature maps are then fed, layer by layer, into a Transformer decoder for video-level information aggregation, and the [CLS] features decoded at each layer are used to produce the final classification prediction.
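To make the pipeline concrete, here is a minimal PyTorch-style sketch of the idea: a frozen image backbone produces unpooled token features from its last few layers for each of the T frames, and a lightweight Transformer decoder with a learnable query aggregates them into a video-level representation. The class name, dimensions, and the assumption that `backbone(frames)` returns a list of per-layer token features are ours for illustration; the authors' actual implementation is in the linked repository.

```python
import torch
import torch.nn as nn

class EVLSketch(nn.Module):
    """Minimal sketch of the EVL idea: a frozen image backbone plus a small
    Transformer decoder that aggregates multi-layer, unpooled frame features.
    Shapes and the backbone interface are illustrative assumptions."""

    def __init__(self, backbone, feat_dim=768, num_layers=4, num_classes=400):
        super().__init__()
        self.backbone = backbone                      # e.g. a CLIP image encoder
        for p in self.backbone.parameters():
            p.requires_grad_(False)                   # backbone weights stay fixed

        # one decoder block per selected backbone layer
        self.decoders = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=feat_dim, nhead=12, batch_first=True)
            for _ in range(num_layers)
        ])
        self.query = nn.Parameter(torch.zeros(1, 1, feat_dim))   # learnable video-level query
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, frames):
        # frames: (B, T, C, H, W) sampled video frames
        B, T = frames.shape[:2]
        with torch.no_grad():
            # assumed interface: the backbone returns a list with one tensor of
            # unpooled token features, shape (B*T, N, D), per selected layer
            feats = self.backbone(frames.flatten(0, 1))

        q = self.query.expand(B, -1, -1)
        for dec, f in zip(self.decoders, feats):
            # all spatial tokens of all T frames form the decoder's memory
            mem = f.reshape(B, T * f.shape[1], f.shape[2])
            q = dec(q, mem)                           # per-layer video-level aggregation

        return self.head(q.squeeze(1))                # classification logits
```

Because only the decoder, query, and classification head receive gradients, no optimizer state or backward-pass activations are kept for the backbone.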

As shown in Figure 2(b), because the Transformer decoder aggregates features in a permutation-invariant way, we add extra temporal information modeling modules to the network to better extract position-dependent, fine-grained temporal information. Specifically, we add three kinds of position-dependent temporal information: first, temporal position embeddings; second, depthwise convolution along the temporal dimension; and third, attention information between adjacent frames. For the inter-frame attention information, we take the Query and Key features of the corresponding layer from the image recognition network and compute attention maps between adjacent frames (unlike in the image recognition network, where the attention map is computed from the Query and Key features of the same frame). The resulting attention map explicitly reflects how objects move between adjacent frames. After a linear projection, the attention map yields a group of vectors characterizing object displacement, which is fused into the image features by element-wise addition.
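Under the same illustrative assumptions, the sketch below shows one way these three temporal components could be attached to per-frame token features: temporal position embeddings, a depthwise convolution along the temporal axis, and an inter-frame attention term built from the backbone's Query/Key features. It is a plausible reading of the description above, not the authors' code; in particular, the projection of each token's attention row from `num_tokens` to `dim` is our interpretation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalModulesSketch(nn.Module):
    """Illustrative sketch of EVL's three fine-grained temporal components;
    exact shapes and fusion points are assumptions based on the description."""

    def __init__(self, dim=768, num_frames=8, num_tokens=196):
        super().__init__()
        # (1) temporal position embeddings: one learnable vector per frame index
        self.temporal_pos = nn.Parameter(torch.zeros(1, num_frames, 1, dim))
        # (2) depthwise convolution along the temporal axis (groups == channels)
        self.temporal_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # (3) projection of the adjacent-frame attention map to displacement vectors
        self.attn_proj = nn.Linear(num_tokens, dim)

    def forward(self, feats, q, k):
        # feats: (B, T, N, D) per-frame token features from the frozen backbone
        # q, k:  (B, T, N, D) Query/Key features taken from the same backbone layer
        B, T, N, D = feats.shape

        # (1) add temporal position embeddings
        feats = feats + self.temporal_pos[:, :T]

        # (2) depthwise temporal convolution over each token's trajectory
        x = feats.permute(0, 2, 3, 1).reshape(B * N, D, T)           # (B*N, D, T)
        x = self.temporal_conv(x).reshape(B, N, D, T).permute(0, 3, 1, 2)
        feats = feats + x

        # (3) attention between adjacent frames: Query of frame t vs. Key of frame t+1
        attn = torch.einsum('btnd,btmd->btnm', q[:, :-1], k[:, 1:]) / D ** 0.5
        attn = attn.softmax(dim=-1)                                  # (B, T-1, N, N)
        disp = self.attn_proj(attn)                                  # (B, T-1, N, D)
        disp = F.pad(disp, (0, 0, 0, 0, 0, 1))                       # zero-pad the last frame
        return feats + disp                                          # element-wise addition
```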

Figure 2: Structure of the EVL algorithm. (a) Overall architecture; (b) temporal information modeling module.

Figure 3: Mathematical formulation of the inter-frame attention features.
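Because the original equations appear only as images, the following is a plausible LaTeX reconstruction based on the textual description above; the symbols ($Q_t$, $K_{t+1}$, $W$, $x_t$, $N$, $d$) are our own notation and may differ from the paper's:

```latex
A_t = \mathrm{softmax}\!\left(\frac{Q_t K_{t+1}^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{N \times N},
\qquad z_t = A_t W, \qquad \tilde{x}_t = x_t + z_t
```

Here $Q_t$ and $K_{t+1}$ are the Query and Key token features of adjacent frames taken from the image recognition network, $W$ linearly projects each token's $N$-dimensional attention row to the feature dimension, and the resulting displacement vectors $z_t$ are added element-wise to the frame features $x_t$.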

Experiments

In Figure 1 and Table 1 we compare against several important prior video understanding methods. Although our focus is on reducing training overhead, our method still outperforms existing methods in accuracy at comparable amounts of computation.

In Table 2 we show the reduction in training overhead brought by fixing the backbone network. In terms of memory, on a 16 GB V100 GPU, the fixed backbone allows a single-card batch size of up to 64, whereas end-to-end training reaches only 8; in terms of time, fixing the backbone reduces training time by a factor of 3 to 4.
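These savings come from the fact that no gradients, optimizer state, or backward-pass activations are kept for the frozen backbone. A generic PyTorch illustration of this setup follows (not the authors' training script; the optimizer and learning rate are placeholders):

```python
import torch

def configure_frozen_training(model, lr=1e-4):
    """Freeze the backbone and build an optimizer over the remaining parameters.
    With the backbone frozen, its features can be computed under torch.no_grad(),
    so its activations need not be stored for the backward pass."""
    for p in model.backbone.parameters():
        p.requires_grad_(False)

    trainable = [p for p in model.parameters() if p.requires_grad]
    print(f"trainable parameters: {sum(p.numel() for p in trainable):,}")
    return torch.optim.AdamW(trainable, lr=lr)
```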

In Table 3 we show the improvement in recognition performance brought by fine-grained feature maps. Multi-layer, unpooled features allow us to retain considerable flexibility even with the backbone weights fixed. Using unpooled features brings the most significant improvement (about 3%), while using multi-layer decoders and mid-layer features each contribute a further improvement of about 1%.

Finally, Table 4 shows the effect of the fine-grained temporal information modules. Although fine-grained temporal information has limited impact on Kinetics-400, it is very important on Something-Something-v2: the three fine-grained temporal modules together bring total improvements of about 0.5% and about 14% on the two datasets, respectively.

Table 1: Comparison results with existing methods on Kinetics-400

Table 2: Reduction in training overhead from fixing the backbone network weights

Table 3: The impact of fine-grained feature maps on accuracy

Table 4: Effect of fine-grained temporal information modeling on different datasets

Summary

This article proposes the EVL framework for video understanding learning, which for the first time demonstrates the great potential of a fixed image backbone network on video understanding problems, and which also makes high-performance video understanding more accessible to research groups with limited computing resources. We believe that as the quality and scale of visual foundation models improve, our method can serve as a reference for subsequent research on lightweight transfer learning algorithms.
