Table of Contents
Method
Experiment
Summary

How much potential do fixed-parameter models have? Hong Kong Chinese, Shanghai AI Lab and others proposed an efficient video understanding framework EVL

Apr 12, 2023 08:58 PM

Visual foundation models have advanced remarkably over the past two years. On the one hand, pre-training on large-scale Internet data equips a model with a wide range of semantic concepts and therefore strong generalization; on the other hand, the growth in model size needed to fully exploit such large-scale datasets makes these models inefficient to transfer to downstream tasks, especially video understanding models that must process multiple frames.

  • Paper link: https://arxiv.org/abs/2208.03550
  • Code link: https://github.com/OpenGVLab/efficient-video-recognition

Motivated by these two observations, researchers from the Chinese University of Hong Kong, Shanghai Artificial Intelligence Laboratory, and other institutions proposed EVL, an efficient transfer learning framework for video understanding. By freezing the weights of the backbone foundation model, it saves training computation and memory; at the same time, by exploiting multi-level, fine-grained intermediate features, it retains as much of the flexibility of traditional end-to-end fine-tuning as possible.

Figure 1 below shows results of the EVL method on the Kinetics-400 video understanding dataset. The experiments show that the method reduces training overhead while still fully exploiting the potential of the visual foundation model on video understanding tasks.

Figure 1: Recognition accuracy comparison on Kinetics-400. The horizontal axis is the amount of inference computation; the vertical axis is accuracy.

Method

The overall structure of the algorithm is shown in Figure 2(a). For a video sample, we take T frames, feed them into an image recognition network (taking CLIP as an example), and extract features. Unlike traditional methods, we extract multi-layer, unpooled features from the last few layers of the image recognition network to obtain richer, finer-grained image information, and the parameter weights of the image recognition network remain fixed throughout video learning. The multi-layer feature maps are then fed in sequence into a Transformer decoder for video-level information aggregation, and the [CLS] features decoded at each layer are used to produce the final classification prediction.
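To make the structure concrete, here is a minimal PyTorch sketch of such a pipeline. The names and hyperparameters (EVLHead, four tapped layers, a single query token) are illustrative assumptions, not the authors' released code; see the GitHub repository above for the official implementation.

```python
import torch
import torch.nn as nn

class EVLHead(nn.Module):
    """Aggregates frozen multi-layer image features into a video prediction."""

    def __init__(self, dim=768, num_layers=4, num_heads=12, num_classes=400):
        super().__init__()
        # A single learnable query token gathers video-level information.
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        # One decoder layer per tapped backbone level.
        self.decoders = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(num_layers)
        )
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats):
        # feats: unpooled features from the last few backbone layers,
        # each of shape (B, T * L, dim) with frames and tokens flattened.
        q = self.query.expand(feats[0].size(0), -1, -1)
        for decoder, f in zip(self.decoders, feats):
            q = decoder(q, f)  # cross-attention into the frozen features
        return self.classifier(q.squeeze(1))

# Example: 4 feature levels from a frozen ViT-B/16, 8 frames of 196 tokens each.
feats = [torch.randn(2, 8 * 196, 768) for _ in range(4)]
logits = EVLHead()(feats)  # shape (2, 400)
```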

As shown in Figure 2(b), because the Transformer decoder aggregates features in an order-invariant way, we add extra temporal information modeling modules to the network to better extract position-related, fine-grained temporal information. Specifically, we add three kinds of position-related temporal information: first, temporal position embeddings (Position Embeddings); second, depthwise separable convolution along the temporal dimension (Depthwise Convolution); and third, attention information between adjacent frames. For the inter-frame attention, we take the Query and Key features of the corresponding layer from the image recognition network and compute the attention map between adjacent frames (unlike in the image recognition network, where the Query and Key features come from the same frame, here they come from neighboring frames). The resulting attention map explicitly reflects how objects move between adjacent frames. After a linear projection, the attention map yields a group of vectors that capture the objects' displacement, and these are integrated into the image features by element-wise addition.
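A hedged PyTorch sketch of these three components follows; the tensor shapes and the exact form of the attention-map projection are our assumptions, not necessarily the paper's precise formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalModules(nn.Module):
    """Sketch of the three position-related temporal components."""

    def __init__(self, dim=768, num_frames=8, num_tokens=196, num_heads=12):
        super().__init__()
        # (1) Temporal position embeddings, one vector per frame.
        self.time_embed = nn.Parameter(torch.zeros(1, num_frames, 1, dim))
        # (2) Depthwise convolution along the temporal dimension.
        self.dw_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # (3) Projection of the inter-frame attention map into feature space.
        self.attn_proj = nn.Linear(num_heads * num_tokens, dim)

    def forward(self, x, q, k):
        # x: (B, T, L, D) unpooled features; q, k: (B, T, H, L, d) Query/Key
        # features taken from the corresponding frozen backbone layer.
        B, T, L, D = x.shape
        x = x + self.time_embed[:, :T]                                   # (1)
        t = x.permute(0, 2, 3, 1).reshape(B * L, D, T)
        x = x + self.dw_conv(t).reshape(B, L, D, T).permute(0, 3, 1, 2)  # (2)
        # (3) Attention between each frame's Query and the previous frame's Key.
        attn = torch.einsum('bthld,bthmd->bthlm',
                            q[:, 1:], k[:, :-1]) * q.size(-1) ** -0.5
        attn = attn.softmax(dim=-1)                     # (B, T-1, H, L, L)
        attn = attn.permute(0, 1, 3, 2, 4).flatten(-2)  # (B, T-1, L, H*L)
        shift = self.attn_proj(attn)                    # displacement vectors
        shift = F.pad(shift, (0, 0, 0, 0, 1, 0))        # zero shift for frame 0
        return x + shift                                # element-wise addition
```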

Figure 2: Structure of the EVL algorithm. (a) Overall structure; (b) temporal information modeling module.

Figure 3: Mathematical formulation of the inter-frame attention features.
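The formulas themselves survive only as an image, but a plausible reconstruction from the description above (our notation, not necessarily the paper's exact symbols) is:

$$A_t = \mathrm{Softmax}\!\left(\frac{Q_t K_{t-1}^{\top}}{\sqrt{d}}\right), \qquad z_t \leftarrow z_t + \mathrm{Linear}(A_t),$$

where $Q_t$ and $K_{t-1}$ are the Query and Key features of frames $t$ and $t-1$ taken from the frozen image network, $d$ is the per-head feature dimension, and $\mathrm{Linear}$ is the learned projection that maps each query token's attention map to a displacement feature.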

Experiment

In Figure 1 and Table 1 we compare against important prior video understanding methods. Although our method focuses on reducing training overhead, it still outperforms existing methods in accuracy at the same amount of computation.

In Table 2 we show the reduction in training overhead brought by fixing the backbone network. In terms of memory, on a 16 GB V100 GPU, the fixed backbone allows a single-card batch size of up to 64, whereas end-to-end training reaches only 8; in terms of time, fixing the backbone reduces training time by a factor of 3 to 4.
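These savings follow from not having to store activations, gradients, or optimizer states for the frozen backbone. A minimal illustration of the mechanism (with a stand-in module rather than the actual CLIP encoder):

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(768, 768), nn.GELU())  # stand-in encoder

# Freezing removes the backbone's gradients and optimizer states, and
# extracting features under no_grad() also avoids storing its activations.
backbone.requires_grad_(False).eval()

with torch.no_grad():
    feats = backbone(torch.randn(64, 768))  # larger batches now fit in memory
```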

In Table 3 we show the improvement in recognition performance from fine-grained feature maps. Multi-layer, unpooled features allow us to retain a considerable degree of flexibility while the backbone weights stay fixed. Using unpooled features brings the most significant improvement (about 3%), followed by using multi-layer decoders and mid-layer features, each of which adds about 1%.

Finally, Table 4 shows the effect of the fine-grained temporal information modules. Although fine-grained temporal information has limited impact on Kinetics-400 performance, it is very important on Something-Something-v2: the three fine-grained temporal modules together bring improvements of about 0.5% and about 14% on the two datasets, respectively.

Table 1: Comparison results with existing methods on Kinetics-400

Table 2: Reduction in training overhead from fixing the backbone network weights

Table 3: The impact of fine-grained feature maps on accuracy

Table 4: Effect of fine-grained temporal information modeling on different datasets

Summary

This article proposed EVL, a video understanding learning framework that for the first time demonstrates the great potential of a fixed image backbone on video understanding problems, and that makes high-performance video understanding more accessible to research groups with limited computing resources. We also believe that, as the quality and scale of visual foundation models improve, our method can serve as a reference for subsequent research on lightweight transfer learning algorithms.
