Table of Contents
Background introduction
Research Motivation
Method Introduction
MAE Overview
Characteristics of video data
Temporal redundancy
Temporal correlation
VideoMAE Method Introduction
Temporal downsampling
Joint space-time cube embedding
Tube masking with an extremely high masking ratio
Joint space-time self-attention
VideoMAE implementation details
Ablation experiments
Decoder design
Masking strategies
Reconstruction target
Pre-training strategy
Pre-training dataset
Pre-training epochs
Important features of VideoMAE
VideoMAE is a data-efficient learner
Main results
Impact on the community
Summary

VideoMAE: a new paradigm of simple and efficient video self-supervised pre-training

Apr 12, 2023 pm 03:28 PM

Paper link: https://arxiv.org/abs/2203.12602

The code and pre-trained weights have been open-sourced on GitHub: https://github.com/MCG-NJU/VideoMAE

Directory
1. Background introduction
2. Research motivation
3. Method introduction
4. VideoMAE implementation details
5. Ablation experiments
6. Important features of VideoMAE
7. Main results
8. Impact on the community
9. Summary

Background introduction

Video self-supervised learning: learns spatiotemporal representations from video data without using any label information, relying on self-supervision through carefully designed pretext (proxy) tasks. Existing video self-supervised pre-training algorithms fall into two main categories: (1) methods based on contrastive learning, such as CoCLR and CVRL; (2) methods based on temporal pretext tasks, such as DPC, SpeedNet, and Pace.

Action recognition: classify a given trimmed video and identify the actions of the people in it. Current mainstream methods are 2D-based (TSN, TSM, TDN, etc.), 3D-based (I3D, SlowFast, etc.), and Transformer-based (TimeSformer, ViViT, MViT, Video Swin, etc.). As a fundamental task in the video field, action recognition models often serve as the backbone for various downstream video tasks (such as temporal action detection and spatiotemporal action detection), extracting spatiotemporal features at the video level or at the clip level.

Action detection: this task requires not only classifying the actions of the people in a video but also marking their spatial positions with bounding boxes. Action detection has a wide range of applications in scenarios such as movie analysis and sports video analysis.

Research Motivation

Since the Vision Transformer was proposed at the end of 2020, Transformers have been widely adopted in computer vision and have helped improve performance on a wide range of vision tasks.

However, Vision Transformers require large-scale labeled datasets for training. Initially, the vanilla ViT achieved good performance through supervised pre-training on hundreds of millions of labeled images. Current video Transformers (such as TimeSformer and ViViT) are usually built on Vision Transformer models trained on image data and rely on models pre-trained on large-scale image datasets (such as ImageNet-1K, ImageNet-21K, and JFT-300M). Both TimeSformer and ViViT have tried to train video Transformers from scratch on video datasets, but neither achieved satisfactory results. Therefore, how to effectively train a video Transformer, especially a vanilla ViT, directly on video datasets without any pre-trained models or additional image data remains an open problem. It should be noted that existing video datasets are relatively small compared to image datasets. For example, the widely used Kinetics-400 dataset has only a little more than 200,000 training samples, roughly 1/50 the size of ImageNet-21K and 1/1500 the size of JFT-300M, a gap of several orders of magnitude. At the same time, training a video model is computationally much more expensive than training an image model, which further increases the difficulty of training video Transformers on video datasets.

Recently, the self-supervised paradigm of masking-and-reconstruction has achieved great success in natural language processing (BERT) and image understanding (BEiT, MAE). We therefore apply this self-supervised paradigm to train video Transformers on video datasets and propose VideoMAE (Video Masked Autoencoder), a video self-supervised pre-training algorithm based on the masking-and-reconstruction pretext task. ViT models pre-trained with VideoMAE achieve significantly better results than other methods on both larger video datasets such as Kinetics-400 and Something-Something V2 and relatively small video datasets such as UCF101 and HMDB51.

Method Introduction

MAE Overview

MAE adopts an asymmetric encoder-decoder architecture for the self-supervised masking-and-reconstruction pre-training task. A 224x224 input image is first divided into non-overlapping 16x16 patches (tokens), and each patch is converted into a high-dimensional feature by a patch embedding (token embedding) layer. MAE then uses a high masking ratio (75%) to randomly mask out a subset of the tokens. After masking, only the remaining visible tokens are fed to the encoder for feature extraction. The features extracted by the encoder are then concatenated with a set of preset learnable mask tokens to restore a sequence as long as the original token grid. Finally, a lightweight decoder reconstructs the original image from these features (in practice, the reconstruction target is the normalized pixel values of each masked patch).
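
This asymmetric design can be sketched in a few dozen lines of PyTorch. The code below is an illustrative toy, not MAE's released implementation: module names, the tiny layer counts, and the omission of positional embeddings are our simplifications. It only shows the flow described above: embed patches, randomly drop 75% of the tokens, encode the visible ones, append learnable mask tokens, un-shuffle, and decode to pixel predictions.

```python
import torch
import torch.nn as nn


def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of tokens; return the kept tokens and the shuffle order."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)            # one random score per token
    ids_shuffle = noise.argsort(dim=1)                        # random permutation per sample
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))
    return visible, ids_shuffle


class TinyMAE(nn.Module):
    """Minimal MAE-style asymmetric encoder-decoder (positional embeddings omitted for brevity)."""

    def __init__(self, dim=768, dec_dim=384, patch_pixels=16 * 16 * 3):
        super().__init__()
        self.patch_embed = nn.Linear(patch_pixels, dim)       # stand-in for the conv patch embedding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), num_layers=2)
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))   # learnable [MASK] token
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=6, batch_first=True), num_layers=2)
        self.head = nn.Linear(dec_dim, patch_pixels)          # predicts normalized pixel values

    def forward(self, patches, mask_ratio=0.75):
        tokens = self.patch_embed(patches)                    # (B, N, dim)
        visible, ids_shuffle = random_masking(tokens, mask_ratio)
        latent = self.enc_to_dec(self.encoder(visible))       # heavy encoder sees only ~25% of tokens
        B, N, _ = tokens.shape
        mask_tokens = self.mask_token.expand(B, N - visible.shape[1], -1)
        full = torch.cat([latent, mask_tokens], dim=1)        # shuffled order: kept tokens first
        ids_restore = ids_shuffle.argsort(dim=1)              # indices that undo the shuffle
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).repeat(1, 1, full.shape[-1]))
        return self.head(self.decoder(full))                  # (B, N, patch_pixels) reconstruction


model = TinyMAE()
out = model(torch.randn(2, 196, 16 * 16 * 3))                 # 196 patches of a 224x224 image
print(out.shape)                                              # torch.Size([2, 196, 768])
```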

Characteristics of video data

Compared with image data, video data contains more frames and richer motion information. This section first analyzes the characteristics of video data.

Figure: Examples of different masking strategies for video data.

Temporal redundancy

Video data consists of densely sampled image frames, and the semantic content of these frames changes very slowly over time. As shown in the figure, the dense, consecutive frames of a video are therefore highly redundant. This redundancy may cause two problems when applying MAE to video. First, pre-training at the original dense frame rate (e.g., 30 FPS) would be very inefficient, and it would bias the network towards static appearance features or small, locally changing motion. Second, temporal redundancy greatly dilutes the motion information in the video, which makes reconstructing the masked patches too easy under normal masking ratios (e.g., 50% to 75%). Both problems hurt the pre-trained encoder's ability to extract motion features.

Temporal correlation

A video can be regarded as the evolution of a static picture over time, so there are also semantic correspondences between frames. If the masking strategy is not designed carefully, this temporal correlation increases the risk of "information leakage" during reconstruction. Specifically, as shown in the figure, with global random masking or randomly masked frames, the network can exploit temporal correlation and reconstruct a masked patch by simply "copying and pasting" the unmasked patch at the corresponding spatial position in an adjacent frame. The pretext task can then be partly solved in this way, but VideoMAE would only learn low-level temporal correspondences instead of high-level abstract semantics, such as spatiotemporal reasoning about the video content. To alleviate this, a new masking strategy is needed to make the reconstruction task more challenging, so that the network learns better spatiotemporal representations.

VideoMAE Method Introduction

Figure: The overall framework of VideoMAE.

To address the problems that the masking-and-reconstruction task encounters during video pre-training, as analyzed above, we introduce several new designs in VideoMAE.

Temporal downsampling

Following the above analysis of the temporal redundancy in dense consecutive frames, VideoMAE adopts a strided temporal sampling strategy for more efficient video self-supervised pre-training. Specifically, a video segment consisting of $t$ consecutive frames is first randomly sampled from the original video. Strided temporal sampling then compresses the segment to a clip of $T$ frames, each containing $H \times W \times 3$ pixels. In the experiments, the sampling stride is set to 4 on Kinetics-400 and 2 on Something-Something V2.
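
A minimal sketch of this strided sampling is shown below. The 16-frame clip length and the strides of 4 and 2 come from the text; the function name and the rest of the details are illustrative assumptions.

```python
import numpy as np


def sample_clip(num_video_frames, clip_len=16, stride=4, seed=None):
    """Randomly pick a window of clip_len * stride consecutive frames from the video,
    then keep every `stride`-th frame (strided temporal downsampling)."""
    rng = np.random.default_rng(seed)
    span = clip_len * stride                          # the t consecutive raw frames
    start = rng.integers(0, max(1, num_video_frames - span + 1))
    indices = start + stride * np.arange(clip_len)    # the T frames that are kept
    return np.clip(indices, 0, num_video_frames - 1)  # guard against short videos


# A 300-frame Kinetics-400 video: stride 4, 16 kept frames spanning 64 raw frames.
print(sample_clip(300, clip_len=16, stride=4))
# Something-Something V2 uses a denser stride of 2.
print(sample_clip(120, clip_len=16, stride=2))
```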

Joint space-time cube embedding

Before being fed into the encoder, the sampled clip is tokenized by a joint space-time cube embedding. Specifically, each cube of size $2 \times 16 \times 16$ in a clip of size $T \times H \times W$ is treated as one visual token. The cube embedding layer therefore produces $\frac{T}{2} \times \frac{H}{16} \times \frac{W}{16}$ tokens from the sampled clip and maps the channel dimension of each token to $D$. This design reduces the spatiotemporal size of the input and also helps alleviate the spatiotemporal redundancy of video data to a certain extent.
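
A common way to implement such a cube embedding is a single 3D convolution whose kernel and stride both equal the cube size, so every 2x16x16 cube maps to one D-dimensional token. The sketch below is illustrative rather than the released code; the cube size and the resulting T/2 x H/16 x W/16 token grid follow the description above.

```python
import torch
import torch.nn as nn


class CubeEmbedding(nn.Module):
    """Joint space-time cube embedding: each 2x16x16 cube of the clip becomes one token."""

    def __init__(self, in_channels=3, embed_dim=768, cube_size=(2, 16, 16)):
        super().__init__()
        # Kernel size == stride, so the cubes do not overlap.
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=cube_size, stride=cube_size)

    def forward(self, video):
        # video: (B, 3, T, H, W) -> (B, D, T/2, H/16, W/16)
        x = self.proj(video)
        # Flatten the space-time grid into a token sequence: (B, N, D).
        return x.flatten(2).transpose(1, 2)


clip = torch.randn(1, 3, 16, 224, 224)    # a 16-frame 224x224 clip
tokens = CubeEmbedding()(clip)
print(tokens.shape)                       # torch.Size([1, 1568, 768]), i.e. 8 * 14 * 14 tokens
```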

Tube masking with an extremely high masking ratio

To address the "information leakage" problem caused by temporal redundancy and temporal correlation, VideoMAE adopts a tube masking strategy during self-supervised pre-training. Tube masking naturally extends a single-frame mask to the whole video sequence: tokens at the same spatial location are masked in every frame. Formally, tube masking can be expressed as $\mathbb{I}[p_{x,y,\cdot} \in \Omega] \sim \mathrm{Bernoulli}(\rho_{\mathrm{mask}})$, where different times $t$ share the same mask value. With this strategy, tokens at a masked spatial location are hidden in all frames, so for some visual patches (e.g., the patch containing the finger in row 4 of the masking-strategy figure) the network cannot find a corresponding visible patch in other frames. This design mitigates the risk of "information leakage" during reconstruction and forces VideoMAE to reconstruct masked tokens by extracting high-level semantic information from the original clip.

Compared with images, video data is more redundant and has a much lower information density. This property allows VideoMAE to use extremely high masking ratios (e.g., 90% to 95%) for pre-training; for comparison, the default masking ratio of MAE is 75%. Experimental results show that such an extremely high masking ratio not only speeds up pre-training (only 5% to 10% of tokens are fed to the encoder) but also improves the representation quality of the model and its performance on downstream tasks.
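
Assuming the 8 x 14 x 14 token grid produced by the cube embedding above, a tube mask can be generated by drawing one random mask over the 14 x 14 spatial grid and repeating it along time, so the same spatial positions are hidden in every frame. This is a sketch of the idea (the function name and grid sizes are our assumptions), using the 90% ratio mentioned in the text; it is not the released implementation.

```python
import torch


def tube_mask(batch, t_tokens=8, h_tokens=14, w_tokens=14, mask_ratio=0.9):
    """Return a boolean mask of shape (batch, t_tokens * h_tokens * w_tokens), True = masked.
    One random draw is made per spatial cell and shared by every time step (a "tube")."""
    num_spatial = h_tokens * w_tokens
    num_masked = int(mask_ratio * num_spatial)
    masks = []
    for _ in range(batch):
        spatial = torch.zeros(num_spatial, dtype=torch.bool)
        spatial[torch.randperm(num_spatial)[:num_masked]] = True   # mask ~90% of spatial cells
        masks.append(spatial.repeat(t_tokens))                     # same cells masked in every frame
    return torch.stack(masks)


mask = tube_mask(batch=2)
print(mask.shape, mask.float().mean().item())   # torch.Size([2, 1568]) and roughly 0.90
```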

Joint space-time self-attention

As mentioned above, VideoMAE uses an extremely high masking ratio and keeps only a small number of tokens as encoder input. To better extract spatiotemporal features from these unmasked tokens, VideoMAE uses the vanilla ViT as the backbone and applies joint space-time self-attention in its attention layers (i.e., the original ViT architecture is unchanged), so that all unmasked tokens can interact with each other in the self-attention layers. The quadratic ($\mathcal{O}(N^2)$) computational complexity of joint space-time self-attention is the network's main bottleneck; however, because the extremely high masking ratio means only the unmasked tokens (e.g., 10%) are fed into the encoder, this design effectively alleviates the quadratic complexity problem to a large extent.
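
The interaction between the high masking ratio and joint space-time self-attention can be illustrated as follows: only the visible tokens enter the ViT encoder, so the quadratic attention cost is paid over roughly 10% of the full sequence. The snippet is a rough sketch with standard PyTorch layers standing in for ViT blocks; none of it is the released implementation.

```python
import torch
import torch.nn as nn

embed_dim, depth, num_heads = 768, 12, 12        # ViT-B-style settings (illustrative)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=num_heads, batch_first=True),
    num_layers=depth)

T, H, W = 8, 14, 14                              # token grid from the cube embedding
tokens = torch.randn(1, T * H * W, embed_dim)    # full sequence of 1568 tokens

# Tube mask: one random spatial draw (90% masked), repeated across time.
spatial_masked = torch.rand(H * W).argsort() < int(0.9 * H * W)
mask = spatial_masked.repeat(T)                  # (1568,) boolean, True = masked

visible = tokens[:, ~mask, :]                    # only ~160 tokens reach the encoder

# Joint space-time self-attention: every visible token attends to every other token,
# regardless of its frame, so the quadratic cost applies to ~0.1*N tokens instead of N.
features = encoder(visible)
print(visible.shape, features.shape)             # (1, 160, 768) in and out
```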

VideoMAE implementation details

Figure: Specific design details of the VideoMAE framework.

The figure above shows the specific architecture of the encoder and decoder used by VideoMAE (taking ViT-B as an example). We evaluate VideoMAE on four downstream video action recognition datasets and one action detection dataset, which focus on different aspects of motion information. Kinetics-400 is a large-scale YouTube video dataset containing approximately 300,000 trimmed video clips covering 400 action categories; it mainly contains activities from daily life, and some categories are strongly correlated with interacting objects or scene information. The videos in Something-Something V2 mainly show different objects undergoing the same actions, so action recognition on this dataset focuses more on motion attributes than on objects or scenes; its training set contains approximately 170,000 clips and its validation set approximately 25,000 clips. UCF101 and HMDB51 are two relatively small action recognition datasets: the training set of UCF101 contains approximately 9,500 videos, and that of HMDB51 approximately 3,500 videos. In the experiments, we first use VideoMAE for self-supervised pre-training on the training set, then fine-tune the encoder (ViT) with supervision on the same training set, and finally evaluate the model on the validation set. For the action detection dataset AVA, we first load the model trained on Kinetics-400 and then fine-tune the encoder (ViT) with supervision.

Ablation experiments

This section presents ablation experiments for VideoMAE on the Something-Something V2 and Kinetics-400 datasets. The ablations use the vanilla ViT model with 16 input frames by default. For evaluation after fine-tuning, 2 clips x 3 crops are used for testing on Something-Something V2, and 5 clips x 3 crops on Kinetics-400.
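
A minimal sketch of this multi-view testing protocol is shown below, assuming a `model` that maps a single clip tensor to class logits and a list of already-sampled clip/crop views; both names are placeholders, not part of the released code.

```python
import torch


@torch.no_grad()
def multi_view_predict(model, views):
    """views: a list of clip tensors of shape (3, T, H, W), e.g. 2 clips x 3 crops = 6 views
    on Something-Something V2, or 5 clips x 3 crops = 15 views on Kinetics-400.
    The softmax scores of all views are averaged before taking the final prediction."""
    scores = [torch.softmax(model(v.unsqueeze(0)), dim=-1) for v in views]
    return torch.stack(scores).mean(dim=0).argmax(dim=-1)
```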

Table: VideoMAE ablation study results (a) to (e).

Decoder design

The lightweight decoder is a key component of VideoMAE. Table (a) shows results with decoders of different depths. Unlike in MAE, a deeper decoder in VideoMAE achieves better performance, while a shallower decoder effectively reduces GPU memory usage; the number of decoder layers is set to 4 by default. Following the empirical design of MAE, the decoder width is set to half of the encoder width (e.g., 384 when ViT-B is used as the encoder).
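
The asymmetric widths can be summarized in a small configuration object. The 768/384 widths and the 4 decoder blocks come from the text (for a ViT-B encoder); the remaining values are typical ViT-B defaults and illustrative field names, to be checked against the released code.

```python
from dataclasses import dataclass


@dataclass
class VideoMAEViTBConfig:
    # Encoder: vanilla ViT-B operating on visible tokens only.
    encoder_dim: int = 768
    encoder_depth: int = 12
    encoder_heads: int = 12
    # Lightweight decoder: half the encoder width, 4 blocks by default.
    decoder_dim: int = 384
    decoder_depth: int = 4
    decoder_heads: int = 6
    # Pre-training uses tube masking with an extremely high ratio.
    mask_ratio: float = 0.9


print(VideoMAEViTBConfig())
```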

Masking strategies

Table (b) compares different masking strategies under a 75% masking ratio. Both global random masking and randomly masking whole frames perform worse than the tube masking strategy, probably because tube masking better counteracts the temporal redundancy and temporal correlation in video data. Increasing the masking ratio to 90% further improves VideoMAE's performance on Something-Something V2 from 68.0% to 69.6%. The masking strategy and masking ratio chosen in VideoMAE make masking-and-reconstruction a more challenging pretext task, forcing the model to learn higher-level spatiotemporal features.

Reconstruction target

Table (c) compares different reconstruction targets in VideoMAE. First, if only the center frame of the clip is used as the reconstruction target, performance on downstream tasks drops significantly. VideoMAE is also sensitive to the sampling stride: reconstructing a more densely sampled clip gives results clearly worse than the default temporally downsampled clip. Finally, we also tried reconstructing the denser original frames from the temporally downsampled clip, but this setting requires decoding more frames, slows down training, and does not work well.

Pre-training strategy

Table (d) compares different pre-training strategies. Consistent with the conclusions of previous methods (TimeSformer, ViViT), training ViT from scratch on Something-Something V2, a dataset that is more sensitive to motion information, does not give satisfactory results. Using a ViT model pre-trained on a large-scale image dataset (ImageNet-21K) as initialization improves accuracy from 32.6% to 61.8%, and using a model pre-trained on ImageNet-21K and Kinetics-400 further improves it to 65.2%. ViT pre-trained with VideoMAE on the video dataset itself achieves the best performance of 69.6% without using any additional data. Similar conclusions hold on Kinetics-400.

Pre-training dataset

Table (e) compares different pre-training datasets. First, following the MAE recipe, ViT is self-supervised pre-trained on ImageNet-1K for 1600 epochs; the 2D patch embedding layer is then inflated into a 3D space-time cube embedding layer using the I3D strategy, and the model is fine-tuned on the video dataset. This training paradigm outperforms a model trained with supervision from scratch. Next, we compare this MAE pre-trained model with a ViT pre-trained by VideoMAE on Kinetics-400 and find that VideoMAE achieves better performance. However, neither pre-trained model matches VideoMAE pre-trained with self-supervision only on the Something-Something V2 dataset itself. This suggests that the domain gap between the pre-training dataset and the target dataset is an important factor.
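
The I3D-style inflation mentioned above can be sketched as follows: the 2D patch-embedding kernel is repeated along a new temporal axis and rescaled so that a temporally constant input produces the same response. The tensor shapes assume a ViT-B patch embedding inflated to a 2x16x16 cube embedding; this is an illustration, not the exact released conversion script.

```python
import torch


def inflate_patch_embed(weight_2d, temporal_size=2):
    """Inflate a 2D conv kernel (D, C, kH, kW) into a 3D kernel (D, C, kT, kH, kW), I3D-style:
    copy the kernel kT times along the temporal axis and divide by kT, so the response to a
    temporally constant input is unchanged."""
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, temporal_size, 1, 1)
    return weight_3d / temporal_size


w2d = torch.randn(768, 3, 16, 16)        # e.g. an ImageNet-pretrained ViT-B patch embedding
w3d = inflate_patch_embed(w2d, temporal_size=2)
print(w3d.shape)                         # torch.Size([768, 3, 2, 16, 16])
```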

Pre-training epochs

Figure: Impact of the total number of pre-training epochs in VideoMAE.

In the ablation experiments, the total number of VideoMAE pre-training epochs is set to 800 by default. We further explore the effect of the number of pre-training epochs on the Kinetics-400 and Something-Something V2 datasets. According to the results in the figure, longer pre-training brings consistent gains on both datasets.

Important features of VideoMAE

VideoMAE is a data-efficient learner

Figure: Performance comparison between VideoMAE and MoCo v3 on different downstream video action recognition datasets.

Many previous works have studied video self-supervised pre-training extensively, but they mainly use convolutional neural networks as the backbone, and few study ViT-based training mechanisms. To verify the effectiveness of ViT-based VideoMAE for video self-supervised pre-training, we compare it with two other ViT training schemes: (1) supervised training from scratch, and (2) self-supervised pre-training with a contrastive learning method (MoCo v3). The experiments show that VideoMAE is significantly better than both. For example, on Kinetics-400, the largest dataset, VideoMAE is about 10% more accurate than training from scratch and about 6% more accurate than MoCo v3 pre-training. The strong performance of VideoMAE shows that the masking-and-reconstruction self-supervised paradigm provides an efficient pre-training mechanism for ViT. Notably, as the training set becomes smaller, the performance gap between VideoMAE and the other two schemes grows larger. Even though HMDB51 contains only about 3,500 video clips, the VideoMAE pre-trained model still achieves very satisfactory accuracy. This demonstrates that VideoMAE is a data-efficient learner, in contrast to contrastive learning, which requires large amounts of data for pre-training. This data efficiency is particularly important in scenarios where video data is limited.

Table: Efficiency analysis of VideoMAE and MoCo v3 pre-training on the Something-Something V2 dataset.

We also compare the computational efficiency of pre-training with VideoMAE against pre-training with MoCo v3. Because the masking-and-reconstruction pretext task is extremely challenging and the network observes only 10% of the input tokens in each iteration (90% of tokens are masked), VideoMAE requires more training epochs; however, masking such a high proportion of tokens greatly reduces the computation and wall-clock time of each iteration. In total, 800 epochs of VideoMAE pre-training take only 19.5 hours, while 300 epochs of MoCo v3 pre-training take 61.7 hours.

Extremely high masking ratio

Figure: The impact of the masking ratio in VideoMAE.

The extremely high masking ratio is one of the core designs of VideoMAE. We explore it in depth on the Kinetics-400 and Something-Something V2 datasets. According to the results in the figure, even at a masking ratio of 95% the network still performs very well on both of these important downstream action recognition datasets. This phenomenon differs sharply from BERT in natural language processing and MAE on images. It is the temporal redundancy and temporal correlation of video data that allow VideoMAE to operate at masking ratios far higher than those used for image data or natural language.

We also visualize reconstruction examples from pre-trained VideoMAE. As shown in the figure, VideoMAE produces satisfactory reconstructions even at extremely high masking ratios, which suggests that it can indeed learn and extract the spatiotemporal structure of videos.

Figure: Reconstruction examples produced by pre-trained VideoMAE.

Generalization and transfer capabilities: quality and quantity of data

Table: Comparison of the transfer ability of features learned by VideoMAE and MoCo v3 on smaller datasets.

To further study the representations learned by VideoMAE, this section evaluates the generalization and transfer ability of the pre-trained models. The table above shows how VideoMAE pre-trained on Kinetics-400 transfers to Something-Something V2, UCF101, and HMDB51, alongside the transfer results of models pre-trained with MoCo v3. The models pre-trained with VideoMAE transfer and generalize better than those pre-trained with MoCo v3, which shows that VideoMAE learns more transferable representations. VideoMAE pre-trained on Kinetics-400 outperforms VideoMAE pre-trained directly on UCF101 and HMDB51, but transfers relatively poorly to Something-Something V2.

Figure: Effect of reducing the number of pre-training videos on Something-Something V2.

To further explore the reason for this inconsistency, we ran experiments on Something-Something V2 in which the number of pre-training videos is reduced. The exploration includes two settings: (1) pre-training for the same number of epochs, and (2) pre-training for the same number of iterations. The results in the figure show that when the number of pre-training samples is reduced, using more training iterations can still improve the model. Even with only 42,000 pre-training videos, VideoMAE trained directly on Something-Something V2 achieves better accuracy than VideoMAE pre-trained on Kinetics-400 with 240,000 videos (68.7% vs. 68.5%). This means that the domain gap is another important factor in video self-supervised pre-training: when there is a domain gap between the pre-training dataset and the target dataset, the quality of the pre-training data matters more than its quantity. It also indirectly confirms that VideoMAE is a data-efficient learner for video self-supervised pre-training.

Main results

Table: Experimental results on Something-Something V2.

Table: Experimental results on Kinetics-400.

Table: Experimental results on UCF101 and HMDB51.

Without using any additional data, VideoMAE reaches Top-1 accuracies of 75.4% on Something-Something V2 and 87.4% on Kinetics-400. Note that the current state-of-the-art methods on Something-Something V2 rely heavily on models pre-trained on external datasets for initialization, whereas VideoMAE outperforms the previous best method by about 5% in accuracy without any external data. VideoMAE also achieves excellent performance on Kinetics-400. With limited video data (UCF101 contains fewer than 10,000 training videos and HMDB51 only about 3,500), VideoMAE far surpasses previous best methods on these small-scale video datasets without using any additional image or video data.

Table: Experimental results on AVA v2.2.

Beyond the traditional action classification task, we further verify the representation ability of the VideoMAE model on the finer-grained video understanding task of action detection, using the AVA v2.2 dataset. In these experiments, the model pre-trained on Kinetics-400 is loaded first, and ViT is then fine-tuned with supervision. As shown in the table, the ViT model pre-trained by VideoMAE achieves very good results on AVA v2.2. If the self-supervised pre-trained ViT is additionally fine-tuned with supervision on Kinetics-400, it performs even better on the action detection task (an improvement of 3 to 6 mAP). This shows that the performance of a VideoMAE self-supervised pre-trained model can be further improved by supervised fine-tuning on the upstream dataset before transferring to downstream tasks.

Impact on the community

We open-sourced the VideoMAE model and code in April this year, and they have received continued attention and recognition from the community.


According to the Papers with Code leaderboards, VideoMAE has held the top spot on the Something-Something V2 [1] and AVA 2.2 [2] lists for about half a year (from the end of March 2022 to the present). Without using any external data, VideoMAE's results on the Kinetics-400 [3], UCF101 [4], and HMDB51 [5] datasets are also the best so far.


https://huggingface.co/docs/transformers/main/en/model_doc/videomae

A few months ago, the VideoMAE model was added to Hugging Face's official Transformers repository, making it the first video understanding model included there, which to some extent reflects the community's recognition of this work. We hope that our work can provide a simple and efficient baseline for Transformer-based video pre-training and inspire subsequent Transformer-based video understanding methods.

https://github.com/open-mmlab/mmaction2/tree/dev-1.x/configs/recognition/videomae

The popular video understanding toolbox MMAction2 also supports inference with the VideoMAE model.

Figures: Results from the ECCV 2022 Ego4D Challenge using VideoMAE as the backbone.

At the recently concluded 2nd International Ego4D Workshop at ECCV 2022, VideoMAE quickly became a powerful tool for the competition. Shanghai AI Laboratory won championships in multiple sub-tracks of the Ego4D Challenge, with VideoMAE serving as an important backbone that provided powerful video features for their solutions. Notably, as the first figure above shows, VideoMAE (ViT-L) pre-trained only on Kinetics-400 already outperforms ir-CSN-152 pre-trained on the IG-65M video dataset (roughly 300 times the size of Kinetics-400). This further confirms the strong representation ability of VideoMAE pre-trained models.

Summary

The main contributions of our work include the following three aspects:

• We are the first to propose VideoMAE, a video self-supervised pre-training framework based on ViT with masking and reconstruction. Even when pre-trained on relatively small video datasets, VideoMAE achieves excellent performance. To solve the "information leakage" problem caused by temporal redundancy and temporal correlation, we propose tube masking with an extremely high masking ratio; experiments show that this design is the key to VideoMAE reaching state-of-the-art results. At the same time, thanks to the asymmetric encoder-decoder architecture, the computational cost of pre-training is greatly reduced, which substantially shortens pre-training time.

• VideoMAE extends the experience from NLP and image understanding to video understanding in a natural but valuable way, verifying that a simple masking-and-reconstruction pretext task provides a simple yet very effective solution for video self-supervised pre-training. On downstream video understanding tasks (such as action recognition and action detection), ViT models self-supervised pre-trained with VideoMAE significantly outperform both training from scratch and contrastive learning methods.

• The experiments reveal two interesting findings that may have been overlooked by previous work in NLP and image understanding: (1) VideoMAE is a data-efficient learner; even on HMDB51, a video dataset with only about 3,500 videos, VideoMAE can complete self-supervised pre-training and achieve results far exceeding those of other methods on the downstream classification task. (2) For video self-supervised pre-training, when there is an obvious domain gap between the pre-training dataset and the downstream dataset, the quality of the video data may matter more than its quantity.
