Video Pose Transformers (VPTs) currently achieve state-of-the-art performance in video-based 3D human pose estimation. In recent years, however, their computational cost has grown steadily. This cost now limits further progress in the field and is especially unfriendly to researchers with limited computing resources. For example, training a 243-frame VPT model usually takes several days, which seriously slows down research and has become a pain point that urgently needs to be solved.
So how can the efficiency of VPTs be improved with almost no loss of accuracy?
A team from Peking University proposed HoT, an efficient 3D human pose estimation framework based on the Hourglass Tokenizer, to address the high computational cost of existing VPTs. The framework is plug-and-play and can be seamlessly integrated into models such as MHFormer, MixSTE, and MotionBERT, reducing their computation by nearly 40% without losing accuracy. The code has been open-sourced.
##Research motivation
To make VPTs efficient, the paper argues that two factors need to be considered first:
Based on these considerations, the authors propose ⏳ Hourglass Tokenizer (HoT), an efficient 3D human pose estimation framework with an hourglass structure. The method has two major highlights.
First, HoT is the first Transformer-based plug-and-play framework for efficient 3D human pose estimation. As shown in the figure below, a traditional VPT adopts a "rectangular" paradigm: it maintains the full-length sequence of pose tokens in all layers of the model, which brings high computational cost and feature redundancy. HoT instead first prunes redundant tokens and later restores the full-length token sequence (which makes the model look like an "hourglass"), so that only a small number of tokens are kept in the middle layers of the Transformer, effectively improving the model's efficiency. HoT is also highly versatile: it can be seamlessly integrated into conventional VPT models, whether seq2seq-based or seq2frame-based, and it can be adapted to various token pruning and recovery strategies.
Second, HoT reveals that maintaining the full-length pose sequence is redundant, and that the pose tokens of a small number of representative frames are enough to achieve both high efficiency and high performance. Compared with traditional VPT models, HoT not only significantly improves processing efficiency but also achieves highly competitive or even better results. For example, it reduces MotionBERT's FLOPs by nearly 50% without sacrificing performance, and reduces MixSTE's FLOPs by nearly 40% with only a slight performance drop of 0.2%.
##Model method
The overall framework of HoT is shown in the figure below. To perform token pruning and recovery more effectively, the paper proposes two modules: Token Pruning Cluster (TPC) and Token Recovering Attention (TRA).
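To get a feel for why the hourglass shape saves computation, the token count per layer can be plugged into a rough cost model. The sketch below is only an illustration: `layer_macs` and `model_macs` are hypothetical helpers, the formula is the usual back-of-the-envelope estimate for one Transformer block (it ignores the joint dimension, normalization, the embedding, the regression head, and the recovery step), and the layer count, pruning position, and token numbers are made-up values rather than the authors' configuration.

```python
def layer_macs(num_tokens, dim, mlp_ratio=4):
    """Rough multiply-accumulate count for one Transformer block over time."""
    projections = 4 * num_tokens * dim * dim        # Q, K, V and output projections
    attention = 2 * num_tokens * num_tokens * dim   # score matrix plus weighted sum
    mlp = 2 * mlp_ratio * num_tokens * dim * dim    # two linear layers of the MLP
    return projections + attention + mlp


def model_macs(tokens_per_layer, dim):
    """Sum the per-layer estimate over a list of per-layer token counts."""
    return sum(layer_macs(t, dim) for t in tokens_per_layer)


# Hypothetical setting (assumed values, not taken from the article).
num_layers, full_len, kept, prune_after, dim = 8, 243, 81, 2, 512

# "Rectangular" paradigm: every layer sees the full-length sequence.
rectangular = [full_len] * num_layers
# "Hourglass" paradigm: full length first, then only the kept tokens.
hourglass = [full_len] * prune_after + [kept] * (num_layers - prune_after)

ratio = model_macs(hourglass, dim) / model_macs(rectangular, dim)
print(f"hourglass / rectangular cost: {ratio:.2f}")
```

In this cost model the projection and MLP terms grow linearly with the number of tokens and the attention term grows quadratically, so a layer that runs on one third of the tokens costs less than one third of a full-length layer.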
The TPC module dynamically selects a small number of representative tokens with high semantic diversity while mitigating the redundancy of video frames. The TRA module recovers detailed spatio-temporal information from the selected tokens, extending the network output back to the original full-length temporal resolution for fast inference.
##Token pruning and clustering module
Selecting a small number of information-rich pose tokens for accurate 3D human pose estimation is a difficult problem. The paper argues that the key is to select representative tokens with high semantic diversity, because such tokens retain the necessary information while reducing video redundancy. Based on this idea, it proposes the Token Pruning Cluster (TPC) module, which is simple, effective, and requires no additional parameters. The core of the module is to identify and remove tokens that contribute little semantically and to focus on tokens that provide key information for the final 3D pose estimate. Using a clustering algorithm, TPC dynamically selects the cluster centers as representative tokens, exploiting the properties of cluster centers to retain the rich semantics of the original data.
The structure of TPC is shown in the figure below. It first pools the input pose tokens along the spatial dimension, then clusters the input tokens according to the feature similarity of the pooled tokens, and selects the cluster centers as representative tokens.
##Token Recovering Attention module
The TPC module effectively reduces the number of pose tokens. However, the drop in temporal resolution caused by pruning prevents the VPT from performing fast seq2seq inference, so the tokens need to be restored. At the same time, for efficiency, the recovery module should be lightweight to minimize its impact on the overall computational cost. To address these challenges, the paper designs a lightweight Token Recovering Attention (TRA) module that recovers detailed spatio-temporal information from the selected tokens. In this way, the low temporal resolution caused by pruning is expanded back to that of the original full sequence, allowing the network to estimate the 3D human poses of all frames at once and thereby achieve fast seq2seq inference.
The structure of the TRA module is shown in the figure below. It takes the representative tokens from the last Transformer layer and learnable tokens initialized to zero, and restores the full-length token sequence through a simple cross-attention mechanism.
##Apply to existing VPT
Before discussing how to apply the proposed method to existing VPTs, the paper first summarizes the existing VPT architecture. As shown in the figure below, a VPT mainly consists of three components: a pose embedding module that encodes the spatial and temporal information of the pose sequence, a multi-layer Transformer that learns global spatio-temporal representations, and a regression head that outputs the 3D human pose results.
According to the number of output frames, existing VPTs follow one of two inference pipelines: seq2frame and seq2seq. In the seq2seq pipeline, the output covers all frames of the input video, so the original full-length temporal resolution needs to be restored. As shown in the HoT framework diagram, both the TPC and TRA modules are embedded in the VPT. In the seq2frame pipeline, the output is the 3D pose of the center frame of the video, so the TRA module is unnecessary and only the TPC module is integrated into the VPT.
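The pruning and recovery steps described above, and the way they slot into the two pipelines, can be pictured with a small pure-Python sketch. It is a toy under stated assumptions, not the authors' implementation: all function names are hypothetical, the article only says "a clustering algorithm", so the density-peaks style selection rule here is an assumed stand-in, and the attention omits the learned projections and shares one set of query tokens across joints for brevity. Tokens are nested lists shaped `[frames][joints][channels]`, and at least two frames are assumed.

```python
import math


def spatial_pool(tokens):
    """Average over joints: [T][J][C] -> [T][C]."""
    return [[sum(ch) / len(frame) for ch in zip(*frame)] for frame in tokens]


def select_representatives(pooled, num_keep, k=2):
    """Pick num_keep frame indices that act as cluster centers.

    Assumed rule: prefer frames that are dense (close to their k nearest
    neighbours) and far from any denser frame.
    """
    n = len(pooled)
    dist = [[math.dist(a, b) for b in pooled] for a in pooled]
    density = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        density.append(math.exp(-sum(d * d for d in nearest) / len(nearest)))
    far = max(max(row) for row in dist)
    score = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if density[j] > density[i]]
        score.append(density[i] * (min(denser) if denser else far))
    top = sorted(range(n), key=lambda i: score[i], reverse=True)[:num_keep]
    return sorted(top)  # keep the selected frames in temporal order


def cross_attention(queries, keys, values):
    """Scaled dot-product attention without projections: [T][C] over [f][C]."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, values)) / total
                    for c in range(len(values[0]))])
    return out


def recover(pruned, queries):
    """Expand [f][J][C] tokens to [T][J][C], one attention pass per joint."""
    num_joints = len(pruned[0])
    per_joint = []
    for j in range(num_joints):
        reps = [frame[j] for frame in pruned]                    # [f][C]
        per_joint.append(cross_attention(queries, reps, reps))   # [T][C]
    return [[per_joint[j][t] for j in range(num_joints)]
            for t in range(len(queries))]


# Toy input: 6 frames, 2 joints, 2 channels, forming two groups of similar poses.
frames = [(0, 0), (1, 0), (2, 0), (10, 10), (11, 10), (12, 10)]
tokens = [[[float(x), float(y)] for _ in range(2)] for x, y in frames]

keep = select_representatives(spatial_pool(tokens), num_keep=2)
pruned = [tokens[i] for i in keep]        # seq2frame would stop pruning here

queries = [[0.0, 0.0] for _ in tokens]    # query tokens at zero initialization
full = recover(pruned, queries)           # seq2seq: back to one token per frame
print(keep, len(pruned), len(full))
```

With all-zero queries and no projections, every recovered frame in this sketch is simply the average of the representative tokens, which illustrates why the query tokens have to be learned before the recovery becomes useful. In the seq2frame case the recovery call is skipped and the regression head would read from the pruned tokens only.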
The seq2frame framework is shown in the figure below.
##Experimental results
##Ablation experiment
The paper also compares different token pruning strategies, including attention score pruning, uniform sampling, and a motion pruning strategy that selects the top-k tokens with the largest motion. The proposed TPC achieves the best performance.
The paper also compares different token recovery strategies, including nearest-neighbor interpolation and linear interpolation. The proposed TRA achieves the best performance.
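For readers unfamiliar with the simpler baselines named in the ablation, the following sketch shows one plausible reading of uniform sampling (as a pruning strategy) and of nearest-neighbor and linear interpolation (as recovery strategies). The function names and the exact index rule are assumptions made for illustration; the article does not specify how the authors implemented these baselines.

```python
from bisect import bisect_right


def uniform_indices(num_frames, num_keep):
    """Evenly spaced frame indices that include the first and last frame."""
    if num_keep < 2:
        return [num_frames // 2]
    step = (num_frames - 1) / (num_keep - 1)
    return [round(i * step) for i in range(num_keep)]


def nearest_recover(kept_tokens, kept_idx, num_frames):
    """Copy, for every frame, the token of the closest kept frame."""
    out = []
    for t in range(num_frames):
        j = min(range(len(kept_idx)), key=lambda n: abs(kept_idx[n] - t))
        out.append(list(kept_tokens[j]))
    return out


def linear_recover(kept_tokens, kept_idx, num_frames):
    """Linearly blend the two kept tokens that surround each frame."""
    out = []
    for t in range(num_frames):
        hi = bisect_right(kept_idx, t)
        if hi == 0:                       # before the first kept frame
            out.append(list(kept_tokens[0]))
            continue
        if hi == len(kept_idx):           # at or after the last kept frame
            out.append(list(kept_tokens[-1]))
            continue
        lo = hi - 1
        w = (t - kept_idx[lo]) / (kept_idx[hi] - kept_idx[lo])
        out.append([(1 - w) * a + w * b
                    for a, b in zip(kept_tokens[lo], kept_tokens[hi])])
    return out


# Toy example: 9 frames, keep 3 of them, one channel per token.
idx = uniform_indices(9, 3)
kept = [[float(i)] for i in idx]
print(idx)
print(nearest_recover(kept, idx, 9))
print(linear_recover(kept, idx, 9))
```

Both recovery baselines are parameter-free and depend only on frame positions, whereas TPC and TRA as described above select and restore tokens based on the token features themselves.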
##Comparison with SOTA methods
As shown in the table below, the method significantly reduces the computation of SOTA VPT models while maintaining their original accuracy. These results not only verify the effectiveness and efficiency of the method, but also reveal that existing VPT models contain computational redundancy that contributes little to the final estimation performance and may even degrade it. The method removes this unnecessary computation while achieving highly competitive or even better performance.
##Code operation
The authors also provide a demo (https://github.com/NationalGAILab/HoT) that integrates the YOLOv3 human detector, the HRNet 2D pose detector, and HoT w. MixSTE as the 2D-to-3D pose lifter. Download the pre-trained models provided by the authors, supply a short video containing people, and a single line of code produces the 3D human pose estimation demo:
python demo/vis.py --video sample_video.mp4
Results obtained by running the sample video:
##Summary
This paper proposes Hourglass Tokenizer (HoT), a plug-and-play token pruning and recovery framework for efficient Transformer-based 3D human pose estimation from videos, to address the high computational cost of existing Video Pose Transformers (VPTs). The study finds that maintaining the full-length pose sequence in a VPT is unnecessary and that the pose tokens of a small number of representative frames are enough to achieve both high accuracy and high efficiency. Extensive experiments verify the compatibility and broad applicability of the method: it can be easily integrated into common VPT models, whether seq2seq-based or seq2frame-based, and it adapts to a variety of token pruning and recovery strategies. The authors expect HoT to drive the development of stronger and faster VPTs.