Table of Contents
Positioning video clips in one sentence
Benchmark data of researchers on the TSG task Experimental evaluation was conducted on Charades-STA and ActivityNet Captions. The comparison with the baseline method is shown in Table 1.
Home Technology peripherals AI Tsinghua University's new method successfully locates precise video clips! SOTA has been surpassed and open sourced

Tsinghua University's new method successfully locates precise video clips! SOTA has been surpassed and open sourced

Jan 09, 2024 pm 03:26 PM
video ai

With just one sentence of description, you can locate the corresponding clip in a large video!

For example, describing "a person drinking water while going down the stairs", by matching the video image and the footsteps, the new method can immediately find out the corresponding start and end timestamps:

Tsinghua Universitys new method successfully locates precise video clips! SOTA has been surpassed and open sourced

Even "laughing", which is semantically difficult to understand, can be accurately located:

Tsinghua Universitys new method successfully locates precise video clips! SOTA has been surpassed and open sourced

The method is called Adaptive dual branch Promotion Network (ADPN), proposed by the Tsinghua University research team.

Specifically, ADPN is used to complete a visual-linguistic cross-modal task called video clip positioning (Temporal Sentence Grounding, TSG) , that is, based on the query text, from the video Locate relevant segments.

ADPN is characterized by its ability to efficiently utilize the consistency and complementarity of visual and audio modalities in the video to enhance video clip positioning performance.

Compared with other TSG work PMI-LOC and UMT that use audio, the ADPN method has achieved more significant performance improvements from the audio mode, and has won new SOTA in multiple tests.

This work has been accepted by ACM Multimedia 2023 and is completely open source.

Tsinghua Universitys new method successfully locates precise video clips! SOTA has been surpassed and open sourced

Let’s take a look at what ADPN is~

Positioning video clips in one sentence

Positioning of video clips(Temporal Sentence Grounding (TSG) is an important visual-linguistic cross-modal task.

Its purpose is to find the start and end timestamps of segments that semantically match it in an unedited video based on natural language queries. It requires the method to have strong temporal cross-modal reasoning capabilities.

However, most existing TSG methods only consider visual information in the video, such as RGB, optical flows, depth(depth) etc., while ignoring the audio information naturally accompanying the video.

Audio information often contains rich semantics and is consistent and complementary with visual information. As shown in the figure below, these properties will help the TSG task.

Tsinghua Universitys new method successfully locates precise video clips! SOTA has been surpassed and open sourced
△Figure 1

(a) Consistency: The video image and footsteps consistently match "walking down the stairs" in the query Semantics; (b) Complementarity: It is difficult to identify specific behaviors in video images to locate the semantics of "laughing" in the query, but the appearance of laughter provides a strong complementary positioning clue.

Therefore, researchers have deeply studied the audio-enhanced video clip positioning task

(Audio-enhanced Temporal Sentence Grounding, ATSG) , aiming to better combine visual and audio Capture positioning clues in the modal, however, the introduction of audio modal also brings the following challenges:

    The consistency and complementarity of audio and visual modalities are associated with the query text, so capturing Audio-visual consistency and complementarity require modeling of the three-modal interaction of text-visual-audio.
  • There are significant modal differences between audio and vision. The information density and noise intensity of the two are different, which will affect the performance of audio-visual learning.
In order to solve the above challenges, researchers proposed a novel ATSG method "

Adaptive Dual-branch Prompted Network"(Adaptive Dual-branch Prompted Network, ADPN ).

Through a dual-branch model structure design, this method can adaptively model the consistency and complementarity between audio and vision, and further eliminate the noise using a denoising optimization strategy based on course learning. The interference of audio modal noise reveals the importance of audio signals for video retrieval.

The overall structure of ADPN is shown in the figure below:

Tsinghua Universitys new method successfully locates precise video clips! SOTA has been surpassed and open sourced
△Figure 2: Overall schematic diagram of Adaptive Dual Branch Promotion Network (ADPN)

It mainly includes three designs:

1. Dual-branch network structure design

Considering that the noise of audio is more obvious, and for TSG tasks, audio is usually There is more redundant information, so the learning process of audio and visual modalities needs to be given different importance. Therefore, this article involves a dual-branch network structure that uses audio and vision for multi-modal learning while also dealing with visual information. Be strengthened.

Specifically, referring to Figure 2(a), ADPN simultaneously trains a branch (visual branch) that uses only visual information and a branch that uses both visual information and audio information (joint branch ).

The two branches have similar structures, in which the joint branch adds a text-guided clue mining unit

(TGCM) Modeling text-visual-audio modal interaction. During the training process, the two branches update parameters simultaneously, and the inference phase uses the result of the joint branch as the model prediction result.

2. Text-Guided Clues Mining Unit(Text-Guided Clues Miner, TGCM)

Considering the consistency of audio and visual modalities Sex and complementarity are conditioned on a given text query, so the researchers designed a TGCM unit to model the interaction between the three modalities of text-visual-audio.

Referring to Figure 2(b), TGCM is divided into two steps: "extraction" and "propagation".

First, text is used as the query condition, and the associated information is extracted and integrated from the visual and audio modalities; then the visual and audio modalities are used as the query condition, and the integrated information is spread through attention to the respective modes of vision and audio, and finally feature fusion through FFN.

3. Course learning optimization strategy

The researchers observed that the audio contained noise, which would affect the effect of multi-modal learning, so they used the intensity of the noise as For reference of sample difficulty, Curriculum Learning (CL)

is introduced to denoise the optimization process, refer to Figure 2(c). They evaluate the difficulty of the sample based on the difference in the predicted output of the two branches. They believe that a sample that is too difficult has a high probability that its audio contains too much noise and is not suitable for the TSG task, so they evaluate the difficulty of the sample based on the score. The loss function terms of the training process are reweighted to discard bad gradients caused by noise in the audio.

(Please refer to the original text for the rest of the model structure and training details.)

Multiple tests of new SOTA

Benchmark data of researchers on the TSG task Experimental evaluation was conducted on Charades-STA and ActivityNet Captions. The comparison with the baseline method is shown in Table 1.

The ADPN method can achieve SOTA performance; in particular, compared to other TSG work PMI-LOC and UMT that utilize audio, the ADPN method obtains more significant performance improvements from the audio mode, indicating that the ADPN method utilizes audio Modality promotes the superiority of TSG.

Tsinghua Universitys new method successfully locates precise video clips! SOTA has been surpassed and open sourced△Table 1: Experimental results on Charades-STA and ActivityNet Captions

The researchers further demonstrated the effectiveness of different design units in ADPN through ablation experiments sex, as shown in Table 2.

Tsinghua Universitys new method successfully locates precise video clips! SOTA has been surpassed and open sourced△Table 2: Ablation experiment on Charades-STA

The researchers selected the prediction results of some samples to visualize, and drew the TGCM " Extract the "text to vision" (T→V) and "text to audio" (T→A) attention weight distribution in the "step," as shown in Figure 3.

It can be observed that the introduction of audio modality improves the prediction results. From the case of "Person laughs at it", we can see that the attention weight distribution of T→A is closer to the Ground Truth, which corrects the misguided guidance of the model prediction by the weight distribution of T→V.

Tsinghua Universitys new method successfully locates precise video clips! SOTA has been surpassed and open sourced△Figure 3: Case display

In general, the researchers in this article proposed a novel adaptive dual-branch promotion network

(ADPN)

to solve the problem of audio enhanced video clip positioning (ATSG) . They designed a dual-branch model structure to jointly train the visual branch and the audiovisual joint branch to resolve the information difference between audio and visual modalities.

They also proposed a text-guided clue mining unit

(TGCM)

that uses text semantics as a guide to model text-audio-visual interaction. Finally, the researchers designed a course learning-based optimization strategy to further eliminate audio noise, evaluate sample difficulty as a measure of noise intensity in a self-aware manner, and adaptively adjust the optimization process.

They first conducted an in-depth study of the characteristics of audio in ATSG to better improve the performance improvement effect of audio modalities.

In the future, they hope to build a more appropriate evaluation benchmark for ATSG to encourage more in-depth research in this area.

Paper link: https://dl.acm.org/doi/pdf/10.1145/3581783.3612504

Warehouse link: https://github.com/hlchen23 /ADPN-MM

The above is the detailed content of Tsinghua University's new method successfully locates precise video clips! SOTA has been surpassed and open sourced. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
Repo: How To Revive Teammates
1 months ago By 尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island Adventure: How To Get Giant Seeds
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

How to dynamically create an object through a string and call its methods in Python? How to dynamically create an object through a string and call its methods in Python? Apr 01, 2025 pm 11:18 PM

In Python, how to dynamically create an object through a string and call its methods? This is a common programming requirement, especially if it needs to be configured or run...

How to use Go or Rust to call Python scripts to achieve true parallel execution? How to use Go or Rust to call Python scripts to achieve true parallel execution? Apr 01, 2025 pm 11:39 PM

How to use Go or Rust to call Python scripts to achieve true parallel execution? Recently I've been using Python...

How to solve the problem of missing dynamic loading content when obtaining web page data? How to solve the problem of missing dynamic loading content when obtaining web page data? Apr 01, 2025 pm 11:24 PM

Problems and solutions encountered when using the requests library to crawl web page data. When using the requests library to obtain web page data, you sometimes encounter the...

How to recover Debian mail server How to recover Debian mail server Apr 02, 2025 am 07:33 AM

Detailed Steps for Restoring Debian Mail Server This article will guide you on how to restore Debian Mail Server. Before you begin, it is important to remember the importance of data backup. Recovery Steps: Backup Data: Be sure to back up all important email data and configuration files before performing any recovery operations. This will ensure that you have a fallback version when problems occur during the recovery process. Check log files: Check mail server log files (such as /var/log/mail.log) for errors or exceptions. Log files often provide valuable clues about the cause of the problem. Stop service: Stop the mail service to prevent further data corruption. Use the following command: su

How to operate Zookeeper performance tuning on Debian How to operate Zookeeper performance tuning on Debian Apr 02, 2025 am 07:42 AM

This article describes how to optimize ZooKeeper performance on Debian systems. We will provide advice on hardware, operating system, ZooKeeper configuration and monitoring. 1. Optimize storage media upgrade at the system level: Replacing traditional mechanical hard drives with SSD solid-state drives will significantly improve I/O performance and reduce access latency. Disable swap partitioning: By adjusting kernel parameters, reduce dependence on swap partitions and avoid performance losses caused by frequent memory and disk swaps. Improve file descriptor upper limit: Increase the number of file descriptors allowed to be opened at the same time by the system to avoid resource limitations affecting the processing efficiency of ZooKeeper. 2. ZooKeeper configuration optimization zoo.cfg file configuration

How to do Oracle security settings on Debian How to do Oracle security settings on Debian Apr 02, 2025 am 07:48 AM

To strengthen the security of Oracle database on the Debian system, it requires many aspects to start. The following steps provide a framework for secure configuration: 1. Oracle database installation and initial configuration system preparation: Ensure that the Debian system has been updated to the latest version, the network configuration is correct, and all required software packages are installed. It is recommended to refer to official documents or reliable third-party resources for installation. Users and Groups: Create a dedicated Oracle user group (such as oinstall, dba, backupdba) and set appropriate permissions for it. 2. Security restrictions set resource restrictions: Edit /etc/security/limits.d/30-oracle.conf

In the ChatGPT era, how can the technical Q&A community respond to challenges? In the ChatGPT era, how can the technical Q&A community respond to challenges? Apr 01, 2025 pm 11:51 PM

The technical Q&A community in the ChatGPT era: SegmentFault’s response strategy StackOverflow...

See all articles