
Adding fast and slow "eyes" to the video model: Apple's new training-free method beats all SOTA

WBOY | 2024-08-11
Since the release of Sora, the field of AI video generation has only grown livelier. Over the past few months, we have watched Jimeng, Runway Gen-3, Luma AI, and Kuaishou's Keling take the spotlight in turn.

Unlike earlier models, whose output could be identified as AI-generated at a glance, this batch of large video models may be the best we have ever seen.

However, behind the impressive performance of video large language models (LLMs) lie huge, finely annotated video datasets that are very costly to build. Recently, a number of innovative training-free methods have emerged: they use pretrained image LLMs to handle video tasks directly, bypassing the "expensive" training process.

In addition, most existing video LLMs suffer from two major drawbacks: (1) they can only handle video input with a limited number of frames, which makes it hard for the model to capture fine-grained spatial and temporal content; (2) they lack any temporal modeling design, simply feeding video features into the LLM and relying entirely on the LLM's ability to model motion.

To address these problems, Apple researchers proposed SlowFast-LLaVA (SF-LLaVA for short). The model is built on the LLaVA-NeXT architecture developed by the ByteDance team; it requires no additional fine-tuning and works out of the box. Inspired by two-stream networks, which have proved successful in action recognition, the research team designed a novel SlowFast input mechanism for video LLMs.

Simply put, SF-LLaVA understands both the details and the motion in a video by watching it at two different speeds (Slow and Fast):

  • Slow pathway: extracts features at a low frame rate while keeping as much spatial detail as possible (e.g., retaining 24×24 tokens every 8 frames)
  • Fast pathway: runs at a high frame rate but uses a larger spatial pooling stride to reduce the resolution, trading spatial detail for a larger temporal context and a sharper focus on the coherence of actions

This gives the model two "eyes": one looks slowly and attends to detail; the other looks quickly and attends to motion. The design resolves the pain points of most existing video LLMs, capturing detailed spatial semantics and long-range temporal context at the same time.
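To make the trade-off concrete, here is a back-of-the-envelope token-budget sketch for the two pathways. Only the "24×24 tokens every 8 frames" figure comes from the description above; the total frame count and the fast-path pooling stride below are our own illustrative assumptions, not settings confirmed by the paper.

    # Illustrative token budget for the two pathways. Only the "24x24 tokens
    # every 8 frames" figure comes from the article; N and the fast-path
    # pooling stride are assumed for the sake of the example.
    N = 48                 # uniformly sampled input frames (assumed)
    H = W = 24             # tokens per frame side from the image encoder

    # Slow pathway: keep one frame in every 8, with the full 24x24 spatial grid
    n_slow = N // 8
    slow_tokens = n_slow * H * W          # 6 * 576 = 3456 tokens

    # Fast pathway: keep all N frames, pool each 24x24 grid down to 4x4 (assumed)
    stride = 6
    fast_tokens = N * (H // stride) * (W // stride)   # 48 * 16 = 768 tokens

    print(slow_tokens, fast_tokens, slow_tokens + fast_tokens)

Under these assumed numbers, the Fast pathway covers six times as many frames as the Slow pathway while adding under a quarter as many tokens.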


Paper link: https://arxiv.org/pdf/2407.15841

Experimental results show that SF-LLaVA outperforms existing training-free methods by clear margins on all benchmarks, and matches or even beats carefully fine-tuned SFT models.


Model architecture

As shown in the figure below, SF-LLaVA follows the standard training-free video LLM process. It takes a video V and a question Q as input and outputs the corresponding answer A.

[Figure: overview of the SF-LLaVA architecture]

For the input, N frames are uniformly sampled from each video, whatever its size and length: I = {I_1, I_2, ..., I_N}. No special combination or ordering of the selected frames is required. Video features are then extracted independently, frame by frame, as F_v ∈ R^(N×H×W), where H and W are the height and width of the frame features.
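As a sketch of this sampling step (the helper below is ours, not code from the paper), uniform sampling simply picks N evenly spaced frame indices regardless of the clip's length:

    import numpy as np

    def uniform_frame_indices(total_frames: int, n: int) -> np.ndarray:
        """Return n evenly spaced frame indices for a clip of total_frames frames."""
        return np.linspace(0, total_frames - 1, num=n).round().astype(int)

    # A 600-frame clip sampled down to N = 48 frames; each selected frame is then
    # passed independently through the image encoder to build F_v.
    indices = uniform_frame_indices(600, 48)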

The next step is to process F_v further along the Slow and Fast pathways and combine the two into an effective video representation. The Slow pathway uniformly samples N^slow frame features from F_v, where N^slow ≤ N.

Earlier work found that appropriate pooling along the spatial dimension improves both efficiency and robustness. The team therefore applies a pooling step with stride σ_h×σ_w to the sampled features, obtaining the final features F_v^slow ∈ R^(N^slow×H'×W'), where H' = H/σ_h and W' = W/σ_w. The whole Slow pathway is shown in Equation 2.

    F_v^slow = pool_{σ_h×σ_w}(sample(F_v, N^slow))    (2)
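A minimal PyTorch sketch of the Slow pathway, assuming frame features in (N, C, H, W) layout and average pooling; the paper's exact pooling operator and stride values are not specified in the text above, so these are assumptions:

    import torch
    import torch.nn.functional as F

    def slow_pathway(f_v: torch.Tensor, n_slow: int, stride: int = 1) -> torch.Tensor:
        """Sample n_slow of the N frames uniformly, then pool spatially by `stride`.

        f_v: frame features of shape (N, C, H, W).
        Returns features of shape (n_slow, C, H // stride, W // stride).
        """
        n = f_v.shape[0]
        keep = torch.linspace(0, n - 1, n_slow).round().long()
        sampled = f_v[keep]              # uniform temporal sampling
        if stride == 1:                  # stride 1 keeps full spatial detail
            return sampled
        return F.avg_pool2d(sampled, kernel_size=stride)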

The Fast pathway keeps all N frame features in F_v so as to capture as much of the video's long-range temporal context as possible. Concretely, the team aggressively downsamples F_v with a spatial pooling stride of σ_h^fast×σ_w^fast, obtaining the final features F_v^fast ∈ R^(N×H''×W''), where H'' = H/σ_h^fast and W'' = W/σ_w^fast. The strides are set so that only a small token grid is kept per frame, letting the Fast pathway focus on modeling temporal context and motion cues. The whole Fast pathway is shown in Equation 3.

    F_v^fast = pool_{σ_h^fast×σ_w^fast}(F_v)    (3)
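And a matching sketch of the Fast pathway under the same assumptions: every frame is kept, but a much larger pooling stride shrinks each frame to a small token grid.

    import torch.nn.functional as F  # f_v layout (N, C, H, W), as in the Slow sketch

    def fast_pathway(f_v, stride: int = 6):
        """Keep all N frames; pool aggressively so each frame yields few tokens.

        With H = W = 24 and stride 6 (assumed values), each frame keeps a 4x4 grid.
        """
        return F.avg_pool2d(f_v, kernel_size=stride)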

Finally, the aggregated video features are F_v^aggr = [flat(F_v^slow), flat(F_v^fast)], where flat and [ , ] denote the flattening and concatenation operations, respectively. As the expression shows, F_v^aggr needs no special token to separate the Slow and Fast pathways; in total, SF-LLaVA uses N^slow×H'×W' + N×H''×W'' video tokens. The visual features F_v^aggr are then combined with the text input (for example, the user's question) and fed into the large language model (LLM) for processing.

The whole SlowFast aggregation is shown in Equation 4.

    F_v^aggr = [flat(F_v^slow), flat(F_v^fast)]    (4)
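A sketch of this aggregation step (again our own, with the token layout assumed): each pathway's features are flattened into a token sequence and the two sequences are concatenated with no separator token in between.

    import torch

    def aggregate(slow: torch.Tensor, fast: torch.Tensor) -> torch.Tensor:
        """Flatten (T, C, H, W) features into (T*H*W, C) tokens and concatenate.

        The result is the video token sequence that is combined with the text
        tokens (e.g. the user's question) before being fed to the LLM.
        """
        def flat(x: torch.Tensor) -> torch.Tensor:
            return x.permute(0, 2, 3, 1).reshape(-1, x.shape[1])  # tokens x channels

        return torch.cat([flat(slow), flat(fast)], dim=0)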

Experimental results

The research team conducted a comprehensive evaluation of SF-LLaVA, comparing it with current SOTA training-free models (such as IG-VLM and LLoVi) on multiple video question-answering tasks. They also compared it against video LLMs that had undergone supervised fine-tuning (SFT) on video datasets, such as VideoLLaVA and PLLaVA.

Open-ended VideoQA

As the table below shows, SF-LLaVA outperforms existing training-free methods on all open-ended VideoQA benchmarks. Specifically, with 7B- and 34B-parameter LLMs respectively, SF-LLaVA beats IG-VLM by 2.1% and 5.0% on MSRVTT-QA, by 5.7% and 1.5% on TGIF-QA, and by 2.0% and 0.8% on ActivityNet-QA.

Even compared with fine-tuned SFT methods, SF-LLaVA delivers comparable performance on most benchmarks; only on ActivityNet-QA do PLLaVA and LLaVA-NeXT-VideoDPO come out slightly ahead.

[Table: results on open-ended VideoQA benchmarks]

Multiple-choice VideoQA

As the table below shows, SF-LLaVA outperforms other training-free methods on all multiple-choice VideoQA benchmarks. On EgoSchema, a dataset demanding complex long-range temporal reasoning, the 7B and 34B versions of SF-LLaVA beat IG-VLM by 11.4% and 2.2%, respectively.

VideoTree still leads on the benchmark, but it is built on GPT-4, a proprietary model whose performance far exceeds that of open-source LLMs. Compared with SFT methods, the SF-LLaVA 34B model also achieves better results on EgoSchema, confirming the strength of the SlowFast design on long videos.

[Table: results on multiple-choice VideoQA benchmarks]

Text generation

As Table 3 shows, SF-LLaVA also has advantages on video-based text generation tasks. SF-LLaVA-34B surpasses all training-free baselines in overall performance, though it trails LLaVA-NeXT-Image slightly on detail orientation. Thanks to the SlowFast design, SF-LLaVA covers a longer temporal context with fewer visual tokens and therefore performs exceptionally well on temporal-understanding tasks.

Moreover, SF-LLaVA-34B also outperforms most SFT methods on text generation.

[Table 3: results on video-based text generation]

For more details, please refer to the original paper.

Source: jiqizhixin.com