
Adding comprehensive audio-visual capabilities to large language models, DAMO Academy opens source Video-LLaMA

Jun 09, 2023 pm 09:28 PM

Video plays an increasingly important role in today’s social media and Internet culture. Douyin, Kuaishou, Bilibili, etc. have become popular platforms for hundreds of millions of users. Users share their life moments, creative works, interesting moments and other content around videos to interact and communicate with others.

Recently, large language models have demonstrated impressive capabilities. Can we equip large models with “eyes” and “ears” so that they can understand videos and interact with users?

Starting from this problem, researchers from DAMO Academy proposed Video-LLaMA, a large model with comprehensive audio-visual capabilities. Video-LLaMA can perceive and understand video and audio signals in videos, and can understand user input instructions to complete a series of complex tasks based on audio and video, such as audio/video description, writing, question and answer, etc. Currently, papers, codes, and interactive demos are all open. In addition, on the Video-LLaMA project homepage, the research team also provides a Chinese version of the model to make the experience of Chinese users smoother.


  • Paper link: https://arxiv.org/abs/2306.02858
  • Code address: https://github.com/DAMO-NLP-SG/Video-LLaMA


Video-LLaMA adopts a modular design that maps visual and audio modality information into the input space of a large language model, enabling the model to follow cross-modal instructions. Unlike previous large-model research focused on static image understanding (MiniGPT-4, LLaVA), Video-LLaMA faces two challenges in video understanding: capturing dynamic scene changes in the visual stream and integrating audio-visual signals.

To capture dynamic scene changes in videos, Video-LLaMA introduces a pluggable visual-language branch. This branch first uses the pre-trained image encoder from BLIP-2 to extract features for each individual frame, then adds the corresponding frame position embeddings. All frame features are fed into Video Q-Former, which aggregates the frame-level image representations into a fixed-length video representation. Finally, a linear layer aligns the video representation to the embedding space of the large language model.
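
The data flow of this branch can be sketched in a few lines of PyTorch-style code. This is a minimal illustration of the pipeline described above, not the official implementation: the module names, dimensions, and the simplified Q-Former (learnable query tokens cross-attending to frame features) are assumptions for readability.

```python
import torch
import torch.nn as nn

class VideoQFormer(nn.Module):
    """Simplified stand-in for Video Q-Former: learnable query tokens cross-attend to frame features."""
    def __init__(self, dim=768, num_queries=32, num_layers=2, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):                        # frame_feats: (B, T*P, dim)
        queries = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        return self.decoder(queries, frame_feats)          # (B, num_queries, dim)

class VisualBranch(nn.Module):
    def __init__(self, feat_dim=768, llm_dim=4096, max_frames=32):
        super().__init__()
        self.frame_pos = nn.Embedding(max_frames, feat_dim)    # frame position embedding
        self.video_qformer = VideoQFormer(dim=feat_dim)
        self.proj = nn.Linear(feat_dim, llm_dim)               # align to the LLM embedding space

    def forward(self, frame_feats):                        # (B, T, P, feat_dim) from a frozen BLIP-2 image encoder
        B, T, P, D = frame_feats.shape
        pos = self.frame_pos(torch.arange(T, device=frame_feats.device))
        feats = frame_feats + pos[None, :, None, :]        # add per-frame position embeddings
        video_tokens = self.video_qformer(feats.reshape(B, T * P, D))
        return self.proj(video_tokens)                     # (B, num_queries, llm_dim) video tokens for the LLM
```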

[Figure 1: Overall architecture of Video-LLaMA]

For the sound signals in the video, Video-LLaMA uses an audio-language branch. First, multiple two-second audio clips are uniformly sampled from the original video, and each clip is converted into a 128-dimensional mel spectrogram. Then the powerful ImageBind is used as the audio encoder to extract features from each clip individually. After adding learnable positional embeddings, Audio Q-Former aggregates the segment features into a fixed-length audio representation. As in the visual-language branch, a linear layer finally aligns the audio representation to the embedding space of the large language model.
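
The preprocessing step can be sketched with torchaudio: uniformly sample two-second clips and compute a 128-bin mel spectrogram per clip. The number of clips and the FFT settings below are assumed values, not the project's actual configuration; the resulting spectrograms would then pass through the frozen ImageBind encoder, the learnable positional embeddings, Audio Q-Former, and the linear projection.

```python
import torch
import torch.nn.functional as F
import torchaudio

def sample_mel_segments(waveform, sample_rate, num_segments=8, clip_seconds=2.0):
    """Uniformly sample fixed-length audio clips and return a 128-bin mel spectrogram per clip."""
    clip_len = int(clip_seconds * sample_rate)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=128)
    total = waveform.shape[-1]
    starts = torch.linspace(0, max(total - clip_len, 0), num_segments).long().tolist()
    segments = []
    for s in starts:
        clip = waveform[..., s:s + clip_len]
        if clip.shape[-1] < clip_len:                        # zero-pad a too-short trailing clip
            clip = F.pad(clip, (0, clip_len - clip.shape[-1]))
        segments.append(mel(clip))                           # (channels, 128, time_frames)
    return torch.stack(segments)                             # (num_segments, channels, 128, time_frames)
```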

To reduce training costs, Video-LLaMA freezes the pre-trained image/audio encoders and only updates the following parameters in the visual and audio branches: the Video/Audio Q-Formers, the positional embedding layers, and the linear layers (shown in Figure 1).
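
In code, this setup amounts to disabling gradients on the encoders and handing only the branch parameters to the optimizer. The attribute names below are hypothetical and only illustrate the idea:

```python
import torch

def collect_trainable_params(model):
    """Freeze the pre-trained encoders; train only Q-Formers, positional embeddings, and projections."""
    for p in model.image_encoder.parameters():       # frozen BLIP-2 visual encoder
        p.requires_grad = False
    for p in model.audio_encoder.parameters():       # frozen ImageBind audio encoder
        p.requires_grad = False
    trainable = []
    for module in (model.video_qformer, model.audio_qformer,
                   model.frame_pos, model.audio_pos,
                   model.video_proj, model.audio_proj):
        for p in module.parameters():
            p.requires_grad = True
            trainable.append(p)
    return trainable

# e.g. optimizer = torch.optim.AdamW(collect_trainable_params(model), lr=1e-4)
```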

To learn the alignment between vision and text, the authors first pre-trained the visual branch on a large-scale video-text dataset (WebVid-2M) and an image-text dataset (CC-595K). They then fine-tuned on image instruction datasets from MiniGPT-4 and LLaVA and a video instruction dataset from Video-Chat to achieve better cross-modal instruction-following ability.

For audio-text alignment, large-scale, high-quality audio-text data is lacking, so the authors adopted a workaround. The goal of the learnable parameters in the audio-language branch is to align the output of the audio encoder with the LLM's embedding space. The audio encoder, ImageBind, has very strong multi-modal alignment capabilities and can embed different modalities into a common space. The authors therefore train the audio-language branch on visual-text data, aligning ImageBind's common embedding space to the LLM's text embedding space, which in turn aligns the audio modality to the LLM's text embedding space. In this clever way, Video-LLaMA is able to understand audio at inference time even though it was never trained on audio data.
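
A conceptual sketch of this workaround is shown below. Here `encode_image_shared` and `encode_audio_shared` are placeholders for the frozen ImageBind encoders (not real API calls), and `llm` is a placeholder language model: the audio branch is trained on image-text pairs, and at inference time audio embeddings from the same shared space are fed through the identical branch.

```python
# Hypothetical sketch: train on visual-text pairs, run inference on audio.
def training_step(image, caption_ids, audio_branch, llm, encode_image_shared, loss_fn):
    shared_emb = encode_image_shared(image)           # frozen ImageBind: image -> shared embedding space
    soft_tokens = audio_branch(shared_emb)            # Audio Q-Former + linear -> LLM embedding space
    logits = llm(inputs_embeds=soft_tokens)           # condition the (frozen) LLM on the soft tokens
    return loss_fn(logits, caption_ids)               # the caption loss updates only the audio branch

def inference_step(audio, audio_branch, llm, encode_audio_shared):
    shared_emb = encode_audio_shared(audio)           # frozen ImageBind: audio -> the *same* shared space
    return llm.generate(inputs_embeds=audio_branch(shared_emb))   # works without any audio-text training
```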

Example display

The authors show several examples of Video-LLaMA's video-, audio-, and image-based dialogue.

(1) The following two examples demonstrate Video-LLaMA's comprehensive audio-visual perception; the conversations revolve around videos with sound. In the second example, only the performer appears on screen, while the audio contains the cheers and applause of the audience: a model that receives only the visual signal could not infer the audience's positive response. Conversely, there is no instrument sound in the audio, but a saxophone appears in the picture: a model that receives only the auditory signal would not know that the performer played a saxophone.

[Figure: example conversations about videos with sound]

(2) Video-LLaMA also has strong perceptual understanding of static images and can complete tasks such as image description and question answering.

[Figure: example of static image description and question answering]

(3) Surprisingly, Video-LLaMA can successfully identify famous landmarks and people and answer common-sense questions about them. For example, below Video-LLaMA correctly identifies the White House and gives an introduction to it. In another example, given a still photo of Daenerys (the "Mother of Dragons") and Jon Snow, characters from the classic TV series "Game of Thrones", Video-LLaMA not only identifies them successfully but also describes their tangled relationship.

[Figure: example of recognizing the White House]

[Figure: example of recognizing the "Game of Thrones" characters]

(4) Video-LLaMA also captures dynamic events in videos well, such as the motion of whistling and the direction in which a boat is traveling.

[Figure: example of capturing dynamic events in a video]

Summary

Currently, audio and video understanding remains a very complex problem with no mature solution. Although Video-LLaMA has shown impressive capabilities, the authors also note that it has some limitations.

(1) Limited perceptual ability: Video-LLaMA's visual and auditory abilities are still relatively rudimentary, and it still struggles to identify complex visual and audio information. Part of the reason is that the quality and scale of the available datasets are insufficient; the research group is working to build a high-quality audio-video-text alignment dataset to improve the model's perceptual capabilities.

(2) Difficulty processing long videos: long videos (such as movies and TV shows) contain a large amount of information and place high demands on the model's reasoning ability and computing resources.

(3) The hallucination problem inherent to language models still exists in Video-LLaMA.

Overall, as a large model with comprehensive audio-visual capabilities, Video-LLaMA has achieved impressive results in audio and video understanding. As the researchers continue their work, the challenges above are expected to be overcome one by one, giving audio-video understanding models broad practical value.

