
Low-quality multi-modal data fusion, multiple institutions jointly published a review paper

May 08, 2024

AIxiv is a column in which this site publishes academic and technical content. Over the past few years, the column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

Multimodal fusion is one of the fundamental tasks in multimodal intelligence.

The motivation of multimodal fusion is to jointly exploit the useful information in different modalities to improve the accuracy and stability of downstream tasks. Traditional multimodal fusion methods, however, often rely on high-quality data and struggle to adapt to the complex, low-quality multimodal data encountered in real applications.

The survey "Multimodal Fusion on Low-quality Data: A Comprehensive Survey," jointly released by Tianjin University, Renmin University of China, the Singapore Agency for Science, Technology and Research (A*STAR), Sichuan University, Xidian University, and Harbin Institute of Technology (Shenzhen), introduces the challenges of fusing multimodal data from a unified perspective and systematically organizes existing fusion methods for low-quality multimodal data as well as potential directions for the field.
arXiv link:
http://arxiv.org/abs/2404.18947
awesome-list link:
https://github.com/QingyangZhang/awesome-low-quality-multimodal-learning

Traditional multimodal fusion models

Humans perceive the world by fusing information from multiple modalities.

Humans can process such low-quality multimodal signals and perceive the environment even when some modalities are unreliable.

Although multimodal learning has made great progress, multimodal machine-learning models still lack the ability to effectively fuse the low-quality multimodal data found in the real world. In practice, the performance of traditional multimodal fusion models declines significantly in the following scenarios:

(1) Noisy multimodal data: some features of some modalities are corrupted by noise and lose their original information. In the real world, unknown environmental factors, sensor failures, and signal loss during transmission can all introduce noise, undermining the reliability of multimodal fusion models.

(2) Missing multimodal data: owing to various practical factors, some modalities of collected multimodal samples may be missing. For example, in the medical field, the multimodal data formed by a patient's various physiological examinations may be severely incomplete; some patients may never have undergone a particular examination at all.

(3) Imbalanced multimodal data: heterogeneous encoding properties and differences in information quality across modalities give rise to imbalanced learning between them. During fusion, the model may over-rely on certain modalities and ignore potentially useful information carried by the others.

(4) Dynamically low-quality multimodal data: because application environments are complex and changing, modality quality varies dynamically across samples, time, and space. The occurrence of low-quality modality data is often hard to predict in advance, which poses challenges for multimodal fusion.

To fully characterize the nature of low-quality multimodal data and how to handle it, this article summarizes current machine-learning methods for low-quality multimodal fusion, systematically reviews the field's development, and looks ahead to issues that require further research.


Figure 1. Schematic classification of low-quality multimodal data; yellow and blue represent two modalities, and deeper color indicates higher quality

Denoising methods in multimodal fusion

Problem definition:

Noise is one of the most common causes of degraded multimodal data quality.

This article mainly focuses on two types of noise:

(1) Modality-specific noise at the feature level. This type of noise may be caused by sensor error (such as instrument error in medical diagnosis) or environmental factors (such as rain and fog in autonomous driving). The noise is confined to certain features within a specific modality.

(2) Cross-modal noise at the semantic level. This type of noise arises from the misalignment of high-level semantics between modalities and is harder to handle than feature-level noise. Fortunately, thanks to the complementarity and redundancy of information across modalities, combining information from multiple modalities has proven to be an effective denoising strategy during fusion.

Method classification:

Feature-level multimodal denoising methods depend heavily on the specific modalities involved in the task.

This article takes multimodal image fusion as its main example. In multimodal image fusion, the mainstream denoising approaches are weighted fusion and joint variational methods.

Weighted fusion methods assume that feature noise is random while the true data follows a specific distribution, and eliminate the influence of noise through weighted summation;
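As a minimal sketch of the weighted-summation idea (assuming zero-mean noise with a known variance per modality; both assumptions are illustrative, not from the survey), inverse-variance weighting gives cleaner modalities larger weights:

```python
def inverse_variance_fusion(observations, variances):
    """Fuse per-modality estimates of the same quantity by weighted
    summation, with weights inversely proportional to noise variance.

    Under zero-mean independent noise, this weighting minimizes the
    variance of the fused estimate.
    """
    raw = [1.0 / v for v in variances]
    total = sum(raw)
    weights = [w / total for w in raw]
    fused = sum(w * x for w, x in zip(weights, observations))
    return fused, weights

# Two modalities observe the same pixel value; the first is noisier.
fused, weights = inverse_variance_fusion([0.9, 1.1], [0.4, 0.1])
# The cleaner modality (variance 0.1) dominates the fused estimate.
```

In practice the variances are unknown and must themselves be estimated, which is where the modeling effort of real weighted-fusion methods goes.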

Joint variational methods extend traditional single-modality variational image denoising: they cast denoising as an optimization problem and exploit complementary information from multiple modalities to improve the result.

Semantic-level cross-modal noise arises from weakly aligned or misaligned multimodal sample pairs.
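To make the joint variational idea concrete, one representative (illustrative, not method-specific) objective couples a per-modality data-fidelity term with a shared regularizer that encourages the modalities to agree on edge locations:

```latex
\min_{u_1,\dots,u_M}\;
\sum_{m=1}^{M} \frac{\lambda_m}{2}\,\lVert u_m - f_m \rVert_2^2
\;+\;
\int_{\Omega} \Bigl( \sum_{m=1}^{M} \lVert \nabla u_m(x) \rVert^2 \Bigr)^{1/2} \mathrm{d}x
```

Here f_m is the observed noisy image of modality m, u_m its denoised estimate, and lambda_m a fidelity weight; the coupled total-variation term is what lets complementary modalities reinforce each other's structure instead of being denoised independently.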

For example, in multimodal object detection with paired RGB and thermal images, sensor differences mean that although the same target appears in both modalities, its precise position and pose may differ slightly between them (weak alignment), which makes accurate position estimation challenging.

In content-understanding tasks on social media, the semantic information carried by the image and text modalities of a sample (such as a Weibo post) may differ greatly or even be completely unrelated (complete misalignment), which poses an even greater challenge for multimodal fusion. Approaches to handling cross-modal semantic noise include rule-based filtering, model-based filtering, and noise-robust model regularization.

Future Outlook:

Although handling data noise has long been studied extensively in classic machine-learning tasks, how to jointly exploit the complementarity and consistency between modalities to mitigate the impact of noise in multimodal settings remains an open research problem.

In addition, unlike traditional feature-level denoising, how to handle semantic-level noise during the pre-training and inference of large multimodal models is an interesting and highly challenging question.


Table 1. Classification of multimodal fusion methods for noisy data

Fusion methods for missing multimodal data

Problem definition:

Multimodal data collected in real scenarios is often incomplete. Owing to factors such as storage-device damage and unreliable data transmission, multimodal data frequently and unavoidably loses part of its modality information.

For example, in recommender systems, a user's browsing history and credit rating constitute multimodal data, but permission and privacy constraints often make it impossible to collect user information from all modalities when building a multimodal learning system.

In medical diagnosis, limited equipment at some hospitals and the high cost of specific examinations mean that the multimodal diagnostic data of different patients is often highly incomplete.

Method classification:

Based on whether the missing modality data must be explicitly completed, fusion methods for missing multimodal data can be divided into:

(1) Completion-based multimodal fusion methods

Completion-based methods include model-agnostic completion, for example directly filling a missing modality with zeros or with the mean computed from the remaining samples of that modality;

graph- or kernel-based completion, which does not directly learn to complete the raw multimodal data, but instead builds a graph or kernel for each modality, learns similarity or correlation information between sample pairs, and then uses it to complete the missing data;

and direct completion at the raw feature level, where some methods use generative models, such as generative adversarial networks (GANs) and their variants, to complete the missing features directly.

(2) Completion-free multimodal fusion methods.

Unlike completion-based methods, completion-free methods focus on exploiting the useful information in the non-missing modalities to learn the best possible fused representation. They typically add constraints to the unified representation being learned so that it reflects the complete information of the observable modalities, thereby bypassing the completion step during fusion.
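A completion-free baseline in the same spirit (a sketch, not a specific method from the survey) simply restricts fusion to the observed modalities, e.g. a mask-renormalized average of modality embeddings:

```python
def masked_average_fusion(embeddings, mask):
    """Completion-free fusion: average only the observed modality
    embeddings per sample, renormalizing by how many are present.

    `embeddings[i][m]` is the embedding of modality m for sample i;
    `mask[i][m]` is True when that modality was actually observed.
    """
    fused = []
    for vecs, obs in zip(embeddings, mask):
        present = [v for v, o in zip(vecs, obs) if o]
        dim = len(vecs[0])
        fused.append([sum(v[d] for v in present) / len(present)
                      for d in range(dim)])
    return fused

# Sample 0 has both modalities; sample 1 is missing modality 1.
emb = [[[1.0, 0.0], [3.0, 2.0]], [[5.0, 4.0], [0.0, 0.0]]]
mask = [[True, True], [True, False]]
out = masked_average_fusion(emb, mask)
```

Real completion-free methods replace the plain average with learned constraints on the shared representation, but the masking-and-renormalizing pattern is the common core.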
Future outlook:

Although many methods have been proposed, both domestically and abroad, to address incomplete multimodal data fusion in classic machine-learning tasks such as clustering and classification, some deeper challenges remain.
For example, quality assessment of the completed data in missing-modality completion schemes is often overlooked.

In addition, strategies that use prior knowledge of missing-data locations to mask out the missing modality can hardly compensate for the information gap and information imbalance that the missing modality causes.

Table 2. Classification of fusion methods for missing multimodal data

Balanced multimodal fusion methods

Problem definition:

In multimodal learning, joint training is usually used to integrate data from different modalities and improve the model's overall performance and generalization. However, this widely adopted joint-training paradigm, with its single unified learning objective, ignores the heterogeneity of data from different modalities.

On the one hand, the heterogeneity of different modalities in data source and form gives them different characteristics, such as convergence speed. This makes it difficult for all modalities to be processed and learned well at the same time, complicating multimodal joint learning;

on the other hand, this difference is also reflected in the quality of the unimodal data. Although all modalities describe the same concept, they vary in how much information they carry about the target event or object. Deep neural networks trained with maximum-likelihood objectives exhibit greedy learning behavior: multimodal models tend to rely on high-quality, highly discriminative modalities that are easier to learn, while under-modeling the information in other modalities.

To address these challenges and improve the learning quality of multimodal models, research on balanced multimodal learning has recently received widespread attention.

Method classification:

According to the angle from which balance is sought, related methods can be divided into methods based on characteristic differences and methods based on quality differences.

(1) The widely used multimodal joint-training framework often ignores the inherent differences in the learning properties of unimodal data, which can hurt model performance. Methods based on characteristic differences start from the differences in each modality's learning characteristics and address the problem through learning objectives, optimization, and architecture.

(2) Recent research further finds that multimodal models often rely heavily on certain high-quality modalities while ignoring the others, leaving some modalities insufficiently learned. Methods based on quality differences start from this observation and promote the balanced utilization of modalities through learning objectives, optimization methods, model architecture, and data augmentation.

Table 3. Classification of balanced multimodal data fusion methods

Future outlook:

Balanced multimodal learning methods mainly target the differences in learning characteristics or data quality between modalities caused by the heterogeneity of multimodal data, and they propose solutions from perspectives such as learning objectives, optimization methods, model architecture, and data augmentation.

Balanced multimodal learning is a flourishing field, and many theoretical and application directions remain under-explored. For example, current methods are mainly limited to typical multimodal tasks, mostly discriminative with a few generative ones.

In addition, large multimodal models must also combine modality data of varying quality and face the same objective imbalance problem. Accordingly, extending existing research or designing new solutions for large multimodal model scenarios is a promising direction.

Dynamic multimodal fusion methods

Problem definition:

Dynamic multimodal data means that modality quality changes dynamically with the input sample and scenario. For example, in autonomous driving, the system perceives the road surface and targets through RGB and infrared sensors. Under good lighting, the RGB camera better supports the intelligent system's decisions because it captures the target's rich texture and color information;

however, at night, when light is insufficient, the perception information provided by the infrared sensor is more reliable. Enabling the model to automatically sense changes in the quality of different modalities, and thus fuse them accurately and stably, is the core task of dynamic multimodal fusion.
Method classification:

Dynamic multi-modal fusion methods can be roughly divided into three categories:

(1) Heuristic dynamic fusion methods:

Heuristic dynamic fusion methods rely on the algorithm designer's understanding of the application scenario and are generally realized by introducing a dynamic fusion mechanism.

For example, in multimodal object detection with cooperating RGB and thermal signals, researchers heuristically designed an illumination-aware module that dynamically evaluates the illumination of the input image and adjusts the fusion weights of the RGB and thermal modalities according to light intensity: when the scene is bright, decisions rely mainly on the RGB modality, and in dark scenes mainly on the thermal modality.
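A toy sketch of such an illumination-aware gate (the mean-brightness heuristic is an illustrative stand-in for the learned illumination module):

```python
def illumination_gate(rgb_image):
    """Hypothetical illumination score: mean pixel brightness in [0, 1]."""
    flat = [p for row in rgb_image for p in row]
    return sum(flat) / len(flat)

def fuse_rgb_thermal(rgb_feat, thermal_feat, rgb_image):
    """Heuristic dynamic fusion: weight RGB features by the estimated
    illumination and thermal features by its complement."""
    w = illumination_gate(rgb_image)
    return [w * r + (1.0 - w) * t for r, t in zip(rgb_feat, thermal_feat)]

day = [[0.9, 0.8], [0.7, 0.6]]    # bright scene -> trust RGB
night = [[0.1, 0.0], [0.2, 0.1]]  # dark scene -> trust thermal
out_day = fuse_rgb_thermal([1.0, 1.0], [0.0, 0.0], day)
out_night = fuse_rgb_thermal([1.0, 1.0], [0.0, 0.0], night)
```

The strength of heuristic gating is its transparency; its weakness is that the gate encodes the designer's assumptions rather than anything learned from data.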

(2) Attention-based dynamic fusion methods:

Attention-based dynamic fusion methods mainly focus on representation-level fusion. The attention mechanism is inherently dynamic, so it lends itself naturally to dynamic multimodal fusion tasks.

Mechanisms such as self-attention, spatial attention, channel attention, and Transformers are widely used to build multimodal fusion models. Driven by the task objective, such methods automatically learn how to fuse dynamically; attention-based fusion can, to a certain extent, adapt to dynamically low-quality multimodal data without explicit or heuristic guidance.
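A minimal attention-style fusion sketch (the fixed query vector here stands in for parameters that would normally be learned end-to-end):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_fusion(query, modality_feats):
    """Attention-style dynamic fusion: score each modality embedding
    against a query vector, softmax the scores into weights, and take
    the weighted sum. The weights adapt to each input sample."""
    scores = [sum(q * f for q, f in zip(query, feat))
              for feat in modality_feats]
    weights = softmax(scores)
    dim = len(query)
    fused = [sum(w * feat[d] for w, feat in zip(weights, modality_feats))
             for d in range(dim)]
    return fused, weights

# The query aligns with modality 0, so it receives the larger weight.
feats = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = attention_fusion([2.0, 0.0], feats)
```

Because the weights are produced per sample, the same model can lean on different modalities for different inputs, which is exactly the adaptivity the text describes.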

(3) Uncertainty-aware dynamic fusion methods:

Uncertainty-aware dynamic fusion methods usually have clearer and more interpretable fusion mechanisms. Unlike complex attention-based fusion, they rely on uncertainty estimates of the modalities (such as evidence, energy, or entropy) to adapt to low-quality multimodal data.

Specifically, uncertainty estimation can characterize the quality change of each modality of the input data: when the quality of a certain modality becomes low, the uncertainty of the model's decision based on that modality rises, providing clear guidance for the subsequent fusion mechanism. Moreover, compared with heuristics and attention mechanisms, uncertainty-aware dynamic fusion methods can come with sound theoretical guarantees.
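A sketch of entropy-based uncertainty weighting at the decision level (the inverse-entropy rule is illustrative; published methods use e.g. evidential, energy-based, or other uncertainty scores):

```python
import math

def entropy(probs):
    """Shannon entropy of a class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_weighted_fusion(modality_probs):
    """Uncertainty-aware late fusion: weight each modality's class
    distribution by the inverse of its predictive entropy, so that
    confident (low-entropy) modalities dominate the fused decision."""
    eps = 1e-8  # guard against division by zero for one-hot outputs
    weights = [1.0 / (entropy(p) + eps) for p in modality_probs]
    total = sum(weights)
    weights = [w / total for w in weights]
    n_classes = len(modality_probs[0])
    fused = [sum(w * p[c] for w, p in zip(weights, modality_probs))
             for c in range(n_classes)]
    return fused, weights

confident = [0.95, 0.05]  # low entropy -> trusted
uncertain = [0.5, 0.5]    # maximum entropy -> down-weighted
fused, weights = uncertainty_weighted_fusion([confident, uncertain])
```

When a modality degrades at test time its entropy rises and its weight falls automatically, which is the "quality sensing" behavior described above.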

Future Outlook:

Although the superiority of uncertainty-aware dynamic fusion has been demonstrated both experimentally and theoretically in traditional multimodal fusion tasks, the idea of dynamic fusion also holds great potential for exploration in state-of-the-art multimodal models (not limited to fusion models, e.g., CLIP/BLIP).

In addition, dynamic fusion mechanisms with theoretical guarantees are often limited to the decision level; how to make them work at the representation level is also worth exploring.
