
CVPR 2024 | A general image fusion model based on MoE, adding 2.8% parameters to complete multiple tasks

Apr 24, 2024 pm 02:28 PM

The AIxiv column is a column where this site publishes academic and technical content. In the past few years, the AIxiv column of this site has received more than 2,000 reports, covering top laboratories from major universities and companies around the world, effectively promoting academic exchanges and dissemination. If you have excellent work that you want to share, please feel free to contribute or contact us for reporting. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com.


  • Paper link: https://arxiv.org/abs/2403.12494
  • Code link: https://github.com/YangSun22/TC-MoA
  • Paper title: Task-Customized Mixture of Adapters for General Image Fusion


Figure 1: Dominant strength of the fusion results across different fusion tasks

## Research background and motivation

The purpose of image fusion is to integrate the complementary information of multi-source images of the same scene, captured by different sensors, into a single image. It is typically used to extract important information from the sources and improve visual quality.

Currently, general image fusion mainly covers multi-modal, multi-exposure, and multi-focus image fusion, and these tasks exhibit different fusion mechanisms. Multi-exposure image fusion (MEF) focuses on converting an image sequence with multiple exposure levels into a single high-quality, fully exposed image; each source image contributes its own lighting and structural information to the fused result. Visible-infrared image fusion (VIF) is a type of multi-modal image fusion (MMF) that aims to fuse complementary information from the infrared and visible modalities into robust, information-rich fused images: infrared images provide more intensity information, while visible images provide more texture and gradient information. Multi-focus image fusion (MFF) aims to generate a fully focused image from a series of partially focused images; each clear region of the fused image usually needs to be learned from only one source image. It can therefore be observed that MEF and VIF are relatively balanced fusions of multiple sources, whereas MFF has a more extreme multi-source character, often showing a polarized selection from one source for a given image region.

With the rapid development of deep learning, great progress has been made in image fusion in recent years, but most existing methods focus on a single fusion scenario, usually adopting a task-specific strategy such as a specialized network or a task-specific loss function, which cannot be directly applied to other tasks. Considering that the essence of the different fusion tasks is the same, namely integrating important information from multiple source images, some recently proposed methods try to handle multiple fusion tasks with a unified model and build a general image fusion framework. However, these methods either suffer from task-dominant bias or sacrifice task individuality for multi-task commonality, resulting in suboptimal performance. This motivates us to explore a more compatible fusion paradigm that can adaptively and dynamically accommodate different fusion scenarios.

To deal with this challenge, inspired by the powerful feature representation capabilities of pre-trained foundation models, we introduce a foundation model as a frozen encoder to extract the complementary features of multiple source images. Unlike most existing methods, we draw on the idea of Mixture of Experts (MoE) and treat each expert as an efficient, fine-tuned adapter that performs adaptive visual-feature prompt fusion on top of the foundation model. Task-specific routing networks tailor a mixture of these adapters to generate task-specific fusion prompts for different sources, forming the new Task-Customized Mixture of Adapters (TC-MoA) architecture. Additionally, we design a mutual information regularization to constrain the fusion prompts, ensuring complementarity across sources. Notably, the fusion prompts show significant task bias and differences in modality dominance strength. As shown in Figure 1, MFF prompts have larger color differences than those of VIF and MEF, indicating more bipolar feature selection in the intensity bias of the dominant mode. Our model effectively perceives the differences in fusion strength bias between fusion tasks within a single model and is therefore compatible with a wider range of fusion tasks.

Extensive experiments verify the superiority of our method in general image fusion, including multi-modal, multi-exposure, and multi-focus fusion. More importantly, TC-MoA shows notable controllability and generalization even on unknown fusion tasks, demonstrating its potential in a wider range of fusion scenarios.

## Main contributions

  • We propose a unified general image fusion model that provides a new Task-Customized Mixture of Adapters (TC-MoA) for adaptive multi-source image fusion, benefiting from dynamically aggregating the effective information of the respective sources.
  • We propose a mutual information regularization for the adapters, which enables the model to more accurately identify the dominant intensity of different source images.
  • To the best of our knowledge, this is the first MoE-based flexible adapter for general image fusion. By adding only 2.8% learnable parameters, our model can handle multiple fusion tasks. Extensive experiments demonstrate our advantages over competing methods, along with significant controllability and generalization.

## Core method

As shown in Figure 2, given a pair of source images, the network integrates complementary information from the different sources to obtain the fused image. The source images are fed into a ViT network, and their tokens are obtained through a patch-embedding layer. The ViT consists of an encoder for feature extraction and a decoder for image reconstruction, both composed of Transformer blocks.

In the encoder and decoder, a TC-MoA is inserted after every fixed number of Transformer blocks, and the network progressively modulates the fusion result through these TC-MoAs. Each TC-MoA consists of a task-specific router bank, a task-shared adapter bank, and a prompt-fusion layer F, and operates in two main stages: prompt generation and prompt-driven fusion. For ease of exposition, we take VIF as an example and assume the input comes from the VIF dataset.
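The placement described above can be sketched as follows. This is a minimal structural sketch, not the paper's implementation: the block count, insertion interval, and the stand-in `vit_block`/`tc_moa` functions are all hypothetical.

```python
import numpy as np

N_BLOCKS, INSERT_EVERY = 12, 4   # hypothetical: a TC-MoA after every 4th ViT block

def vit_block(x):                # stand-in for a frozen ViT Transformer block
    return x

calls = 0
def tc_moa(xa, xb):              # stand-in for one Task-Customized Mixture of Adapters
    global calls
    calls += 1
    return xa, xb

xa = xb = np.zeros((5, 8))       # token features of the two source images
for j in range(N_BLOCKS):
    xa, xb = vit_block(xa), vit_block(xb)
    if (j + 1) % INSERT_EVERY == 0:
        xa, xb = tc_moa(xa, xb)  # fusion is modulated progressively at these points
```

With 12 blocks and an interval of 4, the fusion module fires three times, which illustrates the "progressive modulation" idea: fusion is refined repeatedly rather than applied once at the end.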


Figure 2: Overall architecture of TC-MoA

Prompt generation. First, multi-source features are obtained for subsequent processing. The features extracted by the network before the j-th TC-MoA serve as the prompt-generation features, and we concatenate the features from the different sources as the representation of multi-source token pairs. This allows tokens from different sources to exchange information in the subsequent network. However, directly processing the high-dimensional concatenated features would introduce a large number of unnecessary parameters, so we apply a dimensionality-reduction layer to obtain the processed multi-source features.


Then, according to the task the input belongs to, we select a task-specific router from the router bank to customize the routing scheme, i.e., to decide which adapters in the adapter bank each pair of source tokens should enter.


Finally, we take a weighted sum of the adapters' outputs to obtain the fusion prompt. Each router has task preferences that select an appropriate adapter mixture, from which the prompts are generated.
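The routing-and-mixing step can be sketched as below. All dimensions, the bottleneck adapter shape, and the random weights are illustrative assumptions; only the mechanism (softmax routing over a shared adapter bank, then a routing-weighted sum) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_adapters = 8, 4
tokens = rng.normal(size=(5, dim))            # 5 dimension-reduced multi-source token pairs

# Task-specific router: a softmax distribution over adapters for each token pair
router_w = rng.normal(size=(dim, n_adapters))
logits = tokens @ router_w
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Task-shared adapter bank: each adapter is a small bottleneck MLP (down, ReLU, up)
adapters = [(rng.normal(size=(dim, 2)), rng.normal(size=(2, dim)))
            for _ in range(n_adapters)]
outs = np.stack([np.maximum(tokens @ d, 0) @ u for d, u in adapters])  # (n_adapters, 5, dim)

# Fusion prompt = routing-weighted sum of adapter outputs, per token
prompt = np.einsum('tn,nte->te', weights, outs)                         # (5, dim)
```

Because the adapters are shared across tasks while the routers are task-specific, different tasks reuse the same experts but mix them differently, which is where the small 2.8% parameter overhead comes from.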


Prompt-driven fusion. The task-customized prompts are subject to mutual information regularization (MIR), which guarantees complementarity across sources; a prompt therefore serves as an estimate of the proportion of important information in each source. Taking the element-wise product of the multi-source features and the prompts retains complementary information while removing redundancy. Then, considering that the feature representation should carry source-dependent biases (e.g., visible vs. infrared), we introduce an input-independent learnable parameter for each source, the source encoding s. After the features are modulated by the prompts and source encodings, we obtain the refined source features and pass them through the fusion layer F to obtain the fused feature.
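The fusion step described above can be sketched as follows. The complementary prompts, the tiny source encodings, and the linear stand-in for the fusion layer F are all assumed shapes for illustration, not the paper's exact layers.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tok, dim = 5, 8
feat_a = rng.normal(size=(n_tok, dim))    # refined tokens of source A (e.g., visible)
feat_b = rng.normal(size=(n_tok, dim))    # refined tokens of source B (e.g., infrared)

# Complementary prompts for the two sources; MIR encourages them to sum to one
p = rng.uniform(size=(n_tok, dim))
prompt_a, prompt_b = p, 1.0 - p

# Input-independent learnable source encodings (one bias vector per source)
s_a = rng.normal(size=dim) * 0.01
s_b = rng.normal(size=dim) * 0.01

# Element-wise modulation by the prompts, plus the source bias
ref_a = feat_a * prompt_a + s_a
ref_b = feat_b * prompt_b + s_b

# Stand-in for the fusion layer F: a linear map over concatenated refined features
W_f = rng.normal(size=(2 * dim, dim))
fused = np.concatenate([ref_a, ref_b], axis=-1) @ W_f   # (n_tok, dim)
```

The element-wise product is the key design choice: a prompt value near 1 keeps a source's feature channel, near 0 suppresses it, so the prompt directly expresses "what proportion of important information" each source contributes.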


Finally, we obtain a fused feature with task-customized prompts. To encourage the model to extract important information progressively, the features passed to the next Transformer block are defined as a mixture of the source features and the fused feature, weighted by a hyperparameter.
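One plausible form of this progressive update is a residual-style mix; the value of the mixing hyperparameter below is purely illustrative, and the exact update rule in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
x_a = rng.normal(size=(5, 8))     # current source-A features entering the next block
fused = rng.normal(size=(5, 8))   # fused feature produced by this TC-MoA
lam = 0.1                         # hypothetical value of the mixing hyperparameter

# Each source feature absorbs a fraction of the fused feature, so important
# information is injected step by step across the stacked TC-MoA layers
x_a_next = x_a + lam * fused
```

A small mixing coefficient keeps the frozen ViT features largely intact at each layer while still letting fusion information accumulate over depth.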


Mutual information regularization. To ensure that the model dynamically retains complementary information while discarding redundancy from the multi-source features, we impose a regularization constraint on the prompts. Assuming the feature representation changes linearly, we define MIR accordingly.
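One common reading of such a complementarity constraint is to penalize prompts whose element-wise sum deviates from one; the sketch below shows that reading under the stated linear-change assumption. The exact form of MIR in the paper may differ, so treat this as an illustration of the constraint's intent, not the paper's loss.

```python
import numpy as np

def mir_loss(prompt_a, prompt_b):
    """Sketch of a mutual-information-style regularizer: push the two
    sources' prompts to be complementary, i.e., sum element-wise to one,
    so that information kept from one source is dropped from the other."""
    return float(np.mean((prompt_a + prompt_b - 1.0) ** 2))

# Perfectly complementary prompts incur zero penalty
pa = np.full((4, 8), 0.7)
pb = np.full((4, 8), 0.3)
loss_complementary = mir_loss(pa, pb)

# Redundant prompts (both sources kept at the same strength) are penalized
loss_redundant = mir_loss(pa, pa)
```

Under this constraint, the prompt for one source is an estimate of the information proportion it contributes, and the other source automatically covers the remainder.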


## Experimental results

Qualitative and quantitative experiments. As shown in Figures 3-5 and Tables 1-3, qualitative and quantitative comparisons on the three fusion tasks show that our method surpasses previous general fusion methods. Compared with task-specific methods, it also achieves state-of-the-art performance on all tasks and even leads on some (VIF), demonstrating the superiority of the proposed method.


Figure 3: Qualitative comparison on the LLVIP dataset (VIF task)


Figure 4: Qualitative comparison on the MEFB dataset (MEF task)


Figure 5: Qualitative comparison on the MFF task datasets


Table 1: Quantitative comparison on the LLVIP dataset (VIF task)

Table 2: Quantitative comparison on the MEF task


Table 3: Quantitative comparison on the MFF task


Figure 6: Controllability and generalization to unknown tasks

## Controllability and generalization
As shown in Figure 6, by controlling the hyperparameters α and β of the fusion prompts, we can separately control the model's feature-selection strength for the complementary information of the source images (region level) and the similarity between the fused image and a particular source image (image level). By linearly transforming the prompts, we can ultimately generate customized fused images. For known tasks, such as multi-exposure fusion, we can obtain customized fusion results that best match human perception. For unknown tasks, we can tune the most appropriate fusion parameters and generalize the model to them.
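A minimal sketch of what such a linear transform of the prompts could look like: `customize`, its centering around 0.5, and the clipping are hypothetical choices for illustration; the paper's actual parameterization of α and β may differ.

```python
import numpy as np

def customize(prompt_a, alpha=1.0, beta=0.0):
    """Hypothetical linear transform of a fusion prompt:
    alpha scales the selection strength around the neutral value 0.5
    (region level), beta shifts the result toward one source (image
    level). The second source takes the complement so the pair sums to 1."""
    pa = np.clip(alpha * (prompt_a - 0.5) + 0.5 + beta, 0.0, 1.0)
    return pa, 1.0 - pa

# Example: sharpen selection (alpha=2) and bias slightly toward source A (beta=0.1)
p = np.full((4, 8), 0.6)
pa, pb = customize(p, alpha=2.0, beta=0.1)
```

With alpha above 1 the prompt becomes more bipolar, mimicking MFF-style hard selection; with alpha below 1 it flattens toward an MEF-style even blend, which is how a single trained model can be steered toward unseen fusion behaviors.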

