Easily understand 4K HD images! This large multi-modal model automatically analyzes the content of web posters, making it very convenient for workers.

WBOY

Apr 23, 2024 am 08:04 AM


A large model that can automatically analyze the content of PDFs, web pages, posters, and Excel charts could hardly be more convenient for working people.

The InternLM-XComposer2-4KHD (abbreviated as IXC2-4KHD) model proposed by Shanghai AI Lab, the Chinese University of Hong Kong and other research institutions makes this a reality.


Whereas other multi-modal large models are limited to resolutions of no more than 1500x1500, this work raises the maximum input image resolution to beyond 4K (3840x1600), and supports arbitrary aspect ratios and dynamic resolutions ranging from 336 pixels up to 4K.

Three days after its release, the model topped the Hugging Face popularity list for visual question answering models.


Easy 4K image understanding

Let’s take a look at the results first.

The researchers fed in a screenshot of the paper (ShareGPT4V: Improving Large Multi-Modal Models with Better Captions) at a resolution of 2550x3300, and asked which model achieves the highest performance on MMBench.

It should be noted that this information is not mentioned in the text part of the input screenshot, but only appears in a rather complicated radar chart. Faced with such a tricky question, IXC2-4KHD successfully understood the information in the radar chart and answered the question correctly.


Faced with an input of even more extreme resolution (816x5133), IXC2-4KHD easily recognizes that the image consists of 7 parts and accurately describes the text content of each part.


The researchers then comprehensively tested IXC2-4KHD on 16 multi-modal large model benchmarks, 5 of which (DocVQA, ChartQA, InfographicVQA, TextVQA, OCRBench) focus on the model’s high-resolution image understanding capabilities.

With only 7B parameters, IXC2-4KHD matched or even surpassed GPT-4V and Gemini-Pro on 10 of these benchmarks, demonstrating that it is not limited to high-resolution image understanding but is also versatile across a variety of tasks and scenarios.


△ With only 7B parameters, the performance of IXC2-4KHD is comparable to GPT-4V and Gemini-Pro.

How to achieve 4K dynamic resolution?

In order to achieve the goal of 4K dynamic resolution, IXC2-4KHD includes three main designs:

(1) Dynamic resolution training:


△4K resolution image processing strategy

In the IXC2-4KHD framework, the input image is randomly enlarged to an intermediate size between its original area and the maximum area (no more than 55x336x336, equivalent to a resolution of 3840x1617).

Subsequently, the image is automatically cut into multiple 336x336 areas to extract visual features respectively. This dynamic resolution training strategy allows the model to adapt to visual input of any resolution, while also making up for the problem of insufficient high-resolution training data.
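The tiling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the released implementation: the function names (`choose_grid`, `sample_tile_budget`, `tile_boxes`) are hypothetical, and the grid rule (rows = ceil(cols x height / width), with cols x rows kept within the tile budget) is one plausible way to keep the aspect ratio roughly intact. Only the 336-pixel tile size and the 55-tile upper bound come from the article.

```python
import random

TILE = 336       # tile side length in pixels (from the article)
MAX_TILES = 55   # upper bound on the number of tiles (from the article)

# The article's arithmetic: 55 tiles of 336x336 cover the same area as 3840x1617.
assert MAX_TILES * TILE * TILE == 3840 * 1617


def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def choose_grid(width, height, max_tiles=MAX_TILES):
    """Return (cols, rows) for the largest grid that fits the tile budget.

    Assumed rule (hypothetical): rows = ceil(cols * height / width)
    and cols * rows <= max_tiles.
    """
    cols, rows = 1, min(max_tiles, ceil_div(height, width))
    while True:
        next_cols = cols + 1
        next_rows = ceil_div(next_cols * height, width)
        if next_cols * next_rows > max_tiles:
            return cols, rows
        cols, rows = next_cols, next_rows


def sample_tile_budget(width, height, max_tiles=MAX_TILES, rng=random):
    """Pick a random budget between the image's own tile count and max_tiles.

    This mimics "randomly enlarged to an intermediate size" during training.
    """
    native = min(max_tiles, ceil_div(width, TILE) * ceil_div(height, TILE))
    return rng.randint(native, max_tiles)


def tile_boxes(cols, rows):
    """Row-major crop boxes (left, top, right, bottom) on the resized image."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = choose_grid(3840, 1600)
    print(cols, rows, len(tile_boxes(cols, rows)))  # 11 5 55
```

Under this assumed rule, a 3840x1600 input uses the full budget of 55 tiles, resized to 3696x1680 before cropping. Each crop box can then be passed to any image library to extract the 336x336 tile.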

Experiments show that as the upper limit of dynamic resolution increases, the model improves steadily on high-resolution image understanding tasks (InfographicVQA, DocVQA, TextVQA), and performance has still not saturated at 4K resolution, which suggests potential for further gains at even higher resolutions.


(2) Add tile layout information:

To help the model adapt to widely varying dynamic resolutions, the researchers found it necessary to supply tile layout information as additional input. They adopted a simple strategy: a special ‘newline’ (‘\n’) token is inserted after each row of tiles to tell the model how the tiles are laid out. Experiments show that adding tile layout information has little effect on dynamic resolution training when the resolution varies relatively little (HD9, meaning no more than 9 tiles), but brings significant performance gains for dynamic 4K resolution training.
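The effect of the row separator can be shown with a small Python sketch. It uses placeholder strings instead of real visual features, and the names `flatten_tiles` and `newline_positions` are hypothetical, so it illustrates the idea rather than the model's actual code.

```python
NEWLINE = "\n"  # stand-in for the special 'newline' token


def flatten_tiles(cols, rows, newline=NEWLINE):
    """Flatten a rows x cols tile grid row by row, adding a separator after each row."""
    sequence = []
    for r in range(rows):
        for c in range(cols):
            sequence.append(f"<tile r{r} c{c}>")  # placeholder for one tile's features
        sequence.append(newline)
    return sequence


def newline_positions(sequence, newline=NEWLINE):
    """Indices of the separators, which is what reveals the grid shape."""
    return [i for i, token in enumerate(sequence) if token == newline]


if __name__ == "__main__":
    wide = flatten_tiles(cols=3, rows=2)
    tall = flatten_tiles(cols=2, rows=3)
    print(newline_positions(wide))  # [3, 7]
    print(newline_positions(tall))  # [2, 5, 8]
```

Both grids contain 6 tiles, so without the separators a 3x2 layout and a 2x3 layout would produce sequences of the same length with no indication of where each row ends. The separator positions make the two layouts distinguishable.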


(3) Expanding the resolution during the inference phase:

The researchers also found that a model trained with dynamic resolution can be scaled up directly at inference time by raising the maximum number of tiles, which brings additional performance gains. For example, evaluating a model trained with HD9 (up to 9 tiles) directly under the HD16 setting yields a performance improvement of up to 8% on InfographicVQA.
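To get a feel for what a larger tile budget buys at inference time, the Python sketch below compares the working resolution of a 2550x3300 page (the size of the paper screenshot mentioned earlier) under a 9-tile and a 16-tile budget. The grid rule (rows = ceil(cols x height / width), cols x rows within the budget) and the function names are assumptions made for illustration, not the released implementation.

```python
TILE = 336  # tile side length in pixels (from the article)


def choose_grid(width, height, max_tiles):
    """Largest (cols, rows) grid within the budget, under the assumed rule."""
    cols, rows = 1, min(max_tiles, -(-height // width))
    while True:
        next_cols = cols + 1
        next_rows = -(-next_cols * height // width)  # integer ceiling division
        if next_cols * next_rows > max_tiles:
            return cols, rows
        cols, rows = next_cols, next_rows


def effective_resolution(width, height, max_tiles):
    """Pixel size of the resized image that is actually cut into tiles."""
    cols, rows = choose_grid(width, height, max_tiles)
    return cols * TILE, rows * TILE


if __name__ == "__main__":
    for budget in (9, 16):
        print(budget, effective_resolution(2550, 3300, budget))
    # 9 (672, 1008)
    # 16 (1008, 1344)
```

Under these assumptions the same page is seen at 2x3 tiles with a budget of 9 and at 3x4 tiles with a budget of 16, so twice as many tiles carry the fine print and chart details, with no retraining involved.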


IXC2-4KHD raises the resolution supported by multi-modal large models to the 4K level. The researchers said that the current strategy of supporting larger image inputs by increasing the number of tiles runs into computational cost and GPU memory bottlenecks, so they plan to propose a more efficient strategy to support higher resolutions in the future.

Paper link:

https://arxiv.org/pdf/2404.06512.pdf

Project link:

https://github.com/InternLM/InternLM-XComposer

