Apple's large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese-AI-php.cn

Table of Contents

Method Overview: The Secret to Building MM1

Home

Apple's large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese

王林

Mar 15, 2024 pm 02:43 PM

ai Model arrangement

Since this year, Apple has obviously increased its emphasis and investment in generative artificial intelligence (GenAI). At the recent Apple shareholders meeting, Apple CEO Tim Cook said that the company plans to make significant progress in the field of GenAI this year. In addition, Apple announced that it was abandoning its 10-year car-making project, which caused some team members originally engaged in car-making to begin turning to the GenAI field.

Through these initiatives, Apple has demonstrated to the outside world their determination to strengthen GenAI. Currently, GenAI technology and products in the multi-modal field have attracted much attention, especially OpenAI’s Sora. Apple naturally hopes to make a breakthrough in this area.

In a co-authored research paper "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training", Apple disclosed their research based on multimodal pre-training As a result, a multi-modal LLM series model containing up to 30B parameters was launched.

Apples large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese

Paper address: https://arxiv.org/pdf/2403.09611.pdf

at During the study, the team conducted in-depth discussions on the criticality of different architectural components and data selection. Through careful selection of image encoders, visual language connectors, and various pre-training data, they summarized some important design guidelines. Specifically, the main contributions of this study include the following aspects.

First, the researchers conducted small-scale ablation experiments on model architecture decisions and pre-training data selection, and discovered several interesting trends. The importance of modeling design aspects is in the following order: image resolution, visual encoder loss and capacity, and visual encoder pre-training data.

Secondly, the researchers used three different types of pre-training data: image captions, interleaved image text, and plain text data. They found that interleaved and text-only training data were important when it came to few-shot and text-only performance, while for zero-shot performance, subtitle data was most important. These trends persist after supervised fine-tuning (SFT), indicating that the performance and modeling decisions presented during pre-training are preserved after fine-tuning.

Finally, researchers built MM1, a multi-modal model series with parameters up to 30 billion (others are 3 billion and 7 billion), which consists of dense models It is composed of mixed experts (MoE) variants, which not only achieves SOTA in pre-trained indicators, but also maintains competitive performance after supervised fine-tuning on a series of existing multi-modal benchmarks.

The pre-trained model MM1 performs superiorly on subtitles and question and answer tasks in a few-shot scenario, outperforming Emu2, Flamingo and IDEFICS. MM1 after supervised fine-tuning also shows strong competitiveness on 12 multi-modal benchmarks.

Thanks to large-scale multi-modal pre-training, MM1 has good performance in context prediction, multi-image and thought chain reasoning. Similarly, MM1 demonstrates strong few-shot learning capabilities after instruction tuning.

Apples large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese

Method Overview: The Secret to Building MM1

Building a high-performance MLLM (Multimodal Large Language Model, multimodal large language model) is a highly practical work. Although the high-level architecture design and training process are clear, the specific implementation methods are not always obvious. In this work, the researchers describe in detail the ablations performed to build high-performance models. They explored three main design decision directions:

Architecture: The researchers looked at different pre-trained image encoders and explored connecting LLMs with these encoders Various ways to get up.
Data: The researcher considered different types of data and their relative mixing weights.
Training procedure: The researchers explored how to train MLLM, including hyperparameters and which parts of the model were trained when.

Ablation settings

Since training large MLLM will consume a lot of resources, The researchers used a simplified ablation setup. The basic configuration of ablation is as follows:

Image encoder: ViT-L/14 model trained with CLIP loss on DFN-5B and VeCap-300M; image size is 336 ×336.
Visual language connector: C-Abstractor, containing 144 image tokens.
Pre-training data: mixed subtitle images (45%), interleaved image text documents (45%) and plain text (10%) data.
Language Model: 1.2B Transformer Decoder Language Model.

To evaluate different design decisions, the researchers used zero-shot and few-shot (4 and 8 samples) performance on various VQA and image description tasks. : COCO Captioning, NoCaps, TextCaps, VQAv2, TextVQA, VizWiz, GQA and OK-VQA.

Model Architecture Ablation Experiment

The researchers analyzed the components that enable LLM to process visual data. Specifically, they studied (1) how to optimally pretrain a visual encoder, and (2) how to connect visual features to the space of LLMs (see Figure 3 left).

Apples large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese

Image encoder pre-training. In this process, researchers mainly ablated the importance of image resolution and image encoder pre-training goals. It should be noted that unlike other ablation experiments, the researchers used 2.9B LLM (instead of 1.2B) to ensure sufficient capacity to use some larger image encoders.
Encoder experience: Image resolution has the greatest impact, followed by model size and training data composition. As shown in Table 1, increasing the image resolution from 224 to 336 improves all metrics for all architectures by approximately 3%. Increasing the model size from ViT-L to ViT-H doubles the parameters, but the performance gain is modest, typically less than 1%. Finally, adding VeCap-300M, a synthetic caption dataset, improves performance by more than 1% in few-shot scenarios.

Apples large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese

Visual Language Connector and Image Resolution. The goal of this component is to transform visual representations into LLM space. Since the image encoder is ViT, its output is either a single embedding or a set of grid-arranged embeddings corresponding to input image segments. Therefore, the spatial arrangement of image tokens needs to be converted into the sequential arrangement of LLM. At the same time, the actual image token representation must also be mapped to the word embedding space.
VL connector experience: The number of visual tokens and image resolution are most important, while the type of VL connector has little impact. As shown in Figure 4, as the number of visual tokens or/and image resolution increases, the recognition rates of zero samples and few samples will increase.

Apples large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese

Pre-training data ablation experiment

Generally, the model The training is divided into two stages: pre-training and instruction tuning. The former stage uses network-scale data, while the latter stage uses mission-specific curated data. The following focuses on the pre-training phase of this article and details the researcher’s data selection (Figure 3 right).

There are two types of data commonly used to train MLLM: caption data consisting of image and text pair descriptions; and image-text interleaved documents from the web. Table 2 is the complete list of data sets:

Apples large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese

##Data Lesson 1: Interleaved data helps is used to improve few-sample and plain text performance, while subtitle data can improve zero-sample performance. Figure 5a shows the results for different combinations of interleaved and subtitled data.
Data experience 2: Plain text data helps improve few-sample and plain-text performance. As shown in Figure 5b, combining plain text data and subtitle data improves few-shot performance.
Data Lesson 3: Carefully blending image and text data results in optimal multimodal performance while retaining strong text performance. Figure 5c tries several mixing ratios between image (title and interlaced) and plain text data.
Data experience 4: Synthetic data helps with few-shot learning. As shown in Figure 5d, synthetic data does significantly improve the performance of few-shot learning, with absolute values of 2.4% and 4% respectively.

Apples large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese

Final model and training method

The researcher collected previous ablation results, Determine the final recipe for MM1 multi-modal pre-training:

Image encoder: Considering the importance of image resolution, the researcher used the ViT-H model with a resolution of 378x378px and pre-trained using the CLIP target on DFN-5B;
Visual language connector: Since the number of visual tokens is most important, the researcher used a VL connector with 144 tokens. The actual architecture does not seem to be important, and the researcher chose C-Abstract;
Data: In order to maintain the performance of zero samples and few samples, the researcher used the following carefully combined data: 45 % images-text interleaved documents, 45% images-text documents and 10% text-only documents.

To improve the performance of the model, the researchers expanded the size of the LLM to 3B, 7B, and 30B parameters. All models were fully unfrozen pretrained with a batch size of 512 sequences with a sequence length of 4096, a maximum of 16 images per sequence, and a resolution of 378 × 378. All models were trained using the AXLearn framework.

They performed a grid search on learning rates at small scale, 9M, 85M, 302M and 1.2B, using linear regression in log space to extrapolate from smaller models to larger Changes to the model (see Figure 6), the result is to predict the optimal peak learning rate η given the number of (non-embedded) parameters N:

Apples large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese

Extended via Mix of Experts (MoE). In experiments, the researchers further explored ways to extend the dense model by adding more experts to the FFN layer of the language model.

To convert a dense model to MoE, simply replace the dense language decoder with the MoE language decoder. To train MoE, the researchers used the same training hyperparameters and the same training settings as Dense Backbone 4, including training data and training tokens.

Regarding the multi-modal pre-training results, the researchers evaluated the pre-trained models on upper bound and VQA tasks with appropriate prompts. Table 3 evaluates zero samples and few samples:

Apples large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese

Supervised fine-tuning results

Finally, The researchers introduced supervised fine-tuning (SFT) experiments trained on top of pre-trained models.

They followed LLaVA-1.5 and LLaVA-NeXT and collected about 1 million SFT samples from different datasets. Given that intuitively higher image resolution leads to better performance, the researchers also adopted the SFT method extended to high resolution.

The results of supervised fine-tuning are as follows:

Table 4 shows the comparison with SOTA, "-Chat" indicates the MM1 model after supervised fine-tuning .

First, on average, the MM1-3B-Chat and MM1-7B-Chat outperform all listed models of the same size. MM1-3B-Chat and MM1-7B-Chat perform particularly well on VQAv2, TextVQA, ScienceQA, MMBench, and recent benchmarks (MMMU and MathVista).

Secondly, the researchers explored two MoE models: 3B-MoE (64 experts) and 6B-MoE (32 experts). Apple's MoE model achieved better performance than the dense model in almost all benchmarks. This shows the huge potential for further expansion of the MoE.

Third, for the 30B size model, MM1-30B-Chat performs better than Emu2-Chat37B and CogVLM-30B on TextVQA, SEED and MMMU. MM1 also achieves competitive overall performance compared to LLaVA-NeXT.

However, LLaVA-NeXT does not support multi-image inference, nor does it support few-sample prompts, because each image is represented as 2880 tokens sent to LLM, and the total number of tokens in MM1 There are only 720 of them. This limits certain applications involving multiple images.

Apples large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese

Figure 7b shows the impact of input image resolution on the average performance of the SFT evaluation index. Figure 7c shows that as the pre-training data increases, The performance of the model continues to improve.

The impact of image resolution. Figure 7b shows the impact of input image resolution on the average performance of the SFT evaluation metric.

Impact of pre-training: Figure 7c shows that as the pre-training data increases, the performance of the model continues to improve.

Apples large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese

For more research details, please refer to the original paper.

The above is the detailed content of Apple's large model MM1 is entering the market: 30 billion parameters, multi-modal, MoE architecture, more than half of the authors are Chinese. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

R.E.P.O. Best Graphic Settings

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows: Seashell Riddle Solution

1 weeks ago By DDD

R.E.P.O. How to Fix Audio if You Can't Hear Anyone

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Where to find the Crane Control Keycard in Atomfall

1 weeks ago By DDD

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Where is the login entrance for gmail email?

7444

CakePHP Tutorial

1371

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

Related knowledge

What method is used to convert strings into objects in Vue.js? Apr 07, 2025 pm 09:39 PM

When converting strings to objects in Vue.js, JSON.parse() is preferred for standard JSON strings. For non-standard JSON strings, the string can be processed by using regular expressions and reduce methods according to the format or decoded URL-encoded. Select the appropriate method according to the string format and pay attention to security and encoding issues to avoid bugs.

Laravel's geospatial: Optimization of interactive maps and large amounts of data Apr 08, 2025 pm 12:24 PM

Efficiently process 7 million records and create interactive maps with geospatial technology. This article explores how to efficiently process over 7 million records using Laravel and MySQL and convert them into interactive map visualizations. Initial challenge project requirements: Extract valuable insights using 7 million records in MySQL database. Many people first consider programming languages, but ignore the database itself: Can it meet the needs? Is data migration or structural adjustment required? Can MySQL withstand such a large data load? Preliminary analysis: Key filters and properties need to be identified. After analysis, it was found that only a few attributes were related to the solution. We verified the feasibility of the filter and set some restrictions to optimize the search. Map search based on city

Remote senior backend engineers (platforms) need circles Apr 08, 2025 pm 12:27 PM

Remote Senior Backend Engineer Job Vacant Company: Circle Location: Remote Office Job Type: Full-time Salary: $130,000-$140,000 Job Description Participate in the research and development of Circle mobile applications and public API-related features covering the entire software development lifecycle. Main responsibilities independently complete development work based on RubyonRails and collaborate with the React/Redux/Relay front-end team. Build core functionality and improvements for web applications and work closely with designers and leadership throughout the functional design process. Promote positive development processes and prioritize iteration speed. Requires more than 6 years of complex web application backend

Vue and Element-UI cascade drop-down box v-model binding Apr 07, 2025 pm 08:06 PM

Vue and Element-UI cascaded drop-down boxes v-model binding common pit points: v-model binds an array representing the selected values at each level of the cascaded selection box, not a string; the initial value of selectedOptions must be an empty array, not null or undefined; dynamic loading of data requires the use of asynchronous programming skills to handle data updates in asynchronously; for huge data sets, performance optimization techniques such as virtual scrolling and lazy loading should be considered.

Vue.js How to convert an array of string type into an array of objects? Apr 07, 2025 pm 09:36 PM

Summary: There are the following methods to convert Vue.js string arrays into object arrays: Basic method: Use map function to suit regular formatted data. Advanced gameplay: Using regular expressions can handle complex formats, but they need to be carefully written and considered. Performance optimization: Considering the large amount of data, asynchronous operations or efficient data processing libraries can be used. Best practice: Clear code style, use meaningful variable names and comments to keep the code concise.

How to optimize database performance after mysql installation Apr 08, 2025 am 11:36 AM

MySQL performance optimization needs to start from three aspects: installation configuration, indexing and query optimization, monitoring and tuning. 1. After installation, you need to adjust the my.cnf file according to the server configuration, such as the innodb_buffer_pool_size parameter, and close query_cache_size; 2. Create a suitable index to avoid excessive indexes, and optimize query statements, such as using the EXPLAIN command to analyze the execution plan; 3. Use MySQL's own monitoring tool (SHOWPROCESSLIST, SHOWSTATUS) to monitor the database health, and regularly back up and organize the database. Only by continuously optimizing these steps can the performance of MySQL database be improved.

How to use mysql after installation Apr 08, 2025 am 11:48 AM

The article introduces the operation of MySQL database. First, you need to install a MySQL client, such as MySQLWorkbench or command line client. 1. Use the mysql-uroot-p command to connect to the server and log in with the root account password; 2. Use CREATEDATABASE to create a database, and USE select a database; 3. Use CREATETABLE to create a table, define fields and data types; 4. Use INSERTINTO to insert data, query data, update data by UPDATE, and delete data by DELETE. Only by mastering these steps, learning to deal with common problems and optimizing database performance can you use MySQL efficiently.

How to set the timeout of Vue Axios Apr 07, 2025 pm 10:03 PM

In order to set the timeout for Vue Axios, we can create an Axios instance and specify the timeout option: In global settings: Vue.prototype.$axios = axios.create({ timeout: 5000 }); in a single request: this.$axios.get('/api/users', { timeout: 10000 }).

See all articles