One trillion tokens! The largest multimodal dataset in history arrives

Open-source multimodal large models may be about to take off.

While Llama 3.1 is grabbing the headlines, another very important release has quietly appeared: an open-source multimodal dataset of unprecedented scale.

For large models, the importance of datasets goes without saying; one could even say there are no large models without large datasets. With large multimodal models (LMMs) now developing rapidly, sufficiently large, high-quality, open-source multimodal datasets have become a pressing need in this field.

However, compared with open-source text datasets, existing open-source multimodal datasets are relatively small, lack diversity, and are sourced almost entirely from HTML documents, which limits the breadth and diversity of the data. This holds back the development of open-source LMMs and widens the gap between open-source and closed-source LMMs.

Recently, a joint team from the University of Washington, Salesforce Research, and Stanford University has filled this gap by building MINT-1T (Multimodal INTerleaved), a trillion-token-scale interleaved multimodal open-source dataset. It is without doubt the largest open-source multimodal dataset currently available.
  • Dataset address: https://github.com/mlfoundations/MINT-1T
  • Paper address: https://arxiv.org/abs/2406.11271
  • Paper title: MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

MINT-1T contains one trillion text tokens and three billion images in total, drawn from HTML, PDF, ArXiv, and many other sources. Before MINT-1T, the largest open-source dataset in this field was OBELICS, which contains 115 billion text tokens and 353 million images sourced exclusively from HTML. Figure 1 compares these datasets.
[Figure 1: comparison of multimodal datasets]
Construction of the dataset

First, the team collected a large amount of multimodal data from diverse sources (including HTML, PDF, and ArXiv). Figure 2 shows sample multimodal documents from these different sources.

[Figure 2: sample documents from different sources]
Then, to improve data quality and safety, they performed text quality filtering, image filtering, safety filtering (including the removal of NSFW images and personally identifiable information), and deduplication. Figure 3 briefly illustrates this data-filtering pipeline.
[Figure 3: the data filtering pipeline]
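
The article does not reproduce the team's filtering code, but the steps it lists map naturally onto a document-level pipeline. Below is a minimal Python sketch of such a pipeline under stated assumptions: the quality, image, NSFW, and PII helpers are illustrative stand-ins, and the exact-hash deduplication is a simplification of what large-scale pipelines actually do.

```python
import hashlib
import re

# --- Placeholder predicates: illustrative stand-ins, not the paper's filters ---

def passes_text_quality(text: str) -> bool:
    # Stand-in for real text-quality heuristics (language ID, gibberish checks, etc.):
    # here we only require a minimum amount of text.
    return len(text.split()) >= 50

def is_valid_image(image_bytes: bytes) -> bool:
    # Stand-in for resolution / aspect-ratio / decodability checks.
    return len(image_bytes) > 1024

def is_nsfw(image_bytes: bytes) -> bool:
    # Stand-in for an NSFW image classifier.
    return False

def scrub_pii(text: str) -> str:
    # Stand-in for PII removal: here we only mask email addresses.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)

def filter_documents(docs):
    """Document-level pipeline: text quality -> image checks -> safety -> dedup."""
    seen = set()
    for doc in docs:  # doc = {"text": str, "images": [bytes, ...]}
        if not passes_text_quality(doc["text"]):
            continue
        doc["images"] = [im for im in doc["images"]
                         if is_valid_image(im) and not is_nsfw(im)]
        if not doc["images"]:
            continue  # interleaved data needs at least one surviving image
        doc["text"] = scrub_pii(doc["text"])
        # Exact deduplication by text hash (real pipelines also use fuzzy dedup).
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc
```
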
In the end, the resulting MINT-1T dataset contains 922 billion HTML tokens, 106 billion PDF tokens, and 9 billion ArXiv tokens. Notably, the entire data processing pipeline consumed approximately 4.2 million CPU hours. Table 1 compares some common open-source and closed-source multimodal datasets.
[Table 1: comparison of common multimodal datasets]
Model experiments

The team also ran experiments training a multimodal model on this dataset and comparing it with models trained on other datasets.

The model architecture they used is Salesforce's XGen-MM, and they evaluated the in-context learning and multi-image reasoning capabilities of models trained on the dataset. The evaluation benchmarks include image captioning benchmarks (COCO and TextCaps), visual question answering benchmarks (VQAv2, OK-VQA, TextVQA, and VizWiz), and multi-image reasoning benchmarks (MMMU and Mantis-Eval).

Experimental results

Training on HTML documents

The team first compared the HTML portion of MINT-1T with OBELICS, since OBELICS was the previous leading multimodal dataset and is also based on HTML documents. They trained two models on 10 billion multimodal tokens from each of the two datasets and evaluated their in-context learning performance.
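
In-context (few-shot) evaluation of an interleaved multimodal model works by prepending k solved image-text examples to the query. The article does not describe the exact prompt format, so the sketch below is only an illustration: the `Example` structure, the "Question / Short answer" template, and the file paths are assumptions, and a real XGen-MM evaluation would additionally run the images through the model's own image processor.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Example:
    image_path: str   # path to the demonstration image
    question: str     # e.g., a VQA question
    answer: str       # gold answer, used only in the demonstrations

def build_k_shot_prompt(demos: List[Example], query_image: str,
                        query_question: str, k: int = 4) -> List[Tuple[str, str]]:
    """Return an interleaved (image, text) sequence for k-shot evaluation.

    Each demonstration contributes its image plus a question/answer pair;
    the final query contributes its image and question, with the answer
    left blank for the model to complete.
    """
    prompt: List[Tuple[str, str]] = []
    for ex in demos[:k]:
        prompt.append((ex.image_path,
                       f"Question: {ex.question} Short answer: {ex.answer}\n"))
    prompt.append((query_image, f"Question: {query_question} Short answer:"))
    return prompt

# Usage sketch (paths and questions are made up):
demos = [Example("demo1.jpg", "What color is the bus?", "red"),
         Example("demo2.jpg", "How many dogs are there?", "2")]
prompt = build_k_shot_prompt(demos, "query.jpg", "What is on the table?", k=2)
```
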

Table 2 gives the 4-shot and 8-shot performance on common benchmarks.
[Table 2: 4-shot and 8-shot results]
As can be seen, on the VQA (visual question answering) tasks the model trained on MINT-1T's HTML documents performs better than the model trained on OBELICS, but it does worse on the captioning tasks. On average, OBELICS is slightly better than MINT-1T (HTML).

Adding PDF and ArXiv documents

After that, the team tested the full MINT-1T dataset, which contains HTML, PDF, and ArXiv documents. They again sampled 10 billion multimodal tokens, drawing 50% from HTML, 45% from PDF, and 5% from ArXiv.
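
A fixed source mixture like this is typically implemented by drawing each training document from a weighted choice over sources until the token budget is reached. The sketch below illustrates the idea only; the source iterators, the token-counting helper, and the exhaustion handling are assumptions, not the team's actual training code, while the 50/45/5 weights and the 10-billion-token budget come from the article.

```python
import random

# Mixture weights from the article's setup: 50% HTML, 45% PDF, 5% ArXiv.
SOURCE_WEIGHTS = {"html": 0.50, "pdf": 0.45, "arxiv": 0.05}

def sample_mixture(source_iters, num_tokens_fn, token_budget=10_000_000_000,
                   weights=SOURCE_WEIGHTS, seed=0):
    """Yield documents until `token_budget` multimodal tokens have been drawn.

    `source_iters` maps source name -> iterator of documents, and
    `num_tokens_fn(doc)` returns a document's multimodal token count;
    the source of each document is chosen according to `weights`.
    """
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[n] for n in names]
    drawn = 0
    while drawn < token_budget:
        source = rng.choices(names, weights=probs, k=1)[0]
        try:
            doc = next(source_iters[source])
        except StopIteration:
            # Drop an exhausted source and keep sampling from the rest
            # (a simplification; the real streams are effectively unbounded).
            idx = names.index(source)
            names.pop(idx)
            probs.pop(idx)
            if not names:
                break
            continue
        drawn += num_tokens_fn(doc)
        yield doc
```
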

The results are also given in Table 2. As can be seen, the model trained on the mixed MINT-1T data outperforms the models trained on OBELICS and on MINT-1T (HTML) on most benchmarks.

On the more complex multimodal reasoning tasks, as shown in Table 3, the model trained with MINT-1T beats the model trained with OBELICS on MMMU, but falls short of it on the Mantis-Eval benchmark.
[Table 3: multi-image reasoning results]
For more fine-grained testing and the impact of model architecture, please refer to the original paper.

Can this ultra-large-scale open-source multimodal dataset become the starting point of a legend of its own, eventually giving rise to a family of multimodal large models like the Llama series? Let's wait and see.

Source: jiqizhixin.com