Table of Contents
Model Architecture
Dataset of 109 languages
Training large models

Casually trained the biggest ViT in history? Google upgrades its visual language model PaLI to support 100+ languages

Apr 12, 2023 am 09:31 AM

Recent progress in natural language processing has largely come from large-scale language models: each new release pushes parameter counts and training-data volume to new highs, while also sweeping the existing benchmark leaderboards.

For example, in April this year Google released the 540-billion-parameter language model PaLM (Pathways Language Model), which surpassed humans on a series of language and reasoning evaluations and performed especially well in few-shot learning scenarios. PaLM is widely regarded as the direction for the next generation of language models.


Similarly, visual language models benefit from brute-force scaling: performance can be improved simply by increasing model size.

Of course, a visual language model that only handles multiple tasks is not very universal; it also needs to support input and output in multiple languages.

Recently, Google extended PaLM into PaLI (Pathways Language and Image model), which combines multilingual and image-understanding capabilities and supports 100+ languages across a variety of vision, language, and multimodal applications, such as visual question answering, image captioning, object detection, image classification, OCR, and text reasoning.


Paper link: https://arxiv.org/abs/2209.06794

The model is trained on a public image collection with automatically crawled annotations in 109 languages, referred to in the paper as the WebLI dataset.

PaLI models pre-trained on WebLI achieve state-of-the-art performance on multiple image and language benchmarks, such as COCO-Captions, TextCaps, VQAv2, OK-VQA, and TextVQA, and also surpass previous models on multilingual visual captioning and visual question answering benchmarks.

Model Architecture

One of the goals of PaLI is to study whether language and vision models behave the same way in terms of performance and scale, and in particular to examine the scalability of language-image models.

The architecture is therefore deliberately simple, designed mainly for experimental convenience, reusability, and scalability.


The model consists of a Transformer encoder that processes input text and an autoregressive Transformer decoder that generates output text.

When processing images, the input to the Transformer encoder also includes "visual words": tokens representing the image as processed by ViT.
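The shape of this multimodal input can be sketched in a few lines. The dimensions below are toy values chosen for illustration, not PaLI's actual configuration: the ViT's per-patch embeddings and the text token embeddings share one embedding width and are concatenated into a single encoder sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
d_model = 8          # shared embedding width of the encoder
num_patches = 4      # visual tokens produced by the ViT
text_len = 5         # tokenized input text length

# ViT output: one embedding per image patch (the "visual words").
visual_tokens = rng.normal(size=(num_patches, d_model))

# Text embeddings from the encoder's vocabulary table.
text_tokens = rng.normal(size=(text_len, d_model))

# PaLI-style multimodal input: visual and text tokens share one sequence.
encoder_input = np.concatenate([visual_tokens, text_tokens], axis=0)

print(encoder_input.shape)  # (9, 8)
```

The encoder then attends over this combined sequence exactly as it would over plain text, which is what lets a text-pretrained encoder-decoder be reused for image inputs.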

A key design decision in PaLI is reuse: the researchers seed the model with the weights of previously trained single-modality vision and language models (such as mT5-XXL and large ViTs). This reuse not only transfers single-modality capabilities, but also saves computation.

The visual component of the model uses the largest ViT architecture to date, ViT-e, which has the same structure and training parameters as the 1.8-billion-parameter ViT-G model, except that it is scaled up to 4 billion parameters.

Although scaling laws have been studied in both the vision and language fields, scaling behavior in combined vision-language models has received less attention, and scaling up the visual backbone may yield saturating gains on classification tasks.

The researchers confirmed this: ViT-e is only slightly better than ViT-G on ImageNet, but delivers large improvements on PaLI's visual language tasks. For example, ViT-e outperforms ViT-G by nearly 3 CIDEr points on COCO captioning. This hints at room for even larger ViT backbones in visual language tasks in the future.


For the language component, the researchers adopted the mT5 backbone, initializing PaLI's language encoder-decoder from pre-trained mT5-Large (1 billion parameters) and mT5-XXL (13 billion parameters), and then continuing mixed training on many language tasks, including pure language-understanding tasks. This helps avoid catastrophic forgetting of mT5's language understanding and generation capabilities.

This yields three PaLI models of different sizes.


Dataset of 109 languages

Scaling research in deep learning shows that the larger the model, the larger the training dataset it requires.

So, to comprehensively study and unlock the potential of language-image pre-training models, the researchers crawled a large amount of image and text data from the Internet and built a new dataset, WebLI, which includes 12 billion alt-texts and 10 billion images across 109 languages.


In addition to annotating with web text, the researchers used the Cloud Vision API to run OCR on the images, yielding 29 billion image-OCR data pairs.


Near-duplicate detection was used to deduplicate the training images against the training, validation, and test splits of 68 common vision and vision-language datasets, avoiding data leakage into downstream evaluation tasks.
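The paper does not spell out its near-duplicate method here, but the idea can be sketched with a simple perceptual hash: downsample each image to a tiny grid, threshold at the mean, and call two images near-duplicates when their bit patterns differ in only a few positions. The hash size and Hamming threshold below are illustrative choices, not PaLI's.

```python
import numpy as np

def average_hash(image, hash_size=8):
    """Tiny perceptual hash: block-average to hash_size x hash_size, threshold at the mean."""
    h, w = image.shape
    cropped = image[:h - h % hash_size, :w - w % hash_size]
    blocks = cropped.reshape(
        hash_size, cropped.shape[0] // hash_size,
        hash_size, cropped.shape[1] // hash_size,
    ).mean(axis=(1, 3))
    return (blocks > blocks.mean()).flatten()

def is_near_duplicate(img_a, img_b, max_hamming=5):
    """Near-duplicate if the two hashes differ in at most max_hamming bits."""
    return int(np.sum(average_hash(img_a) != average_hash(img_b))) <= max_hamming

# A gradient image is trivially a near-duplicate of itself, while its
# photographic negative flips every hash bit and is rejected.
img = np.arange(64 * 64, dtype=float).reshape(64, 64)
print(is_near_duplicate(img, img), is_near_duplicate(img, img.max() - img))  # True False
```

At WebLI's scale such hashes would be indexed (e.g. with locality-sensitive hashing) rather than compared pairwise, but the deduplication criterion is the same.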


To further improve data quality, the researchers also scored each pair by the cross-modal similarity of the image and its alt-text, tuning a threshold so that only the top 10% of images were kept; in total, 1 billion images were used to train PaLI.
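This kind of similarity filtering can be sketched as follows, using random vectors in place of real image and text embeddings (the embedding model and the exact scoring are assumptions; only the keep-the-top-10% rule comes from the text):

```python
import numpy as np

rng = np.random.default_rng(42)
n_pairs, dim = 1000, 16

# Stand-ins for image and alt-text embeddings from some cross-modal encoder.
img_emb = rng.normal(size=(n_pairs, dim))
txt_emb = rng.normal(size=(n_pairs, dim))

def cosine(a, b):
    """Row-wise cosine similarity between paired embeddings."""
    return np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )

scores = cosine(img_emb, txt_emb)

# Set the threshold at the 90th percentile so only the best-aligned
# 10% of image/alt-text pairs survive the filter.
threshold = np.quantile(scores, 0.9)
kept = scores >= threshold
print(int(kept.sum()))  # 100
```

In practice the threshold would be tuned on held-out data rather than fixed at a percentile, but the effect is the same: low-alignment pairs are discarded before pre-training.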

Training large models

Because visual-language tasks are multimodal, the model needs a range of semantic processing capabilities and must serve different goals. For example, some tasks require local object localization to be solved accurately, while others need more global semantic information.

Similarly, some language tasks may require long answers, while others may require compact answers.

To reconcile all these disparate goals, the researchers leveraged the richness of the WebLI pre-training data and introduced a Pretraining Task Mixture to prepare the model for various downstream applications.

To make the model versatile enough to solve a variety of tasks, the authors cast all tasks into a single common API (input: image + text; output: text), enabling knowledge sharing across multiple image and language tasks; this API is also shared with the pre-training setup.
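Casting heterogeneous tasks into one (image, text) → text interface amounts to wrapping every task in a text prompt. The templates below are placeholders invented for illustration, not PaLI's actual prompt strings:

```python
# Hypothetical prompt templates showing how captioning, VQA, and OCR can all
# be expressed through the same (image, text) -> text interface.
PROMPTS = {
    "caption": "Generate the caption in {lang}:",
    "vqa": "Answer in {lang}: {question}",
    "ocr": "Transcribe the text in the image:",
}

def to_common_api(task, image, **fields):
    """Wrap any task as a single (image, prompt-text) input; the target is always text."""
    return {"image": image, "text": PROMPTS[task].format(**fields)}

example = to_common_api("vqa", image="img_001",
                        lang="en", question="What color is the bus?")
print(example["text"])  # Answer in en: What color is the bus?
```

Because every task produces the same input/output signature, one encoder-decoder can be trained on all of them jointly, which is exactly what enables the knowledge sharing described above.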

The pre-training objectives are projected into this same API as a weighted mixture, with the goal of both preserving the reusability of model components and training the model to perform new tasks.
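A weighted task mixture is typically implemented by sampling each training example's task according to fixed mixing weights. The task names and weights below are made up for illustration; the paper does not report these exact values:

```python
import random

random.seed(0)

# Hypothetical mixing weights over pre-training objectives.
MIXTURE = {"captioning": 0.4, "vqa": 0.3, "ocr_text": 0.2, "pure_text_lm": 0.1}

def sample_task(mixture):
    """Draw one task name with probability proportional to its weight."""
    tasks, weights = zip(*mixture.items())
    return random.choices(tasks, weights=weights, k=1)[0]

# Over many draws, the empirical task frequencies track the weights.
counts = {t: 0 for t in MIXTURE}
for _ in range(10_000):
    counts[sample_task(MIXTURE)] += 1
print(counts)
```

Frameworks like T5X express the same idea declaratively as mixture definitions, but the sampling behavior is equivalent.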

The model is implemented with the open-source T5X and Flaxformer frameworks and trained with Flax in JAX; the ViT-e visual component uses the open-source BigVision framework. Word vectors from the language component and patch vectors from the visual component are concatenated and fed jointly into the multimodal encoder-decoder, which is initialized from mT5-XXL pre-training. During PaLI training, the weights of the visual component are frozen and only the weights of the multimodal encoder-decoder are updated.
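Freezing one tower while updating the other comes down to masking the optimizer update by parameter name. Here is a minimal sketch with a toy parameter tree and plain SGD; the parameter names are invented, and in Flax/optax this would usually be done with a partitioned optimizer rather than by hand:

```python
import numpy as np

# Toy parameter tree: "vit/" weights are frozen, "mt5/" weights train,
# mirroring PaLI's recipe of updating only the multimodal encoder-decoder.
params = {
    "vit/patch_proj": np.ones(3),
    "mt5/encoder": np.ones(3),
    "mt5/decoder": np.ones(3),
}
grads = {name: np.ones(3) for name in params}

def sgd_step(params, grads, lr=0.1, frozen_prefix="vit/"):
    """Apply SGD only to parameters outside the frozen visual tower."""
    return {
        name: p if name.startswith(frozen_prefix) else p - lr * grads[name]
        for name, p in params.items()
    }

new_params = sgd_step(params, grads)
print(new_params["vit/patch_proj"][0], new_params["mt5/encoder"][0])  # 1.0 0.9
```

Besides preserving the pre-trained visual representations, freezing the ViT also means its activations can be computed without storing backward-pass state, which reduces training cost.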

In the experiments, the researchers compared PaLI on common visual language benchmarks; PaLI achieves state-of-the-art results on these tasks, even surpassing much larger models from the previous literature.


For example, the 17-billion-parameter PaLI outperforms the 80-billion-parameter Flamingo model on several VQA and image captioning tasks.

PaLI also maintains good performance on language-only and vision-only tasks, even though these are not its main training objective.

The researchers also examined how the image and language model components interact under scaling, and where the model gains the most.

The conclusion is that jointly scaling both components yields the best performance. In particular, scaling the visual component, which requires relatively few additional parameters, is critical, and scaling is also important for improving performance on multilingual tasks.


Evaluating PaLI on the Crossmodal-3600 benchmark across 35 languages shows that multilingual captioning tasks benefit even more from scaling up the PaLI model.


To avoid creating or reinforcing unfair bias in large language and image models, transparency is needed about the data used and how models use it, along with testing model fairness and conducting responsible data analysis; the article therefore also provides a Data Card and a Model Card.


