
2.8 million multimodal instruction-response pairs in eight languages: MIMIC-IT, the first instruction dataset covering video content, is here

Jun 13, 2023 am 10:34 AM

In recent times, AI dialogue assistants have made considerable progress on language tasks. This improvement rests not only on the strong generalization ability of LLMs but also on instruction tuning, which fine-tunes the LLM on a range of tasks using diverse, high-quality instructions.

One possible reason instruction tuning achieves zero-shot performance is that it internalizes context, which matters especially when the user's input omits commonsense context. Through instruction tuning, the LLM gains a strong understanding of user intent and exhibits better zero-shot capability even on previously unseen tasks.

However, an ideal AI conversational assistant should be able to solve tasks involving multiple modalities, which requires a diverse, high-quality multimodal instruction-following dataset. For example, LLaVA-Instruct-150K is a commonly used vision-language instruction-following dataset, constructed from COCO images together with instructions and responses generated by GPT-4 from image captions and object bounding boxes. However, LLaVA-Instruct-150K has three limitations: limited visual diversity; single images as the only visual data; and context information consisting of language alone.

To help AI dialogue assistants break through these limitations, scholars from Nanyang Technological University in Singapore and Microsoft Research, Redmond proposed MIMIC-IT, a multimodal in-context instruction tuning dataset containing 2.8 million multimodal instruction-response pairs, covering a variety of real-life scenarios.


Paper address: https://arxiv.org/pdf/2306.05425.pdf

In addition, to generate instruction-response pairs efficiently, the researchers introduced "Sythus", an automated instruction-response annotation pipeline inspired by the self-instruct method. Sythus uses system messages, visual annotations, and in-context examples to guide a language model (GPT-4 or ChatGPT) to generate instruction-response pairs from visual context, targeting three fundamental capabilities of vision-language models: perception, reasoning, and planning. The English instructions and responses are also translated into seven other languages to support multilingual use.

It is particularly noteworthy that the researchers trained "Otter", a multimodal model based on OpenFlamingo, on MIMIC-IT. In both ChatGPT and human evaluations, Otter's multimodal capabilities surpass those of other recent vision-language models.

In a series of demos, the researchers showcased Otter's multimodal question-answering capabilities:

Q: Which tile should be played next?

A: The eight of circles. That way you keep the fours and fives while waiting for a straight.


Q: I want to land, what should I do?

A: Press the lower left button once to turn left, and pull the handbrake on the right when you want to land.


Q: What next?

A: Pass the ball to the center, run to the goal, and look for scoring opportunities.


Next, let’s look at the study details.

MIMIC-IT Dataset

The MIMIC-IT dataset consists of 2.8 million multimodal instruction-response pairs covering three basic capabilities: perception, reasoning, and planning. Each instruction is accompanied by a multimodal conversational context, enabling VLMs trained on MIMIC-IT to show strong proficiency in interactive instruction following and zero-shot generalization.


Compared with LLaVA, the features of MIMIC-IT include:

(1) Diverse visual scenes, including images and videos from different datasets covering general scenes, egocentric-view scenes, and indoor RGB-D images;

(2) Multiple images (or videos) as visual data;

(3) Multimodal context information, including multiple instruction-response pairs and multiple images or videos;

(4) Support for eight languages: English, Chinese, Spanish, Japanese, French, German, Korean, and Arabic.
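Concretely, each MIMIC-IT example can be thought of as an instruction-response pair bundled with its multimodal context. The following sketch illustrates that structure; the field names are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InstructionResponsePair:
    """One MIMIC-IT-style example: an instruction-response pair plus context."""
    instruction: str
    response: str
    visual_data: List[str]  # one or more image/video identifiers, unlike LLaVA's single image
    # Prior instruction-response pairs that form the in-context conversation.
    context_pairs: List["InstructionResponsePair"] = field(default_factory=list)
    language: str = "en"    # one of: en, zh, es, ja, fr, de, ko, ar

pair = InstructionResponsePair(
    instruction="What should I do next?",
    response="Pass the ball to the center and run toward the goal.",
    visual_data=["frame_001.jpg", "frame_002.jpg"],
    language="en",
)
```

Representing the context as a list of earlier pairs plus multiple visual items is what distinguishes this layout from a single-image, language-only-context dataset.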

The following figure further compares instruction-response pairs from the two datasets (the yellow boxes are from LLaVA):


As shown in Table 1, MIMIC-IT draws on seven datasets: COCO, Spot-the-diff (SD), ScanNetV2 (SN), Visual Storytelling (VIST), Dense Captions/Activity Captions (DC), TV Captions (TVC), and Ego4D (E4D). In the "Context" column, "lang." denotes language and "vis." denotes vision.

Sythus: an automated pipeline for instruction-response pair generation

The researchers also proposed Sythus (Figure 3), an automated pipeline for generating high-quality instruction-response pairs in multiple languages. Building on the framework proposed for LLaVA, they use ChatGPT to generate instruction-response pairs from visual content. To ensure the quality of the generated pairs, the pipeline prompts ChatGPT with system messages, visual annotations, and in-context examples. System messages define the expected tone and style of the generated instruction-response pairs, while visual annotations provide basic image information such as bounding boxes and image descriptions. In-context examples help ChatGPT learn in context.

Since the quality of the core set affects the subsequent data collection process, the researchers adopted a cold-start strategy to bootstrap the in-context examples before large-scale querying. During the cold-start phase, a heuristic approach prompts ChatGPT to collect in-context examples using only system messages and visual annotations; this phase ends only once satisfactory in-context examples have been identified. In the fourth step, once instruction-response pairs are obtained, the pipeline expands them into Chinese (zh), Japanese (ja), Spanish (es), German (de), French (fr), Korean (ko), and Arabic (ar). Further details can be found in Appendix C, and task-specific prompts in Appendix D.
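The prompt assembly described above can be sketched as a chat-style message list: a system message fixing tone and style, the cold-started in-context examples as alternating user/assistant turns, and the new visual annotation as the final query. This is a minimal illustration, not the paper's actual code; all names are assumptions:

```python
def build_sythus_style_prompt(system_message, in_context_examples, new_annotation):
    """Assemble a chat prompt in the spirit of the Sythus pipeline.

    system_message      -- defines the tone/style of generated pairs
    in_context_examples -- (visual_annotation, instruction_response) tuples
                           collected during the cold-start phase
    new_annotation      -- visual annotation (captions, bounding boxes) to query with
    """
    messages = [{"role": "system", "content": system_message}]
    # Each in-context example becomes a demonstration turn pair.
    for annotation, example_pair in in_context_examples:
        messages.append({"role": "user", "content": annotation})
        messages.append({"role": "assistant", "content": example_pair})
    # The new visual context for which a pair should be generated.
    messages.append({"role": "user", "content": new_annotation})
    return messages

msgs = build_sythus_style_prompt(
    "Generate concise instruction-response pairs about the scene.",
    [("Caption: a dog on a sofa.",
      "Instruction: What is on the sofa? Response: A dog.")],
    "Caption: two people playing chess.",
)
```

The resulting list could then be sent to ChatGPT or GPT-4 through a standard chat-completion call.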


Empirical Evaluation

The researchers then demonstrated various applications of the MIMIC-IT dataset and the potential capabilities of vision-language models (VLMs) trained on it. First, they introduced Otter, an in-context instruction tuning model developed on the MIMIC-IT dataset. They then explored various methods of training Otter on MIMIC-IT and discussed the many scenarios in which Otter can be used effectively.

Figure 5 shows examples of Otter's responses in different scenarios. Thanks to training on the MIMIC-IT dataset, Otter is capable of situational understanding and reasoning, in-context example learning, and acting as an egocentric visual assistant.


Finally, the researchers conducted a comparative analysis of the performance of Otter and other VLMs in a series of benchmark tests.

ChatGPT Evaluation

Table 2 below shows a broad evaluation of the perception and reasoning abilities of vision-language models using the MMAGIBench framework [43].


Human Evaluation

Multi-Modality Arena [32] uses the Elo rating system to evaluate the usefulness and consistency of VLM responses. Figure 6(b) shows that Otter demonstrates superior usefulness and consistency, achieving the highest Elo rating among recent VLMs.
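Elo-based ranking works like chess ratings: each pairwise human preference between two models' responses updates both models' scores. A minimal sketch of the standard update rule (not Multi-Modality Arena's actual implementation, and the K-factor of 32 is an assumption):

```python
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """One Elo update after a pairwise comparison between two models.

    score_a is 1.0 if A's response was preferred, 0.0 if B's was, 0.5 for a tie.
    """
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    # B's expected score is 1 - expected_a, and B's actual score is 1 - score_a.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated models; A's answer is preferred by the annotator.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

Because the two updates are symmetric, the total rating mass is conserved; over many comparisons the ratings converge toward a ranking of response quality.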

Few-shot in-context learning benchmark evaluation

Otter is fine-tuned from OpenFlamingo, a multimodal architecture designed for in-context learning. After fine-tuning on the MIMIC-IT dataset, Otter significantly outperforms OpenFlamingo on the COCO Captioning (CIDEr) [27] few-shot evaluation (see Figure 6(c)). As expected, fine-tuning also brings marginal performance gains on the zero-shot evaluation.
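Few-shot evaluation of this kind feeds the model k solved examples before the query. A rough sketch of how such an interleaved captioning prompt is laid out, using OpenFlamingo-style special tokens (`<image>`, `<|endofchunk|>`); the exact prompt template is an assumption here:

```python
def build_few_shot_prompt(support_captions, k=2):
    """Interleave k example captions before the query image.

    <image> marks where visual features are injected, and <|endofchunk|>
    separates examples, following OpenFlamingo's interleaved format.
    """
    prompt = ""
    for caption in support_captions[:k]:
        prompt += f"<image>Output: {caption}<|endofchunk|>"
    # The query image, whose caption the model must complete.
    prompt += "<image>Output:"
    return prompt

prompt = build_few_shot_prompt(["A dog on a sofa.", "Two people playing chess."])
```

The generated continuation after the final "Output:" is then scored against reference captions with CIDEr.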


Figure 6: Evaluation of ChatGPT video understanding.

Discussion

Limitations. Although the researchers iteratively improved the system messages and instruction-response examples, ChatGPT is prone to language hallucinations and may generate erroneous responses. In general, self-instruct data generation would benefit from more reliable language models.

Future work. Going forward, the researchers plan to support more embodied AI datasets, such as LanguageTable and SayCan. They are also considering using more trustworthy language models or generation techniques to improve the instruction set.
