Explore a new generation of small models that go beyond GPT 3.5.

Apr 27, 2023, 11:43 AM

At the end of last year, OpenAI released ChatGPT to the public. The technology immediately pushed AI-driven chatbots to the center of mainstream discourse and set off a new round of debate among researchers about how it could change business, education, and more.

Technology giants soon followed suit, investing in their own research teams so that their so-called "generative AI" technologies (systems that can produce conversational text, graphics, and so on) would be ready as well.

As is well known, ChatGPT is fine-tuned from the GPT-3.5 series of models, and many follow-up studies have appeared in its wake. But how does this new research compare with ChatGPT? In a recent paper, "Multimodal Chain-of-Thought Reasoning in Language Models," Amazon researchers proposed Multimodal-CoT, which incorporates visual features. With fewer than 1 billion parameters, this architecture performs well on the ScienceQA benchmark, scoring 16 percentage points higher than GPT-3.5 (75.17% → 91.68%) and even surpassing many humans.

A brief introduction to the ScienceQA benchmark: proposed by UCLA and the Allen Institute for Artificial Intelligence (AI2), it is the first multimodal scientific question-answering dataset with detailed explanations, and it is mainly used to test a model's multimodal reasoning ability. It is highly diverse, covering natural science, language science, and social science, and it places heavy demands on a model's logical reasoning.

Paper address: https://arxiv.org/abs/2302.00923

Project address: https://github.com/amazon-science/mm-cot

Let's take a look at how Amazon's language model surpasses GPT-3.5.

Multimodal-CoT with visual features

Large language models (LLMs) perform well on complex reasoning tasks, and this performance is hard to achieve without the help of chain-of-thought (CoT) prompting. However, existing CoT research focuses only on the language modality. To trigger CoT reasoning in a multimodal setting, one possible solution is to fine-tune a small language model to perform CoT reasoning by fusing visual and language features.

However, it has been observed that small models make things up more often than large models, a behavior commonly called "hallucination." A previous Google study ("Chain-of-Thought Prompting Elicits Reasoning in Large Language Models") also showed that CoT-based prompting is only useful when the model has at least 100 billion parameters.

In other words, CoT prompting does not improve the performance of small models; gains only appear with models of roughly 100B parameters.

This paper, however, achieves its improvements with fewer than 1 billion parameters. How? In short, it proposes Multimodal-CoT, which incorporates visual features, and uses this paradigm to elicit CoT reasoning across multiple modalities.

Multimodal-CoT fuses visual features within a single training framework to reduce the language model's tendency to produce hallucinated reasoning patterns. Overall, the framework divides the reasoning process into two parts: rationale generation (finding the reasons) and answer inference (finding the answer).

The two-stage Multimodal-CoT process: text (the question and its context) and visual features are used to generate the rationale, which is then used to infer the answer.
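
To make the two-stage split concrete, the sketch below shows one way the inputs and targets of the two stages could be assembled; the prompt template and function names are illustrative assumptions, not the paper's exact format.

def build_stage1_input(question, context, options):
    # Stage 1 (rationale generation): the model sees the question, its context,
    # and the answer options, and is trained to output a rationale.
    opts = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"Question: {question}\nContext: {context}\nOptions: {opts}\nSolution:"

def build_stage2_input(question, context, options, rationale):
    # Stage 2 (answer inference): the rationale generated in stage 1 is appended
    # to the original input, and the model is trained to output the answer.
    return build_stage1_input(question, context, options) + f" {rationale}\nAnswer:"

In both stages the model also receives the DETR visual features of the associated image through the fusion mechanism described below.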

Dataset

The paper focuses on the ScienceQA dataset, which includes images and text as part of the context. The dataset also contains explanations of the answers, so the model can be fine-tuned to generate CoT rationales. In addition, the paper uses the DETR model to extract visual features.
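
As a rough illustration of what "DETR visual features" means here, the snippet below extracts features with a pretrained DETR checkpoint from the Hugging Face transformers library (a reasonably recent version is assumed); the checkpoint name and preprocessing are illustrative and may differ from the paper's pipeline, which precomputes these features.

import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrModel

# Illustrative checkpoint; the paper's exact DETR variant may differ.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detr = DetrModel.from_pretrained("facebook/detr-resnet-50")

image = Image.open("example.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = detr(**inputs)

# One feature vector per object query; this plays the role of X_vision below.
vision_features = outputs.last_hidden_state  # shape: (1, 100, 256)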

Smaller LMs are prone to hallucination when generating CoT rationales. The authors conjecture that with a modified architecture in which the model can use both the textual features generated by the LM and the visual features generated by an image model, it would be better able to give reasons and answer questions.

Architecture

In general, we need a model that can produce text features and visual features and use them to generate a textual response.

There also needs to be some interaction between the text and visual features, essentially a kind of co-attention mechanism, which helps encapsulate the information present in both modalities and makes it possible to learn from the rationale. To accomplish all this, the authors chose the T5 model, which has an encoder-decoder architecture, and, as mentioned above, the DETR model is used to produce the visual features.

The T5 encoder is responsible for producing the text features, but the T5 decoder does not consume the encoder's text features directly; instead, it consumes the output of the co-attention-styled interaction layer proposed by the authors.

Breaking this down, assume H_language is the output of the T5 encoder and X_vision is the output of DETR. The first step is to ensure that the visual and textual features have the same hidden size so that the attention layer can be applied.

Note: all code snippets below are taken from the paper's GitHub repository: https://github.com/amazon-science/mm-cot/blob/main/model.py

# W_h: project the DETR patch features (patch_dim) to the T5 hidden size (d_model)
self.image_dense = nn.Linear(self.patch_dim, config.d_model)

W_h is essentially a linear layer, and H_vision corresponds to the final visual features. W_h helps change the size of the visual features to match the size of the text features.
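
For context, applying the projection is a single call; the variable names below are illustrative, chosen to line up with the X_vision / H_vision notation above.

# X_vision: DETR output of shape (batch, num_patches, patch_dim)
# H_vision: projected features of shape (batch, num_patches, d_model)
H_vision = self.image_dense(X_vision)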

Next we need to add an attention layer so that visual and textual features can interact with each other. To do this, the authors use a single-head attention layer with H_language as the query vector and H_vision as the key and value vectors.

# Single-head attention: the text features attend to the visual features
# (query = H_language, key/value = H_vision)
self.mha_layer = torch.nn.MultiheadAttention(embed_dim=config.hidden_size,
                                             kdim=config.hidden_size, vdim=config.hidden_size,
                                             num_heads=1, batch_first=True)

# hidden_states: T5 encoder output (H_language); image_embedding: projected DETR features (H_vision)
image_att, _ = self.mha_layer(hidden_states, image_embedding, image_embedding)

Now we have an embedding that contains information from text and visual features. The authors then utilize gated fusion to generate a final set of features that will be sent to the decoder. There are two steps to gated fusion:

  1. Obtain a vector of scores between 0 and 1 that determines the importance of each attention feature.
  2. Use the scores to fuse the text and attention features.

The gated fusion step (reconstructed from the paper's equations): λ = Sigmoid(W_l · H_language + W_v · H_vision^attn), and H_fuse = (1 − λ) · H_language + λ · H_vision^attn.

W_l and W_v are essentially two linear layers.

self.gate_dense = nn.Linear(2 * config.hidden_size, config.hidden_size)
self.sigmoid = nn.Sigmoid()

# In the implementation, the two linear layers are folded into a single layer applied
# to the concatenation of the text features and the attended visual features
hidden_states = encoder_outputs[0]
merge = torch.cat([hidden_states, image_att], dim=-1)
gate = self.sigmoid(self.gate_dense(merge))
hidden_states = (1 - gate) * hidden_states + gate * image_att

Finally, the fused features are passed to the decoder.

# The fused features serve as the cross-attention memory of the T5 decoder
decoder_outputs = self.decoder(input_ids=decoder_input_ids,
                               attention_mask=decoder_attention_mask,
                               inputs_embeds=decoder_inputs_embeds,
                               past_key_values=past_key_values,
                               encoder_hidden_states=hidden_states)  # further arguments omitted in the original snippet

This is essentially the architecture the authors follow. Remember, though, that there are two stages: the first generates the rationale/CoT, and the second uses the CoT produced in the first stage to generate the answer, as shown in the figure above.
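
Putting the two stages together, inference could look roughly like the sketch below. Here rationale_model and answer_model stand for the two fine-tuned T5-based models, and passing the visual features via an image_ids keyword is an assumption made for illustration; the repository's actual API may differ.

def multimodal_cot_infer(question_text, vision_features,
                         rationale_model, answer_model, tokenizer):
    # Stage 1: generate the rationale from the question text + visual features.
    stage1_ids = tokenizer(question_text, return_tensors="pt").input_ids
    rationale_ids = rationale_model.generate(stage1_ids, image_ids=vision_features)
    rationale = tokenizer.decode(rationale_ids[0], skip_special_tokens=True)

    # Stage 2: append the rationale to the input and generate the final answer.
    stage2_ids = tokenizer(f"{question_text} {rationale}", return_tensors="pt").input_ids
    answer_ids = answer_model.generate(stage2_ids, image_ids=vision_features)
    return tokenizer.decode(answer_ids[0], skip_special_tokens=True)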

Results

The authors used the weights of the UnifiedQA model as the initialization point of the T5 model and fine-tuned it on the ScienceQA dataset. They observed that their Multimodal CoT method outperformed all previous baselines, including GPT-3.5.

What’s interesting is that even the base model with only 223 million parameters outperforms GPT-3.5 and other Visual QA models! This highlights the power of having a multimodal architecture.

The authors also show that their two-stage approach outperforms the single-stage approach.

Conclusion

The biggest takeaway from this paper is how powerful multimodal features, combining visual and textual information, can be when answering questions.

The authors show that by leveraging visual features, even a small language model (LM) can produce meaningful chains of thought with far fewer hallucinations, which reveals the role visual models can play in developing chain-of-thought-based learning techniques.

The experiments show that adding visual features at the cost of millions of parameters can bring more value than scaling a text-only model up to billions of parameters.
