Is fine-tuning useless for knowledge-based visual question answering? Google releases the AVIS search system: few-shot performance surpasses supervised PALI, and accuracy is tripled


WBOY

Aug 24, 2023 pm 07:21 PM


With the support of large language models (LLMs), significant progress has been made on multimodal tasks that combine language with vision, such as image captioning, visual question answering (VQA), and open-vocabulary object detection.

However, current visual language models (VLMs) essentially rely only on the visual information in the image, so they often perform poorly on question answering datasets that require external knowledge, such as Infoseek and OK-VQA.


Google recently released AVIS, a new method for autonomous visual information seeking. It uses a large language model (LLM) to dynamically devise a strategy for using external tools, including calling APIs, analyzing their outputs, and making decisions, in order to obtain the key knowledge needed for image question answering.


Please click the following link to read the paper: https://arxiv.org/pdf/2306.08129.pdf

AVIS mainly integrates three types of tools:

1. Tools for extracting visual information from images

2. A web search tool for retrieving open-world knowledge and facts

3. An image search tool that can be used to retrieve visually similar images


A planner based on a large language model then selects a tool and a query at each step, dynamically building up the answer to the question.

Simulating human decision-making

Many visual questions in the Infoseek and OK-VQA datasets are quite difficult even for humans and usually require the help of various external tools, so the researchers first conducted a user study to observe how humans solve complex visual questions.


First, the researchers gave users a set of available tools, including PALI, PALM and web search. They then showed the input image, the question, the detected object crops, the knowledge graph entities linked from image search results, similar image captions, related product titles, and image descriptions.

The researchers then recorded the users' actions and outputs, and used them to guide the system in two ways:

1. Build a transition graph by analyzing the sequence of decisions made by users. The graph contains different states, each with a different set of available actions.


AVIS transition graph

For example, in the start state, the system can take only three actions: PALI captioning, PALI VQA, or object detection.

2. Use the examples of human decision-making as relevant in-context examples to guide the planner and the reasoner, improving the system's performance and effectiveness.
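The transition graph described above can be pictured as a mapping from states to the actions allowed in them. The following is only a minimal sketch: the `START` entry follows the article, while every other state name, action name, and the `allowed_actions` helper are hypothetical placeholders rather than the paper's actual graph.

```python
# States and actions are plain strings. Only the START entry follows the
# article; the other states and actions are hypothetical placeholders.
TRANSITION_GRAPH = {
    "START": {"pali_caption", "pali_vqa", "object_detection"},
    "OBJECT_SELECTED": {"image_search", "pali_vqa"},  # hypothetical
    "SEARCHED": {"web_search", "pali_vqa"},           # hypothetical
}


def allowed_actions(state: str) -> set[str]:
    """Return a copy of the actions the graph permits in the given state."""
    return set(TRANSITION_GRAPH.get(state, set()))
```

Restricting each state to a small action set is what keeps the planner's choice manageable at every step.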

Overall Framework

AVIS adopts a dynamic decision-making strategy designed to respond to visual information-seeking queries.

The system consists of three main components:

1. A planner, which determines the next action, including the appropriate API call and the query it needs to process.

2. A working memory, which retains information about the results obtained from API execution.

3. A reasoner, which processes the outputs of API calls and determines whether the information obtained is sufficient to produce the final response or whether additional retrieval is required.
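As a rough illustration of the second component, the working memory can be thought of as an append-only log of tool results. This is a sketch under assumptions: the class names, field names, and methods below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ToolResult:
    """One recorded tool call (hypothetical structure)."""
    tool: str
    query: str
    output: str


@dataclass
class WorkingMemory:
    """Keeps the results returned by earlier API calls."""
    results: list[ToolResult] = field(default_factory=list)

    def add(self, tool: str, query: str, output: str) -> None:
        self.results.append(ToolResult(tool, query, output))

    def tools_used(self) -> set[str]:
        """Names of the tools that have already been called."""
        return {r.tool for r in self.results}
```

Both the planner and the reasoner can read from such a log: the planner to avoid repeating itself, the reasoner to judge whether enough has been gathered.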

Each time a decision is needed about which tool to use and which query to send, the planner goes through a series of steps. Based on the current state, it first lists the potential follow-up actions.

Because the space of potential actions may be too large to search, the planner consults the transition graph to eliminate irrelevant actions, and also excludes actions that have already been taken and are stored in working memory.
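The pruning step described above can be sketched as a simple filter: start from the actions the transition graph allows in the current state, then drop those already recorded in working memory. The function name, the dictionary layout of the graph, and the shape of the memory entries are assumptions made for this illustration.

```python
def candidate_actions(state, transition_graph, working_memory):
    """Actions allowed in `state` that have not been taken yet.

    transition_graph: dict mapping a state name to a list of action names.
    working_memory: list of dicts, each with at least an "action" key.
    Both layouts are hypothetical.
    """
    allowed = transition_graph.get(state, [])
    already_taken = {entry["action"] for entry in working_memory}
    # Preserve the graph's ordering so the result is deterministic.
    return [action for action in allowed if action not in already_taken]
```

For example, if the start state allows captioning, VQA, and object detection, and VQA has already been tried, only captioning and object detection remain as candidates.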


The planner then assembles a set of in-context examples from the user study data. Combined with the records of previous tool interactions, these are used to formulate a prompt, which is fed to the language model. The LLM returns a structured answer that determines the next tool to activate and the query to dispatch.

This design allows the planner to be called multiple times throughout the process, driving dynamic decision-making and gradually building up the answer.
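One way to picture this prompt-and-reply step is shown below. It is a sketch only: the prompt wording, the JSON reply format, and both function names are assumptions, since the article does not specify how AVIS formats its prompts or its structured answers.

```python
import json


def build_planner_prompt(question, examples, history, actions):
    """Assemble a planner prompt (hypothetical wording and layout)."""
    lines = ["Choose the next tool for answering a visual question."]
    for example in examples:          # in-context examples from the user study
        lines.append(f"Example: {example}")
    for tool, output in history:      # records of previous tool interactions
        lines.append(f"Previous: {tool} -> {output}")
    lines.append(f"Available tools: {', '.join(actions)}")
    lines.append(f"Question: {question}")
    lines.append('Reply as JSON with the keys "tool" and "query".')
    return "\n".join(lines)


def parse_planner_reply(reply, actions):
    """Parse the assumed JSON reply and check the tool is allowed."""
    decision = json.loads(reply)
    if decision["tool"] not in actions:
        raise ValueError(f"tool not allowed here: {decision['tool']}")
    return decision["tool"], decision["query"]
```

Validating the chosen tool against the allowed list guards against the model proposing an action that the current state does not permit.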


The researchers use a reasoner to analyze the output of tool execution, extract useful information, and classify the tool output into one of three categories: informative, uninformative, or final answer.

If the reasoner classifies the output as a final answer, the system outputs it directly as the final result and ends the task. If the output is uninformative, control returns to the planner, which selects another action based on the current state. If the reasoner finds the tool output informative, the state is updated and control passes back to the planner to make a new decision in the new state.
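The three-way control flow described above can be sketched as a loop. Everything here is illustrative: the planner, tool runner, and reasoner are passed in as plain callables, the state naming is hypothetical, and the stub functions merely mimic the fungus example from the article rather than call any real model or API.

```python
def run_avis_loop(plan, run_tool, reason, max_steps=10):
    """Alternate planner and reasoner until a final answer is found."""
    state, memory, tried = "START", [], []
    for _ in range(max_steps):
        tool, query = plan(state, memory, tried)
        tried.append(tool)
        verdict, payload = reason(run_tool(tool, query))
        if verdict == "final_answer":
            return payload                 # end the task
        if verdict == "informative":
            memory.append((tool, payload))
            state = f"after_{tool}"        # hypothetical state naming
        # "uninformative": keep the state; the planner picks another action
    return None


# Stub components for illustration only.
def plan(state, memory, tried):
    for tool in ["select_leaf", "select_fungus", "web_search"]:
        if tool not in tried:
            return tool, "what genus is this?"
    return "web_search", "what genus is this?"


def run_tool(tool, query):
    outputs = {"select_leaf": "a leaf",
               "select_fungus": "false turkey tail",
               "web_search": "Stereum"}
    return outputs[tool]


def reason(output):
    if output == "a leaf":
        return "uninformative", None
    if output == "Stereum":
        return "final_answer", output
    return "informative", output


answer = run_avis_loop(plan, run_tool, reason)
```

With these stubs, the first action is judged uninformative, the second updates the state, and the third produces the final answer.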


AVIS adopts a dynamic decision-making strategy to respond to visual information search queries

Experimental results

Tool set

Image captioning model: uses the PALI 17B model to generate captions for the input image and for the detected object crops.

Visual question answering model: uses the PALI 17B VQA model, which takes an image and a question as input and outputs a text-based answer.

Object detection: uses an object detector trained on a superset of the Open Images dataset, with the specific categories provided by the Google Lens API. A high confidence threshold is applied so that only the top-ranked detection boxes in the input image are kept.

Image search: uses Google Image Search to obtain information related to the image crop of a detected box.

When making decisions, the planner treats the use of each piece of information as a separate action, because each piece may contain hundreds of tokens and requires complex processing and reasoning.

OCR: in some cases, images contain text, such as street names or brand names. The optical character recognition (OCR) feature of the Google Lens API is used to extract this text.

Web search: uses the Google Search API, which takes a text query as input and returns links and snippets of relevant documents, along with a knowledge graph panel containing a direct answer and up to five questions related to the input query.
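The filtering applied in the object detection step, a high confidence threshold followed by keeping the top-ranked boxes, can be sketched as follows. The threshold of 0.8, the limit of three boxes, the function name, and the dictionary layout of a detection are all assumptions for illustration; the article gives no concrete values.

```python
def keep_top_detections(detections, min_score=0.8, top_k=3):
    """Keep confident detections only, highest score first.

    detections: list of dicts with "label" and "score" keys (assumed layout).
    min_score and top_k are hypothetical defaults.
    """
    confident = [d for d in detections if d["score"] >= min_score]
    ranked = sorted(confident, key=lambda d: d["score"], reverse=True)
    return ranked[:top_k]
```

Dropping low-confidence boxes early limits how many crops the later image search and captioning steps have to process.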

Experimental results

The researchers evaluated the AVIS framework on the Infoseek and OK-VQA datasets. The results show that even very strong visual language models, such as OFA and PALI, fail to achieve high accuracy after fine-tuning on the Infoseek dataset.


Without any fine-tuning, the AVIS method achieved an accuracy of 50.7% on this dataset.

On the OK-VQA dataset, AVIS achieved an accuracy of 60.2% in the few-shot setting, second only to the fine-tuned PALI model.


Most question-answer examples in OK-VQA rely on common-sense knowledge rather than fine-grained knowledge, which probably explains the performance gap: PALI can exploit the common knowledge encoded in its model parameters without relying on external knowledge.


A key feature of AVIS is its ability to make decisions dynamically rather than execute a fixed sequence. The examples above show the flexibility with which AVIS uses different tools at different stages.

It is worth noting that the reasoner design in the paper enables AVIS to identify irrelevant information, go back to the previous state, and repeat the search.

For example, in the second example, about fungal taxonomy, AVIS initially made the wrong decision by selecting a leaf object. After the reasoner found it irrelevant to the question, it prompted AVIS to re-plan. AVIS then successfully selected the object related to the false turkey tail fungus and arrived at the correct answer, Stereum.

Conclusion

The researchers proposed a new approach, AVIS, which uses an LLM as a central hub that orchestrates a variety of external tools to answer knowledge-intensive visual questions.

In this approach, the researchers anchor the system in human decision-making data collected from a user study, adopt a structured framework, and use an LLM-based planner to dynamically decide on tool selection and query formation until all the information needed to answer the visual question has been gathered.
