
No more prompts: interact with a multimodal dialogue system using just your hands. iChat is here!

WBOY
Release: 2023-05-15 17:55:06

Xi Xiaoyao Technology Talk Original
Author | IQ has dropped all over the place

Recently, many teams have built new systems on top of the user-friendly ChatGPT, and quite a few have produced eye-catching results. The InternChat work emphasizes user-friendliness by interacting with the chatbot in ways beyond language (cursors and gestures) for multimodal tasks. The name InternChat is also interesting: it stands for interaction, nonverbal, and chatbots, and can be abbreviated as iChat. Unlike existing interactive systems that rely on language alone, iChat significantly improves the efficiency of communication between users and chatbots by adding pointing instructions. In addition, the authors provide a large vision-language model called Husky that can perform image captioning and visual question answering, and delivers impressive results against GPT-3.5-turbo with only 7 billion parameters.

However, due to the popularity of the demo website, the team has temporarily closed the experience page. Let's first get a feel for this work through the following video.

Paper title:
InternChat: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language

Paper link:
https://www.php.cn/link/7c9966afcc510cf5a40621d1d92bdaf1

Demo address:
https://www.php.cn/link/e355ad06c5a89f911fbb0aff2de52435

Project address:
https://www.php.cn/link/2d13d901966a8eaa7f9c943eba6a540b

Main features of the system

The authors provide some task screenshots on the project homepage, giving an intuitive look at the functions and effects of this interactive system:

(a) Remove obscured objects



(b) Interactive image editing


(c) Image generation


(d) Interactive visual question and answer


(e) Interactive image generation


(f) Video highlight explanation


Paper quick overview

Here we first introduce the two concepts mentioned in this article:

  • Vision-centric tasks: tasks that require a computer to understand what it sees in the world and react accordingly.
  • Communication in the form of non-verbal instructions: pointing actions such as cursors and hand gestures.
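To make the idea of non-verbal instructions concrete, here is a minimal sketch (not the paper's actual interface; the names `PointingInstruction` and `combine_instructions` are hypothetical) of how a cursor click might be paired with a language instruction into a single request:

```python
from dataclasses import dataclass

@dataclass
class PointingInstruction:
    """A non-verbal pointing action: (x, y) normalized to [0, 1]."""
    x: float
    y: float

def combine_instructions(click_px, image_size, text):
    """Pair a cursor click (pixel coordinates) with a language instruction.

    Normalizing the point makes the request independent of image
    resolution, so downstream perception models can consume it directly.
    """
    (cx, cy), (w, h) = click_px, image_size
    point = PointingInstruction(x=cx / w, y=cy / h)
    return {"point": point, "text": text}

# A click at the center of a 640x480 image plus a spoken/typed command:
request = combine_instructions((320, 240), (640, 480), "remove this object")
```

The point here is only that the pointing channel carries information (which object) that the language channel would otherwise have to describe verbosely.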


▲Figure 1 The overall architecture of iChat

iChat combines the advantages of pointing and language instructions to perform vision-centric tasks. As shown in Figure 1, this system consists of 3 main components:

  1. A perception unit that processes pointing instructions on images or videos;
  2. An LLM controller with an auxiliary control mechanism that can accurately parse language instructions;
  3. An open world toolkit that integrates HuggingFace's various online models, user-trained private models, and other applications (such as calculators and search engines).
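The three components above can be sketched as a simple pipeline. This is a toy illustration of the data flow, not the paper's implementation: `perception_unit`, `llm_controller`, and the toolkit entries are all hypothetical stand-ins (the real system uses segmentation models, an actual LLM, and HuggingFace tools).

```python
def perception_unit(point, image=None):
    """Stand-in: resolve a pointing instruction to an object mask id."""
    return f"mask_at_{point}"

# Open-world toolkit: action name -> callable (real system: online models,
# private models, calculators, search engines, ...)
TOOLKIT = {
    "remove": lambda mask: f"inpainted image without {mask}",
    "replace": lambda mask: f"image with {mask} replaced",
}

def llm_controller(text):
    """Stand-in for the LLM controller: map an instruction to a tool."""
    for action in TOOLKIT:
        if action in text.lower():
            return action
    raise ValueError(f"no tool found for: {text!r}")

def ichat_step(point, text, image=None):
    """One interaction: perception resolves the point, the controller
    parses the language, and the chosen tool performs the edit."""
    mask = perception_unit(point, image)
    action = llm_controller(text)
    return TOOLKIT[action](mask)
```

For example, `ichat_step((0.5, 0.5), "Please remove this")` routes the click through the perception stand-in and dispatches the "remove" tool, which is the same division of labor Figure 1 depicts.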

It can effectively operate on 3 levels, namely:

  1. Basic interaction;
  2. Language-guided interaction;
  3. Pointing-language enhanced interaction.

Thus, as shown in Figure 2, the system can still successfully perform complex interactive tasks where a pure-language system fails.


▲Figure 2 Advantages of the pointing-plus-language-driven interactive system

Experiment

First, let's look at combining language and non-language commands to improve communication with the interactive system. To demonstrate the advantage of this hybrid mode over pure language instructions, the research team conducted a user survey. Participants chatted with Visual ChatGPT and iChat and gave feedback on the experience. The results in Tables 1 and 2 show that iChat is more efficient and user-friendly than Visual ChatGPT.


▲Table 1 User survey of “Remove something”


▲Table 2 User survey of "Replace with something"

Summary

However, the system still has some limitations, including:

  • The degree of efficiency improvement iChat delivers depends on the quality and accuracy of its underlying open-source models; limitations or biases in those models can adversely affect iChat's performance.
  • As user interactions become more complex or the number of instances increases, the system needs to maintain accuracy and response time, which can be challenging for iChat.
  • In addition, current vision models and language models lack learnable collaboration, such as components that can be tuned with instruction data.
  • iChat may struggle with novel or unusual situations outside its training data, hurting performance.
  • Achieving seamless integration across different devices and platforms can be challenging because of varying hardware capabilities, software limitations, and accessibility requirements.

Several goals on the project homepage's plan list remain unachieved. Among them is Chinese-language interaction, which the editor tries first on every new dialogue system. At present the system probably does not support Chinese, and there seems to be no easy workaround: since most multimodal datasets are English-based, routing through English-Chinese translation wastes online resources and processing time. Chinese support will likely still take some time.
