Original article by Xi Xiaoyao Technology Talk
Author | IQ has dropped all over the place
Recently, many teams have built new work on top of the user-friendly ChatGPT, and a number of them have produced eye-catching results. InternChat emphasizes user-friendliness by letting users interact with the chatbot in ways that go beyond language (cursors and gestures) to carry out multimodal tasks. The name is interesting too: InternChat stands for interaction, nonverbal, and chatbots, and can be shortened to iChat. Unlike existing interactive systems that rely purely on language, iChat significantly improves the efficiency of communication between users and chatbots by adding pointing instructions. In addition, the authors provide a large vision-language model called Husky that can perform image captioning and visual question answering and, with only 7 billion parameters, can also impress GPT-3.5-turbo.
However, because the demo site proved so popular, the team has temporarily closed the trial page. Let's first get a sense of the work through the following video.
Paper title:
InternChat: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language
Paper link:
https://www.php.cn/link/7c9966afcc510cf5a40621d1d92bdaf1
Demo address:
https://www.php.cn/link/e355ad06c5a89f911fbb0aff2de52435
Project address:
https://www.php.cn/link/2d13d901966a8eaa7f9c943eba6a540b
The authors provide some task screenshots on the project homepage that give an intuitive view of the interactive system's functions and results:
(a) Remove obscured objects
(b) Interactive image editing
(c) Image generation
(d) Interactive visual question answering
(e) Interactive image generation
(f) Video highlight explanation
Here we first introduce the two concepts mentioned in this article:
▲Figure 1 The overall architecture of iChat
iChat combines the advantages of pointing and language instructions to perform vision-centric tasks. As shown in Figure 1, this system consists of 3 main components:
It can effectively operate on 3 levels, namely:
Thus, as shown in Figure 2, iChat can still successfully perform complex interactive tasks in cases where a purely language-based system cannot complete them.
▲Figure 2 Advantages of a pointing-and-language-driven interactive system
First, let's look at how combining language and non-language commands improves communication with interactive systems. To demonstrate the advantages of this hybrid mode over pure language instructions, the research team conducted a user survey. Participants chatted with both Visual ChatGPT and iChat and gave feedback on their experience. The results in Tables 1 and 2 show that iChat is more efficient and user-friendly than Visual ChatGPT.
▲Table 1 User survey of “Remove something”
▲Table 2 User survey of “Replace something with something”
However, the system still has some limitations, including:
The roadmap on the project homepage still lists several goals that have not yet been achieved. One of them is Chinese-language interaction, which this editor tries on every new dialogue system. For now, the system most likely does not support Chinese, and there seems to be no ready solution: most multimodal datasets are in English, and English-Chinese translation costs online resources and processing time. Chinese support will probably take some time yet.