The following is the transcript of Professor Lu Zhiwu's speech at the ChatGPT and Large Model Technology Conference held by Heart of the Machine. Heart of the Machine has edited and organized it without changing the original meaning:
Hello everyone, I am Lu Zhiwu from Renmin University of China. The title of my report today is "Important Insights from ChatGPT for Multimodal Generative Models", and it consists of four parts.
First of all, ChatGPT offers some inspiration regarding the innovation of research paradigms. The first point is the use of "large models plus big data", a research paradigm that has been verified repeatedly and is also the basic research paradigm behind ChatGPT. It is particularly important to emphasize that only when a model reaches a certain scale does it exhibit emergent capabilities, such as in-context learning and chain-of-thought (CoT) reasoning. These capabilities are remarkable.
The second point is to insist on reasoning with large models. This is the aspect of ChatGPT that impressed me most, because in machine learning and artificial intelligence, reasoning is widely recognized as the hardest problem, and ChatGPT has made a breakthrough here. Of course, ChatGPT's reasoning ability may come mainly from training on code, but whether there is a necessary connection is not yet certain. On the reasoning front, we should put more effort into figuring out where this ability comes from, and whether there are other training methods that can further enhance it.
The third point is that the large model must be aligned with humans. This is the most important lesson ChatGPT offers from an engineering or deployment perspective. If a model is not aligned with humans, it will generate a lot of harmful information and become unusable. This third point does not raise the upper limit of the model's capability, but the reliability and safety of the model are indeed very important.
The advent of ChatGPT has had a great impact on many fields, and on me personally. Because I have been working on multimodality for several years, I have started to reflect on why we have not built such a powerful model.
ChatGPT is a general-purpose generative model for language and text. Let's take a look at the latest progress toward general-purpose generation in the multimodal field. Multimodal pre-trained models have begun to evolve into multimodal general-purpose generative models, and there have already been some preliminary explorations. First, let's look at the Flamingo model proposed by DeepMind in 2022. The figure below shows its model structure.
The main body of the Flamingo architecture is the decoder of a large language model, shown as the blue modules on the right side of the figure above. Adapter layers are inserted between the blue modules, and a Vision Encoder and a Perceiver Resampler are added on the visual side on the left. The overall design is to encode the visual input, pass it through the adapters, and align it with the language model, so that the model can automatically generate text descriptions for images.
What are the benefits of such an architectural design? First of all, the blue modules in the figure above are frozen, including the language model decoder, while the number of parameters in the pink modules is controllable, so the number of parameters Flamingo actually trains is very small. So do not assume that multimodal general-purpose generative models are prohibitively hard to build; the situation is not that pessimistic. The trained Flamingo model can handle many common text-generation tasks with multimodal input, such as video description, visual question answering, and multimodal dialogue. From this perspective, Flamingo can be regarded as a general-purpose generative model.
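As a rough illustration of the frozen-decoder-plus-adapter design described above, here is a minimal PyTorch sketch. The module sizes, the simplified PerceiverResampler and gated cross-attention adapter, and the stand-in decoder blocks are all hypothetical, not Flamingo's actual implementation; they only show why the trainable parameter count stays small.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compress a variable number of visual features into a fixed set of latents."""
    def __init__(self, dim=512, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_feats):                       # (B, N, dim)
        q = self.latents.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        out, _ = self.attn(q, visual_feats, visual_feats)  # latents attend to vision
        return out                                         # (B, num_latents, dim)

class GatedCrossAttentionAdapter(nn.Module):
    """Small trainable adapter inserted between frozen decoder blocks."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))           # starts as identity

    def forward(self, text_hidden, visual_latents):
        attended, _ = self.attn(text_hidden, visual_latents, visual_latents)
        return text_hidden + torch.tanh(self.gate) * attended

class FlamingoLikeModel(nn.Module):
    def __init__(self, frozen_decoder_blocks, dim=512):
        super().__init__()
        self.blocks = frozen_decoder_blocks                # pretrained, kept frozen
        for p in self.blocks.parameters():
            p.requires_grad = False
        self.resampler = PerceiverResampler(dim)
        self.adapters = nn.ModuleList(
            GatedCrossAttentionAdapter(dim) for _ in self.blocks
        )

    def forward(self, text_hidden, visual_feats):
        latents = self.resampler(visual_feats)
        for block, adapter in zip(self.blocks, self.adapters):
            text_hidden = adapter(text_hidden, latents)    # inject vision (trained)
            text_hidden = block(text_hidden)               # frozen LM block
        return text_hidden

# Stand-in decoder blocks; only the resampler and adapters are trainable.
blocks = nn.ModuleList(nn.TransformerEncoderLayer(512, 8, batch_first=True) for _ in range(4))
model = FlamingoLikeModel(blocks)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```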
The second example is the recently released BLIP-2 model, which improves on BLIP. Its architecture is very similar to Flamingo's: the image encoder and the large language model decoder are both frozen, and a Q-Former is added in between to act as a converter from vision to language. So the part of BLIP-2 that actually needs training is the Q-Former.
As shown in the figure below, an image (on the right) is first fed into the Image Encoder. The Text in the middle is the question or instruction from the user. The image features extracted by the Q-Former, together with the text, are fed into the large language model, which finally generates the answer. That is roughly the generation process.
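Here is a minimal sketch of this kind of pipeline: a frozen image encoder, a small trainable Q-Former with learned query tokens, and a frozen language model that receives the projected queries plus the text. The module sizes and the language_model.embed / language_model.generate interfaces are hypothetical placeholders, not BLIP-2's real API.

```python
import torch
import torch.nn as nn

class QFormer(nn.Module):
    """Learned queries that extract language-relevant features from the image."""
    def __init__(self, num_queries=32, dim=768, lm_dim=1024, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, lm_dim)      # map into the LM's embedding space

    def forward(self, image_feats):                          # (B, N, dim)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(out)                                # (B, num_queries, lm_dim)

def answer(image, text_ids, image_encoder, qformer, language_model):
    """Frozen encoder -> trainable Q-Former -> frozen LM -> generated answer."""
    with torch.no_grad():
        image_feats = image_encoder(image)                   # frozen
    visual_tokens = qformer(image_feats)                     # the only trained part
    text_embeds = language_model.embed(text_ids)             # hypothetical API
    inputs = torch.cat([visual_tokens, text_embeds], dim=1)
    with torch.no_grad():
        return language_model.generate(inputs)               # hypothetical API
```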
The shortcomings of these two models are obvious. Because they appeared relatively early, or have only just appeared, they do not incorporate the engineering techniques used by ChatGPT; at the very least, there is no instruction fine-tuning for image-text dialogue or multimodal dialogue, so their overall generation quality is not satisfactory.
The third example is Kosmos-1, recently released by Microsoft. It has a very simple structure and is trained only on image-text pairs, so its multimodal data is relatively limited. The biggest difference from the two models above is that in those models the large language model is frozen, while in Kosmos-1 the large language model itself is trained. As a result, Kosmos-1 has only 1.6 billion parameters, and a model of that size may not exhibit emergent abilities. Moreover, Kosmos-1 does not consider instruction fine-tuning for image-text dialogue, so it sometimes talks nonsense.
The next example is Google's multimodal embodied visual-language model PaLM-E. PaLM-E is similar to the first three examples: it also combines a ViT visual encoder with a large language model. Its biggest breakthrough is that it finally explores the possibility of applying multimodal large language models to robotics. PaLM-E takes the first exploratory step, but the types of robot tasks it considers are quite limited, so it cannot yet be called truly general-purpose.
The last example is GPT-4. It gives particularly impressive results on standard datasets, often even better than the current SOTA models fine-tuned on those datasets. This may come as a shock, but it does not actually mean much. When we were building multimodal large models two years ago, we found that the capabilities of large models cannot be evaluated on standard datasets: good performance on standard datasets does not mean good results in actual use, and there is a big gap between the two. For this reason, I am slightly disappointed with the current GPT-4, as it only reports results on standard datasets. Moreover, the currently available GPT-4 is not the visual version but the pure text version.
The models above all perform general-purpose language generation, with multimodal input. The following two models are different: they handle not only general language generation but also visual generation, i.e., they can generate both language and images.
The first is Microsoft's Visual ChatGPT; let me briefly comment on it. The idea behind this model is very simple and is more of a product-design consideration. There are many kinds of visual generation models, as well as various visual detection models, and the inputs and instructions for these different tasks vary widely. The problem is how to cover all these tasks with one system, so Microsoft designed a Prompt Manager with OpenAI's ChatGPT at its core. ChatGPT essentially translates the instructions for different visual generation tasks: the user's questions are instructions described in natural language, and ChatGPT translates them into instructions that the machine, i.e., the individual visual models, can understand.
Visual ChatGPT does exactly this, so it is quite good from a product perspective, but there is nothing new from a model-design perspective. From a modeling standpoint, the overall system is a patchwork stitched together from existing models: there is no unified model training, so the different modalities do not promote each other. The reason we do multimodality at all is that we believe data from different modalities should help each other. Moreover, Visual ChatGPT does not consider instruction fine-tuning for multimodal generation; its instruction following relies entirely on ChatGPT itself.
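To make the routing idea concrete, here is a rough sketch of a prompt-manager-style dispatcher that asks a chat LLM to translate a natural-language request into a call to one of several visual tools. The tool names, the llm() callable, and the prompt format are all hypothetical; this is not Microsoft's actual implementation.

```python
import json

# Stub "tools" standing in for real text-to-image, captioning, and detection models.
TOOLS = {
    "text2image": lambda prompt: f"<image generated for: {prompt}>",
    "image_caption": lambda image_path: f"<caption for {image_path}>",
    "object_detection": lambda image_path: f"<detections for {image_path}>",
}

ROUTER_PROMPT = (
    "You may call exactly one of these tools: {tools}.\n"
    'Reply with JSON of the form {{"tool": "<name>", "argument": "<value>"}}.\n'
    "User request: {request}"
)

def route(request, llm):
    """Ask the LLM which tool to invoke and with what argument, then run it."""
    reply = llm(ROUTER_PROMPT.format(tools=list(TOOLS), request=request))
    choice = json.loads(reply)                 # e.g. {"tool": "text2image", ...}
    return TOOLS[choice["tool"]](choice["argument"])

# Example with a stub LLM standing in for ChatGPT:
fake_llm = lambda prompt: json.dumps({"tool": "text2image", "argument": "a red fox"})
print(route("Please draw a red fox", fake_llm))
```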
The next example is the UniDiffuser model released by Professor Zhu Jun's team at Tsinghua University. From an academic perspective, this model can truly take multimodal input and generate both text and visual content. This is thanks to their transformer-based backbone U-ViT, which plays a role similar to that of the U-Net at the core of Stable Diffusion, and which unifies image generation and text generation within one framework. The work itself is very meaningful, but it is still at an early stage: for example, it only considers captioning and VQA tasks, does not consider multi-turn dialogue, and does not perform instruction fine-tuning for multimodal generation.
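A simplified, hypothetical sketch of the "one framework for both modalities" idea: a single transformer predicts diffusion noise for image latents and text embeddings jointly, with a separate timestep per modality. The sizes and the JointDenoiser module are invented for illustration and do not reflect UniDiffuser's actual code.

```python
import torch
import torch.nn as nn

class JointDenoiser(nn.Module):
    def __init__(self, img_tokens=64, txt_tokens=32, dim=512):
        super().__init__()
        self.img_tokens, self.txt_tokens = img_tokens, txt_tokens
        self.time_embed = nn.Embedding(1000, dim)          # one embedding per timestep
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_img, noisy_txt, t_img, t_txt):
        # Concatenate both modalities plus their timestep tokens into one sequence.
        t_tokens = torch.stack([self.time_embed(t_img), self.time_embed(t_txt)], dim=1)
        seq = torch.cat([t_tokens, noisy_img, noisy_txt], dim=1)
        h = self.out(self.backbone(seq))
        img_eps = h[:, 2:2 + self.img_tokens]              # predicted image noise
        txt_eps = h[:, 2 + self.img_tokens:]               # predicted text noise
        return img_eps, txt_eps

# Setting the text timestep to 0 (clean text) corresponds to text-to-image,
# setting the image timestep to 0 to image-to-text, and nonzero timesteps for
# both to joint generation.
model = JointDenoiser()
img, txt = torch.randn(2, 64, 512), torch.randn(2, 32, 512)
eps_i, eps_t = model(img, txt, torch.randint(0, 1000, (2,)), torch.randint(0, 1000, (2,)))
```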
Having commented on so much other work, I should mention that we have also built a product called ChatImg, as shown in the figure below. Broadly speaking, ChatImg consists of an image encoder, a multimodal image-text encoder, and a text decoder. It is similar to Flamingo and BLIP-2, but we consider more scenarios, and there are differences in the implementation details.
One of the biggest advantages of ChatImg is that it can accept video input. We pay special attention to multimodal general-purpose generation, including text generation, image generation, and video generation. We hope to implement a variety of generation tasks within this framework, and ultimately to support text-to-video generation.
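Purely as an illustration of the three-stage pipeline described above (image encoder, multimodal image-text encoder, text decoder), here is a hypothetical sketch in which video is handled by encoding sampled frames. None of these modules reflect the actual ChatImg implementation; they only show the overall data flow.

```python
import torch
import torch.nn as nn

class ChatImgLikePipeline(nn.Module):
    def __init__(self, dim=512, vocab=32000):
        super().__init__()
        self.image_encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(dim))
        fusion_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.multimodal_encoder = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.text_decoder = nn.Linear(dim, vocab)          # stand-in for a real decoder

    def encode_visual(self, frames):                       # (B, T, C, H, W); T=1 for images
        b, t = frames.shape[:2]
        feats = self.image_encoder(frames.flatten(0, 1))   # encode each frame
        return feats.view(b, t, -1)                        # (B, T, dim)

    def forward(self, frames, text_embeds):                # text_embeds: (B, L, dim)
        visual = self.encode_visual(frames)
        fused = self.multimodal_encoder(torch.cat([visual, text_embeds], dim=1))
        return self.text_decoder(fused)                    # per-token vocabulary logits

model = ChatImgLikePipeline()
video = torch.randn(1, 8, 3, 32, 32)                       # 8 sampled frames
logits = model(video, torch.randn(1, 16, 512))
```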
Second, we pay special attention to real user data. We hope to continuously optimize the generation model itself and improve its capabilities after obtaining real user data, so we released the ChatImg application.
The following pictures show some of our test examples. Although ChatImg is an early model and still has shortcomings, in general it can understand pictures. For example, ChatImg can generate descriptions of paintings in conversation and can also do some in-context learning.
The first example in the picture above describes the painting "Starry Night". In its description, ChatImg said that Van Gogh was an American painter; when you point out that this is wrong, it corrects itself immediately. In the second example, ChatImg makes physical inferences about the objects in the picture. The third example is a photo I took myself: there are two rainbows in it, and ChatImg recognized both accurately.
We noticed that the third and fourth examples in the picture above involve emotional matters. This is actually related to the work we plan to do next: we want to connect ChatImg to a robot. Today's robots are usually passive, with all instructions preset, which makes them seem very rigid. We hope that a robot connected to ChatImg can actively communicate with people. How can this be done? First, the robot must be able to perceive people, whether that means objectively observing the state of the world and people's emotions, or obtaining feedback; then the robot can understand people and communicate with them proactively. From these two examples, I feel that this goal is achievable.
Finally, let me summarize today's report. First of all, ChatGPT and GPT-4 have brought innovation to the research paradigm, and all of us should actively embrace this change. We cannot complain or make the excuse that we have no resources; as long as we face this change, there are always ways to overcome the difficulties. Multimodal research does not necessarily require machines with hundreds of GPUs; with the right strategies, good work can be done with a small number of machines. Second, existing multimodal generative models all have their own problems. GPT-4 does not yet have an open visual version, so there is still an opportunity for all of us. Moreover, I think GPT-4 leaves one question unanswered: what should a multimodal generative model ultimately look like? It does not give a definitive answer (in fact, no details of GPT-4 have been revealed). This is actually a good thing: people all over the world are very smart and everyone has their own ideas, which may create a new research landscape where a hundred flowers bloom. That's it for my speech, thank you all.