A language model beats the diffusion model, achieving double SOTA in video and image generation!
This is the latest research result from Google and CMU.
Reportedly, this is the first time a language model has beaten diffusion models on the iconic ImageNet benchmark. The key component behind it is the visual tokenizer (video tokenizer), which maps pixel-space inputs into tokens suitable for LLM learning. The Google-CMU research team proposed MAGVIT-v2, which also surpasses the previous best visual tokenizer on two additional tasks.
## Large language model beats diffusion model
In visual generation, however, language models have long lagged behind diffusion models.
The team attributes this mainly to the lack of a good visual representation, one analogous to our natural language system, that can effectively model the visual world. Unlike with natural language, humans have not evolved an optimal vocabulary for the visual world, and this limits the visual generation capabilities of large language models.
Based on this judgment, the research makes three main contributions:

- A new visual tokenizer that achieves the best performance to date on visual generation, video compression, and action recognition. Building on the previous SOTA visual tokenizer MAGVIT (Masked Generative Video Transformer), it introduces two key designs: Lookup-Free Quantization (LFQ) and a joint image-video tokenizer.
- In video/image generation, it outperforms diffusion models on both ImageNet 512×512 and Kinetics-600.
- In video compression and action recognition, it also beats previous results.
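The LFQ idea mentioned above can be sketched in a few lines: instead of a nearest-neighbor search over a learned codebook, each latent dimension is binarized by its sign, and the token index is read directly from the resulting bit pattern, so the effective vocabulary grows to 2^d without storing codebook embeddings. This is a minimal illustrative sketch; the function name and return convention are assumptions, not the paper's actual code.

```python
def lfq_quantize(z):
    """Minimal Lookup-Free Quantization (LFQ) sketch.

    z: a list of latent values (one vector of dimension d).
    Each dimension is quantized independently to +1/-1 by its sign;
    the token index is the integer encoded by the resulting bits.
    """
    bits = [1 if x > 0 else 0 for x in z]          # per-dimension binary code
    q = [1.0 if b else -1.0 for b in bits]          # quantized latent vector
    index = sum(b << i for i, b in enumerate(bits)) # bit pattern -> token id
    return q, index

# For a 3-dim latent, the implied vocabulary size is 2**3 = 8 tokens.
q, idx = lfq_quantize([0.3, -1.2, 0.7])  # bits [1, 0, 1] -> index 5
```

Because quantization is just a sign function per dimension, there is no codebook-lookup bottleneck, which is what lets the vocabulary scale far beyond typical VQ codebook sizes.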
## One author is a Peking University alumnus
Lijun Yu is currently a doctoral student at the Language Technologies Institute in CMU's School of Computer Science, advised by Professor Alexander G. Hauptmann, and is also a Google Student Researcher. His research interests lie in multimodal foundation models, especially multi-task video generation. Before coming to CMU, he received a double bachelor's degree in computer science and economics from Peking University. The research team includes many other Chinese researchers as well.
Corresponding author Lu Jiang is currently a scientist at Google Research and an adjunct professor at CMU. His research focuses on multimodal big data, especially robust deep learning, generative AI, and multimodal foundation models.

Paper links:
https://arxiv.org/abs/2310.05737
https://magvit.cs.cmu.edu/v2/