Built entirely in-house by the Zhipu AI large model team.
Ever since Kuaishou's Kling AI took off at home and abroad, domestic video generation has become as fiercely competitive as text LLMs were in 2023. Now yet another video generation product has officially launched: Zhipu AI has released "Qingying". Give it a good idea (anywhere from a few characters to a few hundred) and a little patience (30 seconds), and Qingying generates high-precision video at 1440×960 resolution.

Starting today, Qingying is live in the Qingyan app, where all users can experience conversation, image, video, code, and agent generation in one place. Beyond the Zhipu Qingyan web version and app, you can also use the "AI Dynamic Photo" mini-program to quickly animate the photos on your phone. Qingying's generated videos are 6 seconds long at 1440×960 resolution, and free for all users.
- Web access: https://chatglm.cn/
- Mobile access: https://chatglm.cn/download?fr=web_home
Zhipu AI says that as the technology continues to develop, Qingying's generation capabilities will soon be usable for short-video production, ad generation, and even film editing. In the development of generative video models, Scaling Law continues to operate on both the algorithm and data fronts. "We are actively exploring more efficient scaling approaches at the model level," Zhipu AI CEO Zhang Peng said at the Zhipu Open Day. "As algorithms and data keep iterating, we believe Scaling Law will continue to play a powerful role."
Judging from the current demos and some brief hands-on testing, Zhipu AI's Qingying has the following characteristics:
- It performs well on landscape, animal, sci-fi, and cultural/historical content;
- Its strongest styles include cartoon, photorealistic, and anime;
- By entity type, rendering quality roughly ranks: animals > plants > objects > buildings > people.
It supports both text-to-video and image-to-video generation, with styles extending to fantasy animation.
Prompt: Low angle, pushing in and slowly tilting up: a dragon suddenly appears on an iceberg, then spots you and charges toward you. Hollywood film style.
Prompt: A mage casts a spell amid the waves; a gem gathers the seawater and opens a magic portal.
Prompt: In a forest, human-eye view: towering trees block out the sun, with sunlight falling through gaps in the leaves, Tyndall effect.
Prompt: A capybara stands upright like a human, holding an ice cream cone and happily eating it.
In addition to text-to-video, you can also try image-to-video on Qingying. Image-to-video opens up more ways to play, including memes, ad production, plot creation, and short-video creation. At the same time, an "Old Photos Animated" mini-program built on Qingying is launching simultaneously: just upload an old photo, and AI will bring the moments frozen in it back to life.
Prompt: A colorful fish swimming freely.
Prompt: The man in the picture stands up, the wind blowing his hair.
Prompt: A little yellow duck toy floats on the surface of a swimming pool, close-up.
Prompt: The camera rotates around a bunch of old TVs showing different programs: 1950s sci-fi movies, horror movies, news, static, 1970s sitcoms, and more, set in a large gallery at a New York museum.
Prompt: Take out an iPhone and take a photo.
Zhipu AI can even extend your go-to memes into a whole "series".
Prompt: The master and his three disciples reach out and high-five each other, with confused expressions on their faces.
Prompt: The kitten opens its mouth wide, with a confused expression on its face and many question marks around it.

As you can see, Qingying handles a wide range of styles, and more ways to play with it are waiting to be discovered. Just open the "Qingying" feature in Zhipu Qingyan on the web or in the app, and any idea you have can be turned into reality in an instant.

Fully self-developed technology

Zhipu AI, which went all in on large models, began deploying multimodal generative AI models very early. Starting in 2021, it released a series of studies including CogView (NeurIPS '21), CogView2 (NeurIPS '22), CogVideo (ICLR '23), Relay Diffusion (ICLR '24), and CogView3 (2024). According to Zhipu AI, Qingying is powered by CogVideoX, a new-generation video generation model developed in-house by the Zhipu AI large model team. Earlier, the team had built the text-to-video model CogVideo on top of its text-to-image model CogView2, and subsequently open-sourced it.

CogVideo has 9.4 billion parameters. It first generates a series of initial frames with CogView2, then produces the full video by interpolating frames between those images with a bidirectional attention model. In addition, CogVideo can generate a 3D environment from text descriptions, directly reusing pretrained models to avoid expensive training, and it supports Chinese prompt input.

The base model behind Qingying is CogVideoX, which fuses the three dimensions of text, time, and space in a single architecture. Like Sora, it adopts a DiT (diffusion transformer) design. OpenAI's Sora brought a leap forward in AI video generation, but most models still struggle to produce video that is coherent and logically consistent. To address this, Zhipu AI independently developed an efficient 3D variational autoencoder (3D VAE) that compresses the original video space to 2% of its size, sharply reducing both the cost and the difficulty of model training. The structure uses causal 3D convolution as its main component and removes the attention modules commonly used in autoencoders, which allows the model to transfer to different resolutions. Meanwhile, causal convolution along the time dimension gives video encoding and decoding a strict front-to-back order, independent of future frames, which helps extend the model to higher frame rates and longer scenes through fine-tuning.
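To make the causal-convolution idea concrete, here is a minimal PyTorch sketch of a 3D convolution that is causal along the time axis, padding only on the "past" side so a frame's encoding never depends on later frames. It illustrates the general technique only; it is not Zhipu AI's CogVideoX code, and the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that is causal along the time axis.

    Frames are padded only on the "past" side in time, so the output at
    frame t never depends on frames after t. Spatial dims use symmetric
    padding as usual. A sketch of the building block described in the
    article, not Zhipu AI's actual implementation.
    """

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.kt = kernel_size  # temporal kernel size
        self.ks = kernel_size  # spatial kernel size
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, padding=0)

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        pad_s = self.ks // 2
        # F.pad order for 5-D input: (W_left, W_right, H_top, H_bottom, T_front, T_back).
        # All temporal padding goes to the front ("past"); none to the back.
        x = F.pad(x, (pad_s, pad_s, pad_s, pad_s, self.kt - 1, 0))
        return self.conv(x)

# Toy usage: a 16-frame 64x64 RGB clip.
video = torch.randn(1, 3, 16, 64, 64)
layer = CausalConv3d(3, 8)
out = layer(video)
print(out.shape)  # torch.Size([1, 8, 16, 64, 64])
```

Because nothing to the right of frame t is ever read, earlier frames can be encoded without waiting for later ones, which lines up with the article's point about extending to higher frame rates and longer scenes via fine-tuning.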
Video generation also faces another problem: most video data lacks accompanying descriptive text, or the descriptions are of low quality. To address this, Zhipu AI built an end-to-end video understanding model that generates detailed, content-faithful descriptions for massive amounts of video data, and used it to construct a large corpus of high-quality video-text pairs, making the trained model much better at following instructions.

Finally, it is worth mentioning that Zhipu AI developed a transformer architecture that fuses text, time, and space. Instead of the traditional cross-attention module, it concatenates text embeddings and video embeddings at the input stage, so the two modalities can interact more fully. Since text and video occupy very different feature spaces, the two are processed separately through expert adaptive layernorm, which lets the model use its parameters efficiently to better align visual information with semantic information.
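The sketch below illustrates that conditioning scheme in PyTorch: text and video tokens share one sequence (no cross-attention), while each modality gets its own timestep-conditioned scale and shift through a separate adaptive-layernorm "expert". All module names and dimensions are hypothetical; this is an illustration of the idea, not the CogVideoX implementation.

```python
import torch
import torch.nn as nn

class ExpertAdaLN(nn.Module):
    """Adaptive LayerNorm with separate scale/shift 'experts' per modality.

    Text and video tokens share one transformer stream (concatenated at the
    input), but each modality gets its own condition-dependent scale and
    shift, since their feature spaces differ. An illustrative sketch, not
    the CogVideoX implementation.
    """

    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # One (scale, shift) projection per modality "expert".
        self.text_mod = nn.Linear(cond_dim, 2 * dim)
        self.video_mod = nn.Linear(cond_dim, 2 * dim)

    def forward(self, tokens, cond, n_text):
        # tokens: (batch, n_text + n_video, dim) -- text tokens first, then video
        # cond:   (batch, cond_dim)              -- e.g. a diffusion-timestep embedding
        t_scale, t_shift = self.text_mod(cond).chunk(2, dim=-1)
        v_scale, v_shift = self.video_mod(cond).chunk(2, dim=-1)
        h = self.norm(tokens)
        text = h[:, :n_text] * (1 + t_scale.unsqueeze(1)) + t_shift.unsqueeze(1)
        video = h[:, n_text:] * (1 + v_scale.unsqueeze(1)) + v_shift.unsqueeze(1)
        return torch.cat([text, video], dim=1)

# Toy usage: 8 text tokens + 128 video patch tokens in one joint sequence.
text_emb = torch.randn(1, 8, 256)
video_emb = torch.randn(1, 128, 256)
tokens = torch.cat([text_emb, video_emb], dim=1)  # concatenated, no cross-attention
timestep_emb = torch.randn(1, 64)
layer = ExpertAdaLN(dim=256, cond_dim=64)
out = layer(tokens, timestep_emb, n_text=8)
print(out.shape)  # torch.Size([1, 136, 256])
```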
Zhipu AI states that through optimization, the inference speed of its generative video model has increased sixfold; the theoretical time the model needs to generate a 6-second video currently stands at 30 seconds.

With the launch of Qingying, a heavyweight player has again appeared on the video generation track. Beyond the product anyone can try, the Qingying API is simultaneously available on the bigmodel.cn open model platform, so enterprises and developers can call the model's text-to-video and image-to-video capabilities via API (a sketch of such a call closes this article). As company after company rolls out AI video generation features, this year's generative AI race has entered a white-hot stage.

For most users, this means more choice: people with no video production background and professional content creators alike can now create video with the help of large model capabilities.
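For developers curious what an API call might look like, here is a minimal hedged sketch using the zhipuai Python SDK (pip install zhipuai). The method and field names (videos.generations, retrieve_videos_result, task_status, video_result) reflect my best understanding of the SDK's video interface, not anything stated in this article; consult the official bigmodel.cn documentation for the authoritative interface.

```python
# A hedged sketch of calling the Qingying / CogVideoX API on bigmodel.cn.
# Method and field names are assumptions based on the zhipuai SDK's video
# interface; verify against the official docs before relying on them.
import time

from zhipuai import ZhipuAI

client = ZhipuAI(api_key="your-api-key")  # key issued on bigmodel.cn

# Submit an asynchronous text-to-video task.
task = client.videos.generations(
    model="cogvideox",
    prompt="A capybara stands upright like a human, holding an ice cream cone and happily eating it.",
)

# Poll until the roughly-30-second generation finishes.
while True:
    result = client.videos.retrieve_videos_result(id=task.id)
    if result.task_status == "SUCCESS":
        print(result.video_result[0].url)  # URL of the generated 6 s clip
        break
    if result.task_status == "FAIL":
        raise RuntimeError("video generation failed")
    time.sleep(5)
```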