


Adding special effects now takes just a sentence or a single image: the company behind Stable Diffusion is playing new tricks with AIGC.
Many people have by now felt the appeal of generative AI, especially after the AIGC boom of 2022. Text-to-image generation, represented by Stable Diffusion, swept the world, and countless users poured in to express their artistic imagination with the help of AI...
Compared with image editing, video editing is a more challenging problem: it requires synthesizing new motion rather than merely modifying visual appearance, all while maintaining temporal consistency.
Many companies are exploring this space. Not long ago, Google released Dreamix, which applies a text-conditioned video diffusion model (VDM) to video editing.
Recently, Runway, one of the companies behind Stable Diffusion, launched a new AI model, "Gen-1", which converts existing videos into new ones in any style specified by a text prompt or reference image.
Paper link: https://arxiv.org/pdf/2302.03011.pdf
Project homepage: https://research.runwayml.com/gen1
In 2021, Runway collaborated with researchers at the University of Munich to build the first version of Stable Diffusion. Stability AI, a UK startup, then stepped in to fund the compute needed to train the model on more data. In 2022, Stability AI brought Stable Diffusion into the mainstream, transforming it from a research project into a global phenomenon.
Runway said it hopes Gen-1 can do for video what Stable Diffusion has done for images.
“We’ve seen an explosion of image generation models,” said Cristóbal Valenzuela, CEO and co-founder of Runway. "I really believe that 2023 will be the year of video."
Specifically, Gen-1 supports several editing modes:
1. Stylization. Transfer the style of any image or prompt to every frame of your video.
2. Storyboard. Turn mockups into fully stylized and animated renders.
3. Mask. Isolate subjects in your video and modify them with simple text prompts.
4. Rendering. Turn untextured renders into photorealistic output by applying an input image or prompt.
5. Customization. Unleash the full power of Gen-1 by customizing your model for higher-fidelity results.
A demo posted on the company's official website shows how smoothly Gen-1 can change the style of a video. Let's look at a few examples.
For example, turning "people on the street" into "clay puppets" takes just one line of prompt:
Or turn "books stacked on the table" into "cityscape at night":
From "running on the snow" to "walking on the moon":
Or turn a young girl into an ancient sage in a matter of seconds:
Paper Details
Visual effects and video editing are ubiquitous in the contemporary media landscape. As video-centric platforms gain popularity, the need for more intuitive and powerful video editing tools increases. However, due to the temporal nature of video data, editing in this format is still complex and time-consuming. State-of-the-art machine learning models show great promise in improving the editing process, but many methods have to strike a balance between temporal consistency and spatial detail.
Generative methods for image synthesis have recently seen rapid growth in quality and popularity, driven by diffusion models trained on large-scale datasets. Text-conditioned models such as DALL-E 2 and Stable Diffusion let novice users generate detailed images from nothing more than a text prompt. Latent diffusion models offer an efficient way to generate images by synthesizing in a perceptually compressed space.
In this paper, the researchers propose a controllable structure- and content-aware video diffusion model trained on a large-scale dataset of uncaptioned videos and paired text-image data. They chose monocular depth estimates to represent structure and embeddings predicted by a pre-trained neural network to represent content.
This approach offers several powerful modes of control over the generation process. First, as with image synthesis models, the researchers train the model so that the inferred video content, such as its appearance or style, matches a user-provided image or text prompt (Figure 1). Second, inspired by the diffusion process, they apply an information-obscuring process to the structure representation, making it possible to choose how strictly the model follows a given structure. Finally, they tune the inference process with a custom guidance method, inspired by classifier-free guidance, to control the temporal consistency of the generated clips.
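As background, classifier-free guidance, which the paper's temporal guidance method is modeled on, blends an unconditional and a conditional denoising prediction at each diffusion step. The minimal sketch below shows only the standard formulation, with illustrative function and tensor names; it is not the paper's exact temporal variant.

```python
import torch

def classifier_free_guidance_step(eps_uncond: torch.Tensor,
                                  eps_cond: torch.Tensor,
                                  guidance_scale: float) -> torch.Tensor:
    """Standard classifier-free guidance: push the denoising prediction away
    from the unconditional estimate and toward the conditional one.
    (Assumption: this is the vanilla formulation, not Gen-1's exact variant,
    which the paper adapts to trade off temporal consistency.)"""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Illustrative usage with dummy noise predictions for a batch of video latents.
eps_u = torch.randn(1, 4, 8, 32, 32)   # unconditional prediction
eps_c = torch.randn(1, 4, 8, 32, 32)   # content-conditioned prediction
eps = classifier_free_guidance_step(eps_u, eps_c, guidance_scale=7.5)
```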
Overall, the highlights of this study are as follows:
- Extends latent diffusion models to video generation by introducing temporal layers into a pre-trained image model and jointly training on images and videos;
- Proposes a structure- and content-aware model that modifies videos under the guidance of example images or text. Editing happens entirely at inference time, with no additional per-video training or preprocessing;
- Demonstrates full control over temporal, content, and structural consistency. The study shows for the first time that joint training on image and video data enables inference-time control over temporal consistency. For structural consistency, training on the representation at varying levels of detail allows the desired setting to be chosen at inference;
- In a user study, the method was preferred over several related approaches;
- The trained model can be further customized by fine-tuning on a small set of images to generate more faithful videos of a specific subject.
Method
For the purposes of this research, it helps to think of a video in terms of content and structure. Structure here refers to features describing its geometry and dynamics, such as the shapes and positions of subjects and how they change over time. Content refers to features describing the appearance and semantics of the video, such as the colors and styles of objects and the lighting of the scene. The goal of the Gen-1 model is to edit the content of a video while preserving its structure.
To achieve this, the researchers learn a generative model p(x|s, c) of a video x conditioned on a structure representation s and a content representation c. They infer the structure representation s from the input video and modify the video according to a text prompt c describing the edit. The paper first describes the implementation of the generative model as a conditional latent video diffusion model, then the choice of structure and content representations, and finally the model's optimization process.
The model structure is shown in Figure 2.
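As a rough illustration of the two conditioning signals, the sketch below extracts a monocular depth map (structure s) and an image embedding (content c) from a single frame. MiDaS and CLIP are used here only as stand-ins consistent with the paper's description of "monocular depth estimates" and "embeddings predicted by a pre-trained network"; the specific model names, file path, and preprocessing are assumptions, not the authors' exact pipeline.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Content representation c: an embedding from a pre-trained image encoder (CLIP here).
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Structure representation s: a monocular depth estimate (MiDaS via torch.hub).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas_transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
midas.eval()

frame = Image.open("frame_0001.png").convert("RGB")  # hypothetical input frame

with torch.no_grad():
    # c: content embedding of the frame (or of a user-provided reference image).
    inputs = clip_proc(images=frame, return_tensors="pt")
    c = clip.get_image_features(**inputs)

    # s: depth map describing the frame's geometry; blurring or masking this map
    # is what lets the model follow the input structure more or less strictly.
    s = midas(midas_transform(np.array(frame)))
```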
Experiment
To evaluate the method, the researchers used videos from DAVIS along with various other footage. To create editing prompts automatically, they first ran a captioning model to obtain a description of the original video content and then used GPT-3 to generate editing prompts.
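A rough sketch of that prompt-generation pipeline is shown below, with BLIP standing in for the captioning model and the legacy OpenAI completion API standing in for GPT-3. Both choices, the instruction text, and the file name are assumptions rather than the authors' exact setup.

```python
import openai  # legacy openai<1.0 API
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Caption the original video content (here: a single representative frame).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

frame = Image.open("frame_0001.png").convert("RGB")   # hypothetical frame
inputs = processor(images=frame, return_tensors="pt")
caption = processor.decode(captioner.generate(**inputs)[0], skip_special_tokens=True)

# Ask GPT-3 to turn the caption into an editing prompt (wording is illustrative).
openai.api_key = "YOUR_API_KEY"
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=f'Original video description: "{caption}".\n'
           f"Write a short prompt that edits the style or content of this video:",
    max_tokens=40,
)
edit_prompt = response["choices"][0]["text"].strip()
print(caption, "->", edit_prompt)
```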
Qualitative Study
As shown in Figure 5, the results demonstrate that the method performs well across a range of different inputs.
User Study
The researchers also conducted a user study on Amazon Mechanical Turk (AMT) with an evaluation set of 35 representative video editing prompts. For each sample, 5 annotators were asked to compare how faithfully the baseline method and this method reflected the editing prompt ("Which video better represents the provided edited caption?"); the videos were presented in random order, and a majority vote determined the final outcome.
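The aggregation step could look like the small sketch below; the data layout and labels are hypothetical, shown only to make the majority-vote protocol concrete.

```python
from collections import Counter

# Each inner list holds the 5 annotators' choices for one of the 35 prompts,
# "ours" or "baseline" (hypothetical layout of the raw AMT responses).
votes_per_prompt = [
    ["ours", "ours", "baseline", "ours", "baseline"],
    ["baseline", "ours", "ours", "ours", "ours"],
    # ... one list per evaluated prompt
]

# Majority vote per prompt, then the fraction of prompts where "ours" wins.
winners = [Counter(votes).most_common(1)[0][0] for votes in votes_per_prompt]
preference_rate = winners.count("ours") / len(winners)
print(f"Method preferred on {preference_rate:.0%} of prompts")
```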
The results are shown in Figure 7:
Quantitative Evaluation
Figure 6 shows the results of each model under the paper's frame-consistency and prompt-consistency metrics. The model in this paper tends to outperform the baselines on both (i.e., it sits further toward the upper right of the figure). The researchers also noted a mild tradeoff when increasing the strength parameter of the baseline models: larger strength scaling yields higher prompt consistency at the cost of lower frame consistency. They likewise observed that increasing the structure scale leads to higher prompt consistency, since the content is then less constrained by the input structure.
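The two metrics can be sketched as below, under the assumption that both are CLIP-based: frame consistency as the average cosine similarity between CLIP embeddings of consecutive output frames, and prompt consistency as the average CLIP similarity between the edit prompt and each output frame. The exact definitions in the paper may differ.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_metrics(frames, prompt):
    """frames: list of PIL images (the edited output video); prompt: edit text."""
    with torch.no_grad():
        img_in = proc(images=frames, return_tensors="pt")
        img_emb = model.get_image_features(**img_in)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

        txt_in = proc(text=[prompt], return_tensors="pt", padding=True)
        txt_emb = model.get_text_features(**txt_in)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

    # Frame consistency: mean cosine similarity of consecutive frame embeddings.
    frame_consistency = (img_emb[:-1] * img_emb[1:]).sum(dim=-1).mean().item()
    # Prompt consistency: mean cosine similarity between prompt and each frame.
    prompt_consistency = (img_emb @ txt_emb.T).mean().item()
    return frame_consistency, prompt_consistency
```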
Customization
Figure 10 shows examples from models with different numbers of customization steps and different levels of structure dependency ts. The researchers observed that customization improves fidelity to the subject's style and appearance, so that, combined with higher ts values, accurate animations can be achieved even when the driving videos show people with different characteristics.