In recent years, generative techniques in computer vision have grown steadily more powerful, and the corresponding "forgery" techniques have matured along with them. From DeepFake face swapping to motion imitation, it is becoming hard to tell real from fake.
Recently, NVIDIA made another big move, publishing a new Implicit Warping framework at the NeurIPS 2022 conference. It uses a set of source images and the motion of a driving video to produce the target animation.
## Paper link: https://arxiv.org/pdf/2210.01794.pdf
Judging by the results, the generated images are more realistic: when the person in the video moves, the background does not change.
Multiple input source images usually provide different appearance information, which reduces the room the generator has to "hallucinate". For example, the following two images are used as model input.
Compared with other models, implicit warping does not produce the kind of "spatial distortion" seen in beauty filters.
Because the person occludes part of the scene, multiple source images can also provide a more complete background.
As the video below shows, with only the one image on the left it is hard to guess whether the text in the background reads "BD" or "ED", which causes background distortion. With two images, the generated result is more stable.
Even with only one source image, the method still performs better than the other models.
Magical Implicit Warping
Academic interest in video imitation dates back to 2005. Projects such as real-time expression transfer for facial reenactment, Face2Face, Synthesizing Obama, Recycle-GAN, ReenactGAN and dynamic neural radiance fields made varied use of the few techniques available at the time, such as generative adversarial networks (GANs), neural radiance fields (NeRF) and autoencoders.
Not all of these methods try to generate video from a single image. Some studies perform complex computation on every frame of the video, which is the imitation route that deepfakes take.
However, because a DeepFake model has less information to work with, this approach requires training for each video clip, and it performs worse than the open-source DeepFaceLab or FaceSwap, both of which can impose an identity onto any number of video clips.
The FOMM model released in 2019 allows characters to move with the video, giving the video imitation task another shot in the arm.
Other researchers subsequently attempted to derive multiple poses and expressions from a single face image or full-body representation. However, this approach generally only worked for relatively expressionless and immobile subjects, such as a fairly static "talking head", where there are no sudden changes in facial expression or gesture for the network to interpret.
Although some of these techniques gained public attention before deepfakes and latent diffusion image synthesis became popular, their range of applicability was limited and their versatility was in question.
The implicit warping that NVIDIA presents this time obtains information across multiple frames, or even just two frames, rather than deriving all the necessary pose information from a single frame. This setup is either absent from competing models or handled very poorly by them.
For example, in Disney's workflow, senior animators draw the main frames and keyframes, while junior animators draw the in-between frames. In tests on earlier methods, NVIDIA researchers found that the quality of their results deteriorated as additional "keyframes" were supplied, whereas the new method behaves in line with the logic of animation production: its performance improves in a linear fashion as the number of keyframes increases.
If something changes abruptly in the middle of a clip, such as an event or expression that appears in neither the start frame nor the end frame, implicit warping lets a frame be added at that midpoint, and the additional information is fed to the attention mechanism for the entire clip.
Previous methods such as FOMM, Monkey-Net and face-vid2vid use explicit warping to draw a time series, and the information extracted from the source face and the driving motion must be adapted to, and stay consistent with, that time series. Under this design, the final mapping of keypoints is quite rigid.
In contrast, implicit warping uses a cross-modal attention layer with less predefined bootstrapping in its workflow, and it can adapt to input from multiple frames. The workflow also does not require warping on a per-keypoint basis; the system can select the most appropriate features from a series of images.
Implicit warping also reuses some of the keypoint prediction components of the FOMM framework, and then encodes the derived spatial driving-keypoint representation with a simple U-Net. A separate U-Net encodes the source image together with the derived spatial representation. Both networks can operate at resolutions ranging from 64px (for 256px square output) to 384x384px.
Because this mechanism cannot automatically account for every possible change in pose and motion in a given video, additional keyframes are sometimes necessary and can be added as needed.
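
The idea that an attention layer can pick the most suitable features from several source images can be illustrated with a toy example. The following is a minimal sketch, not NVIDIA's implementation: it assumes plain scaled dot-product attention, and the function `cross_attention`, the lists `source_a` and `source_b`, and all feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention weights and the weighted sum of the values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hypothetical toy features, one (key, value) pair per image location.
# Key axis 0 stands for "background", key axis 1 for "person".
source_a = [([4.0, 0.0], [1.0, 0.0]),  # image A: background is visible here
            ([0.0, 4.0], [0.0, 1.0])]  # image A: person
source_b = [([0.0, 4.0], [0.0, 0.9])]  # image B: person occludes the background

# Keys and values from all source images are pooled into one set,
# so an extra source image simply lengthens these lists.
pooled = source_a + source_b
keys = [k for k, _ in pooled]
values = [v for _, v in pooled]

# A query derived from the driving keypoints asks for background content.
weights, output = cross_attention([4.0, 0.0], keys, values)
print(weights, output)
```

In this toy setup nearly all of the attention weight lands on the one location, in image A, where the background is visible. No per-keypoint warp is computed at any point; the selection falls out of the similarity between query and keys.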
Without this ability to intervene, keys that are not similar enough to the target motion point would still be selected automatically, lowering output quality. The researchers explain that although such a key is the most similar one to the query within the given set of keyframes, it may not be similar enough to produce good output.
For example, suppose the source image shows a face with closed lips, while the driving image shows a face with open lips and exposed teeth. In that case, no key (or value) in the source image is appropriate for driving the mouth region. The method overcomes this by learning additional image-independent key-value pairs, which compensate for information missing from the source image.
Although the current implementation is fairly fast, at around 10 FPS on 512x512px images, the researchers believe a future version of the pipeline could be optimized with a factorized 1-D attention layer or a Spatial Reduction Attention (SRA) layer, as used in the Pyramid Vision Transformer.
Because implicit warping uses global rather than local attention, it can predict factors that previous models cannot.
The researchers tested the system on the VoxCeleb2 dataset, the more challenging TED Talk dataset and the TalkingHead-1KH dataset, comparing against baselines at 256x256px and at the full 512x512px resolution, using metrics including FID, AlexNet-based LPIPS and peak signal-to-noise ratio (PSNR). The competing frameworks used in testing include FOMM and face-vid2vid, as well as AA-PCA.
Because previous methods have little or no ability to use multiple keyframes, which is the main innovation of implicit warping, the researchers also devised comparable test setups for them.
Implicit warping outperforms most competing methods on most metrics. In the multi-keyframe reconstruction test, where the researchers used sequences of up to 180 frames and selected frames at intervals, implicit warping came out ahead overall.
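
The closed-lips example can be made concrete with a small sketch of why learned, image-independent key-value pairs help. This is an illustration under stated assumptions rather than the paper's code: it assumes ordinary softmax attention, and `attend`, `extra_keys`, `extra_values` and every number are hypothetical (in a real model the extra pairs would be trained, not set by hand).

```python
import math


def attend(query, keys, values):
    """Softmax attention for one query; returns (weights, output)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output


# Hypothetical source features: the source face only shows closed lips.
source_keys = [[3.0, 0.0]]    # key meaning "closed lips"
source_values = [[1.0, 0.0]]  # appearance of closed lips

# Hypothetical learned pair that does not depend on any input image.
extra_keys = [[0.0, 3.0]]     # key meaning "open mouth with teeth"
extra_values = [[0.0, 1.0]]   # generic appearance of teeth

query = [0.0, 3.0]            # the driving frame asks for an open mouth

# Softmax weights always sum to 1, so with source keys alone the
# closed-lips entry receives all the weight even though it matches badly.
w_src, out_src = attend(query, source_keys, source_values)

# With the learned pair appended, the query can fall back on it instead.
w_all, out_all = attend(query, source_keys + extra_keys,
                        source_values + extra_values)
print(out_src, out_all)
```

With source keys only, the output is the closed-lips appearance, because the best available key wins regardless of how poor the match is. Once the extra pair is appended, almost all of the weight moves to it and the output is close to the teeth appearance.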
As the number of source images increases, this method achieves better reconstruction results, and its scores on all metrics improve. For previous work, by contrast, reconstruction gets worse as the number of source images increases, contrary to expectations.
A qualitative study carried out with AMT (Amazon Mechanical Turk) workers also found the output of implicit warping to be better than that of the other methods.
Access to this framework would let users create longer and more coherent video simulations and full-body deepfake videos, while exhibiting a much greater range of motion than any of the frameworks the system was tested against.
But research into more realistic image synthesis also raises concerns, because these techniques can easily be used for forgery, and the paper carries the standard disclaimer. If the method is used to create DeepFake products, it may have negative impacts: using cross-identity transfer and speech synthesis, malicious actors could create fake videos of a person, leading to identity theft or the spread of fake news. In controlled settings, however, the same technology can also be used for entertainment.
The paper also points out the system's potential for neural video reconstruction, as in Google's Project Starline. In such a framework, the reconstruction work happens mainly on the client side, using sparse motion information from the person at the other end.
This approach has attracted growing interest from the research community, and some companies intend to implement low-bandwidth conference calls by sending pure motion data or sparsely spaced keyframes. These keyframes would be interpreted and interpolated into full-HD video on reaching the target client.