Uploaders have already started putting Tencent's open-source AniPortrait to mischievous use, making photos sing and talk

王林
Published: 2024-04-07 09:01:16
Reprinted

The AniPortrait model is open source, and anyone is free to play around with it.


""Xiaopozhan Ghost Zone 用の新しい生産性ツール。"

This is how a new project recently released by Tencent Open Source was described on Twitter. The project is AniPortrait, which generates high-quality animated portraits from audio and a reference image.

Without further ado, let's look at a demo, the kind that invites a cease-and-desist letter:
[Demo video]
Animated images can talk just as easily:
[Demo video]
Just a few days after launch, the project has already been widely praised, with more than 2,800 GitHub stars.


Let's take a look at what makes AniPortrait innovative.


  • Paper title: AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation
  • Paper address: https://arxiv.org/pdf/2403.17694.pdf
  • Code address: https://github.com/Zejun-Yang/AniPortrait

AniPortrait

Tencent's newly proposed AniPortrait framework contains two modules: Audio2Lmk and Lmk2Video.

Audio2Lmk extracts from the audio input a landmark sequence that captures intricate facial expressions and lip movements. Lmk2Video then uses this landmark sequence to generate temporally stable, consistent, high-quality portrait videos.
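
To make the data flow concrete, here is a shape-level sketch of the two stages. The function names, vertex count (468, as in MediaPipe), and array shapes are illustrative placeholders, not the repository's actual interfaces.

```python
import numpy as np

# Placeholder stubs illustrating the two-stage pipeline's inputs and outputs.
def audio2lmk(audio: np.ndarray, num_frames: int = 150, num_vertices: int = 468):
    """Audio waveform -> (3D mesh sequence [T, V, 3], head pose sequence [T, 6])."""
    return np.zeros((num_frames, num_vertices, 3)), np.zeros((num_frames, 6))

def lmk2video(landmarks_2d: np.ndarray, reference_image: np.ndarray) -> np.ndarray:
    """2D landmark sequence [T, V, 2] + reference portrait -> video frames [T, H, W, 3]."""
    return np.zeros((landmarks_2d.shape[0], 512, 512, 3), dtype=np.uint8)
```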

Figure 1 gives an overview of the AniPortrait framework.

Audio2Lmk

For a sequence of speech clips, the goal here is to predict the corresponding 3D face mesh sequence and head pose sequence.

The team used a pre-trained wav2vec model to extract audio features. The model generalizes well and accurately recognizes pronunciation and intonation in the audio, which is crucial for generating realistic facial animations. The robust speech features obtained this way are then converted into 3D face meshes by a simple architecture consisting of two fully connected (fc) layers. The team observed that this simple, straightforward design not only ensures accuracy but also improves the efficiency of inference.
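
As a rough illustration of this design, the following PyTorch sketch pairs a frozen pre-trained wav2vec 2.0 encoder with a two-layer fully connected head that regresses per-frame 3D mesh vertices. The hidden size and vertex count (468, as in MediaPipe) are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class Audio2Mesh(nn.Module):
    def __init__(self, num_vertices: int = 468, hidden: int = 512):
        super().__init__()
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.wav2vec.requires_grad_(False)          # keep the speech encoder frozen
        feat_dim = self.wav2vec.config.hidden_size  # 768 for the base model
        self.head = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_vertices * 3),    # (x, y, z) per vertex
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: [batch, samples] at 16 kHz
        feats = self.wav2vec(waveform).last_hidden_state  # [batch, frames, feat_dim]
        verts = self.head(feats)                          # [batch, frames, V * 3]
        return verts.view(*feats.shape[:2], -1, 3)        # [batch, frames, V, 3]
```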

For the task of converting audio into head poses, the backbone network is still wav2vec, but its weights are not shared with the audio-to-mesh module's network. This is because pose is more closely related to the rhythm and pitch of the audio, whereas the audio-to-mesh task focuses on pronunciation and intonation. To take the influence of previous states into account, the team employed a transformer decoder to decode the pose sequence; in this process, the module integrates the audio features into the decoder through a cross-attention mechanism. Both modules are trained with a simple L1 loss.
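
Below is a hedged sketch of this audio-to-pose idea: a transformer decoder cross-attends to the wav2vec features while decoding a pose sequence conditioned on previous poses, trained with an L1 loss. The dimensions and the 6-D pose parameterization are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class Audio2Pose(nn.Module):
    def __init__(self, audio_dim: int = 768, d_model: int = 256, pose_dim: int = 6):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.pose_embed = nn.Linear(pose_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, pose_dim)

    def forward(self, audio_feats: torch.Tensor, prev_poses: torch.Tensor) -> torch.Tensor:
        # audio_feats: [B, T, audio_dim], prev_poses: [B, T, pose_dim]
        memory = self.audio_proj(audio_feats)
        tgt = self.pose_embed(prev_poses)
        T = tgt.size(1)
        # causal mask so each frame only sees previous pose states
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)  # cross-attends to audio
        return self.out(hidden)

# Both Audio2Mesh and Audio2Pose are trained with a plain L1 loss.
loss_fn = nn.L1Loss()
```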

After obtaining the mesh and pose sequences, the team uses perspective projection to convert them into a sequence of 2D facial landmarks. These landmarks serve as the input signal for the next stage.
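
Conceptually, the projection step looks like the following sketch, which applies the per-frame head pose to the mesh and projects with a pinhole camera; the intrinsic parameters here are placeholders, not the values used in the paper.

```python
import numpy as np

def project_landmarks(mesh: np.ndarray, R: np.ndarray, t: np.ndarray,
                      focal: float = 512.0, cx: float = 256.0, cy: float = 256.0) -> np.ndarray:
    # mesh: [V, 3] 3D vertices; R: [3, 3] rotation; t: [3] translation (one frame)
    cam = mesh @ R.T + t                      # mesh in camera coordinates
    x = focal * cam[:, 0] / cam[:, 2] + cx    # pinhole projection to pixel space
    y = focal * cam[:, 1] / cam[:, 2] + cy
    return np.stack([x, y], axis=-1)          # [V, 2] 2D landmarks
```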

Lmk2Video

Given a reference portrait and a sequence of facial landmarks, the team's Lmk2Video module creates temporally consistent portrait animations. The animation process aligns the motion with the landmark sequence while keeping the appearance consistent with the reference image; the idea is to represent the portrait animation as a sequence of portrait frames.

Lmk2Video's network design is inspired by AnimateAnyone. The backbone network is SD1.5, with an integrated temporal motion module that effectively converts multi-frame noise inputs into a sequence of video frames.

In addition, the team uses a ReferenceNet, which also adopts the SD1.5 structure. Its role is to extract the appearance information of the reference image and integrate it into the backbone network. This design ensures that the facial identity remains consistent throughout the output video.

Unlike AnimateAnyone, the team increased the complexity of PoseGuider's design. The original version simply integrates a few convolutional layers, after which the landmark features are fused with the latent features at the backbone's input layer. The Tencent team found that this rudimentary design fails to capture the complex movements of the lips. They therefore adopted ControlNet's multi-scale strategy: landmark features at the corresponding scales are integrated into different modules of the backbone network. Despite these improvements, the parameter count of the final model remains quite low.
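
The multi-scale idea can be sketched as a small convolutional pyramid whose feature maps are injected into the matching scales of the denoising backbone; the channel counts below are illustrative and not the actual PoseGuider configuration.

```python
import torch
import torch.nn as nn

class MultiScalePoseGuider(nn.Module):
    def __init__(self, in_ch: int = 3, channels=(64, 128, 256, 512)):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, channels[0], 3, padding=1)
        self.downs = nn.ModuleList(
            nn.Conv2d(channels[i], channels[i + 1], 3, stride=2, padding=1)
            for i in range(len(channels) - 1)
        )

    def forward(self, landmark_image: torch.Tensor) -> list[torch.Tensor]:
        # landmark_image: [B, 3, H, W] rendered pose image
        feats = [torch.relu(self.stem(landmark_image))]
        for down in self.downs:
            feats.append(torch.relu(down(feats[-1])))
        # one feature map per backbone scale, to be added to the U-Net features
        return feats
```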

The team also introduced another improvement: using the landmarks of the reference image as an additional input. PoseGuider's cross-attention module facilitates interaction between the reference landmarks and the target landmarks of each frame. This gives the network additional cues about the relationship between facial landmarks and appearance, helping it generate portrait animations with more precise motion.

Experiment

Implementation details

The backbone network used in the Audio2Lmk stage is wav2vec2.0. The tool used to extract 3D meshes and 6D poses is MediaPipe. Audio2Mesh’s training data comes from Tencent’s internal dataset, which contains nearly an hour of high-quality speech data from a single speaker.

To ensure that the 3D meshes extracted by MediaPipe are stable, the performer's head was kept in a stable position and facing the camera during recording. Audio2Pose is trained on HDTF. All training is performed on a single A100 using the Adam optimizer with a learning rate of 1e-5.
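
For reference, this is roughly how 3D facial landmarks can be pulled from a frame with MediaPipe Face Mesh; solving the 6D head pose from these landmarks (e.g., via a PnP step) is omitted here.

```python
import cv2
import mediapipe as mp

# Extract 3D face landmarks from a single portrait frame with MediaPipe Face Mesh.
face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=True, refine_landmarks=True)

frame = cv2.imread("frame.png")                              # any portrait frame
results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
if results.multi_face_landmarks:
    landmarks = results.multi_face_landmarks[0].landmark     # 468+ normalized (x, y, z) points
    print(len(landmarks), landmarks[0].x, landmarks[0].y, landmarks[0].z)
```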

The Lmk2Video process uses a two-step training method.

The initial step focuses on training the 2D components of the backbone network, ReferenceNet, and PoseGuider, leaving out the motion module. In the subsequent step, all other components are frozen and training focuses on the motion module. Two large-scale, high-quality facial video datasets, VFHQ and CelebV-HQ, are used to train the model. All data is passed through MediaPipe to extract 2D facial landmarks. To improve the network's sensitivity to lip movements, the team annotated the upper and lower lips with different colors when rendering the pose images from the 2D landmarks.
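
The lip-coloring trick can be sketched as follows: when rasterizing the 2D landmarks into a pose image, upper-lip and lower-lip points are drawn in two distinct colors. The landmark index sets below are placeholders; the real ones depend on the MediaPipe landmark topology.

```python
import cv2
import numpy as np

UPPER_LIP_IDX = set(range(0, 20))      # placeholder indices, not the real topology
LOWER_LIP_IDX = set(range(20, 40))     # placeholder indices, not the real topology

def render_pose_image(landmarks_2d: np.ndarray, size: int = 512) -> np.ndarray:
    # landmarks_2d: [V, 2] pixel coordinates for one frame
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    for i, (x, y) in enumerate(landmarks_2d.astype(int)):
        if i in UPPER_LIP_IDX:
            color = (0, 0, 255)        # upper lip drawn in one color
        elif i in LOWER_LIP_IDX:
            color = (0, 255, 0)        # lower lip drawn in another
        else:
            color = (255, 255, 255)    # remaining landmarks in white
        cv2.circle(canvas, (int(x), int(y)), 2, color, -1)
    return canvas
```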

All images have been rescaled to 512x512. The model was trained using 4 A100 GPUs, with each step taking 2 days. The optimizer is AdamW and the learning rate is fixed at 1e-5.

Experimental results

As shown in Figure 2, the animations produced by the new method are excellent in both quality and realism.

[Figure 2]

Additionally, users can edit the intermediate 3D representation to modify the final output. For example, users can extract landmarks from one source and alter the identity information to achieve face reenactment, as shown in the following video:
[Demo video]
Please refer to the original paper for more details.


Source: jiqizhixin.com