Open source and trained on a single GPU: understanding AudioLDM, the popular text-guided audio generation model


王林

Apr 12, 2023 pm 07:04 PM

Model Open source

Given a piece of text, artificial intelligence can generate music, speech, sound effects of all kinds, and even imaginary sounds such as black holes and laser guns. AudioLDM, recently released jointly by the University of Surrey and Imperial College London, quickly became popular abroad: within a week it received nearly 300 retweets and 1,500 likes on Twitter. The day after the model was open sourced, AudioLDM reached the top of the Hugging Face trending list, and within a week it entered Hugging Face's 40 most popular applications (out of about 25,000 in total). Many derivative works based on AudioLDM soon followed.

AudioLDM has the following highlights:

The first open source model that can generate music, speech, and sound effects from text at the same time.
Developed in academia, it achieves the current best results with less data, a single GPU, and a smaller model.
It proposes training the generative model in a self-supervised manner, so that text-guided audio generation is no longer limited by the scarcity of paired text-audio data.
Without additional training (zero-shot), the model can perform audio style transfer, audio inpainting, and audio super-resolution.

Project homepage: https://audioldm.github.io/
Paper: https://arxiv.org/abs/2301.12503
Open source code and model: https://github.com/haoheliu/AudioLDM
Hugging Face Space: https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation

The authors first released a preview of the model on January 27th, showing how a very simple text prompt, "A music made by []", can generate different sounds. The video, which features music made with different instruments and even a mosquito, quickly gained traction on Twitter, with over 35.4K plays and over 130 retweets.

The author then released the paper and a new video. In this video, the author demonstrates most of the capabilities of the model, as well as the effect of working with ChatGPT to generate sounds. AudioLDM can even generate sounds from outer space.

The authors then released the paper, the pre-trained model, and an interactive demo, which ignited the enthusiasm of Twitter users. The next day, AudioLDM took first place on the Hugging Face trending list.

This work has received widespread attention on Twitter, where researchers in the field have shared and commented on it.

Netizens used AudioLDM to generate a variety of sounds.

For example, the snoring of an anime cat girl, and the voice of a ghost.

Some netizens synthesized "The sound of a mummy, low frequency, with some painful moans."

Some even synthesized a "melodic fart sound".

One can only marvel at the rich imagination of netizens.

Some netizens directly used AudioLDM to generate a series of music albums in various styles, including jazz, funk, electronic and classical. Some of the music is quite inventive.

For example, "Create an ambient music with the theme of the universe and the moon" and "Create a music using the sounds of the future".

Interested readers can visit the music album website: https://www.latent.store/albums

Some netizens also combined an image-to-text model with AudioLDM to build an application in which a picture guides sound effect generation.

For example, given the text "A dog running in the water with a frisbee", AudioLDM can generate the sound of a dog splashing through the water.

It can even recover the sounds of old photos. For one such photo, the text "A man and a woman sitting at a bar" was obtained, and from it the model generated audio in which you can hear indistinct voices and the clinking of wine glasses in the background.

Some netizens used AudioLDM to generate the sound of a flaming dog, which is very interesting.

The authors also produced a video demonstrating the model's sound effect generation, showing that AudioLDM's generated samples come close to the quality of a sound effects library.

In fact, text-to-audio generation is only part of what AudioLDM can do. It can also perform timbre conversion, audio inpainting, and super-resolution.

The two examples show timbre conversion from (1) percussion to ambient music and (2) trumpet to a child singing.

The first moves from percussion to ambient music with gradually increasing conversion strength.

The second turns the sound of a trumpet into the sound of a child singing, again with gradually increasing conversion strength.

Below we show the model's results on audio super-resolution, audio inpainting, and control over sound materials. Because of limited space, audio is mainly shown as spectrograms; interested readers can visit the AudioLDM project homepage: https://audioldm.github.io/

AudioLDM also performs very well at audio super-resolution. Unlike previous super-resolution models, AudioLDM is a general-purpose super-resolution model and is not limited to music and speech.

For audio inpainting, AudioLDM can fill in different audio content according to the given text, and the transitions at the boundaries are relatively natural.

In addition, AudioLDM shows strong control over attributes such as acoustic environment, musical mood and tempo, object material, pitch, and the order of sounds. Interested readers can find details of these control capabilities in the AudioLDM paper or on the project homepage.

In the paper, the authors evaluated AudioLDM with both subjective ratings and objective metrics. On both, it significantly outperforms the previous best models.

AudioGen is a model proposed by Facebook in October 2022 that uses ten datasets, 64 GPUs, and 285M parameters. By comparison, AudioLDM-S achieves better results with a single dataset, one GPU, and 181M parameters.

Subjective scoring also shows that AudioLDM is significantly better than the previous solution DiffSound. So, what improvements has AudioLDM made to make the model have such excellent performance?

First of all, in order to solve the problem of too few text-audio data pairs, the author proposed a self-supervised method to train AudioLDM.

Specifically, when training the core module, the latent diffusion model (LDM), the authors use the embedding of the audio itself as the LDM's conditioning signal, so the entire training process involves no text (as shown in the figure above). This scheme builds on a pair of pre-trained contrastive language-audio encoders (CLAP), which demonstrated good generalization in the original CLAP paper. Because CLAP maps audio and text into a shared embedding space, the text embedding can take the place of the audio embedding at generation time. AudioLDM exploits CLAP's generalization to train on large-scale audio data without text labels.
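The conditioning swap described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' code: `ToySharedEncoder`, its random projection matrices, and `EMBED_DIM` are hypothetical stand-ins for a pre-trained CLAP model, and the input features are random placeholders. It only shows that the training-time condition (computed from audio) and the sampling-time condition (computed from text) have the same shape and live in the same space, which is why the diffusion model can be trained without text.

```python
import numpy as np

EMBED_DIM = 8  # hypothetical size of the shared audio-text embedding space


def l2_normalize(v):
    """Scale each row to unit length, as contrastive encoders typically do."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


class ToySharedEncoder:
    """Hypothetical stand-in for a CLAP-style audio/text encoder pair."""

    def __init__(self, audio_dim, text_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Random projections replace the real pre-trained networks.
        self.w_audio = rng.normal(size=(audio_dim, EMBED_DIM))
        self.w_text = rng.normal(size=(text_dim, EMBED_DIM))

    def embed_audio(self, audio_features):
        return l2_normalize(audio_features @ self.w_audio)

    def embed_text(self, text_features):
        return l2_normalize(text_features @ self.w_text)


def training_condition(encoder, audio_features):
    """Training: the condition is computed from the target audio; no text."""
    return encoder.embed_audio(audio_features)


def sampling_condition(encoder, text_features):
    """Sampling: the text embedding replaces the audio embedding."""
    return encoder.embed_text(text_features)


rng = np.random.default_rng(1)
encoder = ToySharedEncoder(audio_dim=32, text_dim=16)
cond_train = training_condition(encoder, rng.normal(size=(4, 32)))
cond_sample = sampling_condition(encoder, rng.normal(size=(4, 16)))
print(cond_train.shape, cond_sample.shape)
```

In a real system the two encoders would be trained contrastively so that matching audio and text land close together; the random projections here make no such guarantee.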

In fact, the authors found that training with audio alone works even better than training with audio-text pairs.

The authors give two reasons: (1) text annotations can hardly capture all the information in the audio, such as the acoustic environment and frequency distribution, so the text embedding does not represent the audio well; (2) the quality of the text itself is imperfect. For example, given an annotation such as "Boats: Battleships-5.25 conveyor space", even a human would find it hard to imagine the specific sound, and such labels cause problems in model training. In contrast, using the audio itself as the LDM's condition ensures a strong correlation between the target audio and the condition, which leads to better generation results.

In addition, the latent diffusion approach adopted by the authors lets the diffusion model operate in a smaller latent space, greatly reducing the model's compute requirements.
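To make the "smaller space" point concrete, here is a minimal sketch that compares the number of values in a spectrogram with the number in a compressed latent, then applies one standard forward-diffusion noising step to that latent. The spectrogram size, downsampling factor, channel count, and linear noise schedule are illustrative assumptions, not the settings reported in the paper, and `latent_shape` describes a hypothetical autoencoder rather than AudioLDM's actual one.

```python
import numpy as np


def latent_shape(mel_shape, downsample, channels):
    """Latent shape for a hypothetical autoencoder that shrinks both
    spectrogram axes by `downsample` and outputs `channels` feature maps."""
    frames, bins = mel_shape
    return (channels, frames // downsample, bins // downsample)


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(z0, t, alpha_bars, rng):
    """Forward diffusion: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return zt, eps


mel_shape = (1024, 64)  # illustrative: time frames x mel bins
z_shape = latent_shape(mel_shape, downsample=8, channels=8)
ratio = np.prod(mel_shape) / np.prod(z_shape)  # values saved per clip

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars(num_steps=1000)
z0 = rng.standard_normal(z_shape)  # placeholder for an encoded audio clip
zt, eps = add_noise(z0, t=500, alpha_bars=alpha_bars, rng=rng)
print(z_shape, ratio)
```

Every denoising step of the diffusion model then works on tensors of the latent size rather than the spectrogram size, which is where the compute saving comes from.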

Many detailed explorations in model training and structure also help AudioLDM achieve excellent performance.

The authors also drew a simple structure diagram to introduce the two main downstream tasks.

The author also conducted detailed experiments with different model structures, model sizes, DDIM sampling steps, and different Classifier-free Guidance Scales.
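For readers unfamiliar with the two sampling knobs mentioned above, the sketch below shows the standard classifier-free guidance blend and a deterministic DDIM update, wired into a short sampling loop. It is a generic illustration rather than the authors' implementation: `predict_noise` is a hypothetical callable standing in for the trained denoising network, and the function names and timestep spacing are assumptions made for this example.

```python
import numpy as np


def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Blend conditional and unconditional noise predictions.

    A scale of 0 ignores the condition, 1 is plain conditional sampling,
    and values above 1 push the sample harder toward the condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


def ddim_step(zt, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from step t to an earlier step."""
    z0_pred = (zt - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * z0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps


def sample(predict_noise, shape, alpha_bars, num_steps, guidance_scale, rng):
    """Run `num_steps` DDIM updates over an evenly spaced subset of timesteps.

    `predict_noise(z, t, conditional)` is a hypothetical stand-in for the
    trained network; `conditional` selects the conditioned or unconditioned pass.
    """
    timesteps = np.linspace(len(alpha_bars) - 1, 0, num_steps).round().astype(int)
    z = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        eps = classifier_free_guidance(
            predict_noise(z, t, True), predict_noise(z, t, False), guidance_scale
        )
        is_last = i == len(timesteps) - 1
        alpha_bar_prev = 1.0 if is_last else alpha_bars[timesteps[i + 1]]
        z = ddim_step(z, eps, alpha_bars[t], alpha_bar_prev)
    return z
```

Fewer DDIM steps mean fewer network evaluations and faster sampling, usually at some cost in quality; the guidance scale trades diversity for closer adherence to the prompt.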

Along with the model, the authors also released the code base of their evaluation system for generative models, to unify how the academic community evaluates such problems in the future and to make comparisons between papers easier.

The authors' team said it would restrict use of the model, especially commercial use, so that it is used only for academic exchange, and would apply an appropriate license and watermarking to guard against ethical problems.

Author information

The paper has two co-first authors: Liu Haohe (University of Surrey, UK) and Chen Zehua (Imperial College London, UK).

Liu Haohe is currently a PhD student at the University of Surrey, UK, supervised by Professor Mark D. Plumbley. His open source projects have received thousands of stars on GitHub. He has published more than 20 papers at major academic conferences and placed in the top three in several world machine acoustics competitions. In industry, he has collaborated extensively with Microsoft, ByteDance, the British Broadcasting Corporation, and others. Personal homepage: https://www.surrey.ac.uk/people/haohe-liu

Chen Zehua is a doctoral student at Imperial College London, studying under Professor Danilo Mandic. He has interned at the Microsoft Speech Synthesis Research Group and the JD Artificial Intelligence Laboratory. His research interests include generative models, speech synthesis, and bioelectrical signal generation.
