In August 2022, an AI-generated digital painting titled "Space Opera" took first prize in an art competition and stirred huge controversy; incidents of AIGC (AI-Generated Content) breaking into the public eye have appeared one after another since. Then, on November 30 of the same year, OpenAI released the chatbot ChatGPT free and open to the public, arousing widespread interest in AIGC. People threw all kinds of questions at it, from fixing code and discussing general knowledge to asking about life, and ChatGPT's "wit" and "erudition" left a deep and refreshing impression.
Behind ChatGPT's wide appeal is the fact that OpenAI had already released three generations of GPT models, each with 10 or even 100 times the parameters of the previous one. The GPT-3.5 generation additionally uses RLHF (Reinforcement Learning from Human Feedback), which lets the model better grasp the intent of human language, so that when chatting, writing articles, answering inquiries, or checking code, it behaves more like a person who answers only after careful thought.
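To make the RLHF idea slightly more concrete, here is a minimal sketch of the preference-modeling step at its core, written in PyTorch with made-up dimensions and stand-in embeddings; it illustrates the standard pairwise preference loss, not OpenAI's actual implementation.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a scalar score."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

# Pairwise preference loss: the human-preferred ("chosen") response
# should score higher than the rejected one.
model = RewardModel()
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)  # stand-in embeddings
loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()  # an optimizer step would follow; RL (e.g. PPO) uses this reward downstream
```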
Facing these hot topics, Stephen, a researcher on the Huoshan Voice speech and audio synthesis algorithm team, commented: "AIGC's recent popularity is inseparable from the steady improvement in the quality of the content AI produces. As a production tool, AI brings higher efficiency. AIGC spans many directions, including text generation, audio generation, image generation, and video generation, and this will in turn stimulate the rapid development of the AI technology behind it and gradually reveal great commercial value."
We often marvel at the "imagination" on display in AI painting, and AI question answering as represented by ChatGPT can shock us with its erudition and the readability of its answers. AI voice, by contrast, tests whether a machine can understand content the way a real person does and express it with a matching timbre and a tone suited to the situation. This is exactly what the cooperation between Huoshan Voice (ByteDance AI Lab's intelligent voice and audio team) and Tomato Novel shows: voices generated by AI algorithms let you listen directly to any text-format novel, and they sound "smart" too. With differentiated timbres and appropriate tones, the AI can become a "drama queen", performing the full range of joy, anger, sorrow, and delight as it reads aloud.
It is understood that for AI to be expressive in both speech and performance, the first requirement is that the output content is not misread, which calls for a text analysis model. "For Tomato Novel, our text analysis front-end uses BERT, the Transformer-architecture model widely used in NLP. Mainly through a text normalization (TN) model and a multi-task front-end model that mix neural networks with rules, combined with long-term manual rule correction, we continuously improve the sentence-level accuracy of the front-end, and we reduce computing power requirements through techniques such as distillation and quantization."
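The rule layer of such a hybrid TN front-end can be pictured with a toy sketch like the one below, assuming English-style normalization patterns invented for illustration; a real system covers vastly more cases and defers ambiguous tokens to a neural classifier such as a distilled BERT.

```python
import re

# Rule layer of a hybrid TN front-end: unambiguous patterns are expanded
# deterministically; ambiguous tokens would be deferred to a neural model.
RULES = [
    (re.compile(r"\b(\d+)%"), lambda m: f"{m.group(1)} percent"),
    (re.compile(r"\$(\d+)\b"), lambda m: f"{m.group(1)} dollars"),
]

def normalize(text: str) -> str:
    for pattern, expand in RULES:
        text = pattern.sub(expand, text)
    return text

print(normalize("Sales rose 12% to $30"))  # -> "Sales rose 12 percent to 30 dollars"
```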
In addition, to make the voices sound better, the team added more functional modules on top of the regular TTS pipeline to achieve role attribution and emotional control. For example, the BERT structure is also used in role attribution to model the two tasks of dialogue detection and reference disambiguation, and a similar structure is used for emotion prediction. "Novels usually contain multi-person conversations, and each speaker has a range of emotions. If timbre and emotion can be decoupled, the expressiveness of the synthesized speech can be controlled much better, so the flexible combination of different timbres with different emotions is very important."
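One way to picture a shared backbone serving several front-end tasks is the following toy multi-task model in PyTorch; the encoder, dimensions, and label sets are all stand-ins for illustration, not the team's actual architecture.

```python
import torch
import torch.nn as nn

class FrontEndMultiTask(nn.Module):
    """Shared encoder with two task heads, mirroring the idea of one
    BERT-style backbone serving both speaker attribution and emotion
    prediction. All sizes here are illustrative."""
    def __init__(self, vocab=10000, dim=256, n_speakers=8, n_emotions=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.speaker_head = nn.Linear(dim, n_speakers)  # who says this line?
        self.emotion_head = nn.Linear(dim, n_emotions)  # with what emotion?

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids)).mean(dim=1)  # pooled utterance vector
        return self.speaker_head(h), self.emotion_head(h)

tokens = torch.randint(0, 10000, (2, 32))  # two utterances, 32 tokens each
speaker_logits, emotion_logits = FrontEndMultiTask()(tokens)
```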
Importantly, to enable AI to understand the text of all kinds of novels, Huoshan Voice also took the lead in proposing an "AI text understanding" model, a long-text understanding AI system built around multiple tasks. It can automatically distinguish the dialogue characters in novel text, identify the emotions the dialogue is meant to express, and predict reasonable pauses between sentences, which greatly improves the production efficiency of high-quality AI audiobooks and effectively breaks through the production bottleneck of manual annotation.
"Al text understanding" model
Furthermore, on top of clear pronunciation, coherent rhythm, and intonation with ups and downs, the Huoshan Voice team developed its own semi-supervised, end-to-end style-control acoustic model, which lets the voice follow Plutchik's Wheel of Emotions and display a variety of emotional colors such as happiness, sadness, surprise, and fear. Using emotion transfer, it gives an originally emotionless voice a multi-emotion synthesis capability, better realizing the goal of "expressing feeling through sound". The team also meticulously models and reproduces the "paralanguage" phenomena common in human speech, realizing the accented pauses, questioning tones, laughter, crying, sighs, shouts, and other sounds that frequently appear in audiobooks, achieving a vivid interpretation of the text.
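The decoupling-and-transfer idea can be sketched as conditioning an acoustic decoder on separate timbre and emotion embeddings; the PyTorch toy below assumes invented dimensions and a deliberately simplified decoder, and is only a schematic of the approach rather than the team's model.

```python
import torch
import torch.nn as nn

class EmotionConditionedTTS(nn.Module):
    """Sketch of decoupled conditioning: a speaker (timbre) embedding and
    an emotion embedding are combined and fed to the acoustic decoder, so
    the same voice can be re-rendered with a different emotion."""
    def __init__(self, dim=128, n_speakers=10, n_emotions=8, n_mels=80):
        super().__init__()
        self.speaker = nn.Embedding(n_speakers, dim)
        self.emotion = nn.Embedding(n_emotions, dim)
        self.decoder = nn.GRU(dim * 3, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, text_h, speaker_id, emotion_id):
        T = text_h.size(1)
        cond = torch.cat([self.speaker(speaker_id), self.emotion(emotion_id)], -1)
        cond = cond.unsqueeze(1).expand(-1, T, -1)        # broadcast over time
        h, _ = self.decoder(torch.cat([text_h, cond], -1))
        return self.to_mel(h)                             # predicted mel-spectrogram

# Same text and speaker, two different emotions: only emotion_id changes.
text_h = torch.randn(1, 50, 128)  # stand-in encoded text
mel_sad = EmotionConditionedTTS()(text_h, torch.tensor([3]), torch.tensor([5]))
```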
"The effect of being close to real people's speech, so that the final AI voice can reflect the effects of different characters in different contexts, is the goal we have been pursuing. In the future, we hope to achieve it through text - A large speech joint training model extracts representations from texts in different contexts and improves the success rate of character identification; with a large multi-talker speech synthesis model, attributes such as emotion, style, timbre, and accent are decoupled and can be freely transferred; at the same time Generate matching background sounds based on text descriptions to enhance the sense of immersion when listening to audiobooks.”
In broader practice, we find that beyond text and images, voice interaction has an even wider range of applications. At home, people often issue voice commands to control appliances; on the road, they use in-car voice assistants to complete navigation, restaurant reservations, and more; and the meeting assistants used at high frequency in office scenarios are inseparable from intelligent voice solutions that improve content quality and production efficiency.
Here the Huoshan Voice team has also made more innovative attempts. For example, now that short videos have become a national pastime, and facing practical issues such as the casual recording habits of UGC video creators and uncontrollable audio quality, the Huoshan Voice intelligent subtitle solution automatically adds subtitles to video creations. It is not only compatible with commonly used languages and dialects such as Chinese, English, and Cantonese, but can also recognize songs.
On this point, W, product manager for Huoshan Voice's speech and audio understanding direction, added: "In video production, the traditional way of adding subtitles requires the creator to transcribe and proofread the video several times, then align the subtitles frame by frame against the start times; a 10-minute video often takes several hours of post-production to finish. In addition, the subtitling team must be proficient in multiple languages and familiar with producing subtitle files. Overall, the cost of video production is very high, which has long put it out of reach for individual creators in today's short video era, or for users who simply want to record their lives."
To lower the threshold of creation and let every creator easily produce high-quality video content and record a beautiful life, Huoshan Voice developed and launched its intelligent subtitle solution. It can efficiently recognize dialects and songs, and it performs well even in scenes where languages are mixed or speaking and singing are interleaved. In addition, by analyzing the audio characteristics and content domains of user-created material and optimizing the algorithms, it greatly improves recognition performance in complex scenes such as noisy environments and multi-person conversations. Mobile users in particular demand fast response times, that is, subtitles that are both quick and accurate; for this, Huoshan Voice made extensive engineering optimizations, so that a 1-minute video can be processed in just 2 to 3 seconds.
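The last step of any such pipeline, turning recognizer output into a subtitle file, is easy to illustrate. The sketch below assumes segments with start and end times in seconds, hard-coded here in place of real ASR output, and renders them in the common SRT format.

```python
def to_srt(segments):
    """Render (start_sec, end_sec, text) segments as an .srt file body.
    In a real pipeline the segments would come from an ASR service."""
    def stamp(t):
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"  # SRT uses a comma before ms

    lines = []
    for i, (start, end, text) in enumerate(segments, 1):
        lines += [str(i), f"{stamp(start)} --> {stamp(end)}", text, ""]
    return "\n".join(lines)

print(to_srt([(0.0, 2.4, "Hello everyone"), (2.4, 5.1, "welcome back")]))
```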
As we all know, for the same content, humans take in audio information far less efficiently than text. The key to converting speech into text that can be recorded and used is speech recognition. For example, Huoshan Voice's real-time subtitle solution, which turns a thousand spoken words into text worth a thousand words, uses the AI chain of speech recognition plus speech translation to make cross-country, cross-language communication smoother. By automatically generating meeting records and minutes, it can greatly improve participants' work efficiency and significantly reduce the workload of note-taking during meetings and tidying up afterwards. Foreseeably, as the technology develops rapidly, AI voice will add new channels of information output for human-computer interaction and improve the efficiency of information acquisition.
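At its simplest, the "speech recognition plus speech translation" chain is a composition of two services. In the sketch below both calls are dummy stubs with invented return values, since the article does not describe the real endpoints; the point is only the shape of the pipeline.

```python
def recognize(audio_chunk: bytes) -> str:
    """Stand-in for a streaming ASR call; a real deployment would call a
    speech-to-text service here."""
    return "ni hao, huan ying"  # dummy transcript

def translate(text: str, target: str = "en") -> str:
    """Stand-in for a machine-translation call."""
    return "Hello, welcome"     # dummy translation

def live_caption(audio_stream):
    # The ASR -> MT chain behind cross-language real-time subtitles.
    for chunk in audio_stream:
        yield translate(recognize(chunk))

for caption in live_caption([b"..."]):
    print(caption)
```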
Facing the same quality-and-efficiency question raised by AIGC, Y, product manager for Huoshan Voice's voice interaction products, believes AIGC is indeed expected to land in assistive scenarios for intelligent voice interaction, providing auxiliary capabilities that improve production efficiency for customer service functions such as conversation summarization, phrasing suggestions, emotional comfort, and work order summaries. For example, when a human-machine conversation is escalated to a human agent, a summary of the conversation so far can be generated automatically, helping the agent understand the user's request faster and avoiding the jarring interruption of reading back through the chat history. And during the conversation with the user, AIGC capabilities can interpret what the user says and generate candidate answers for the agent's reference, improving the efficiency of the dialogue.
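A handoff summary of this kind boils down to prompting a text generation model with the transcript. The sketch below uses a placeholder `llm` callable rather than any specific API, so the backend is an explicit assumption.

```python
def summarize_for_agent(turns, llm=lambda prompt: "(summary placeholder)"):
    """Build a handoff summary when a bot conversation escalates to a
    human agent. `llm` is a stand-in callable; plug in any text
    generation backend."""
    transcript = "\n".join(f"{who}: {text}" for who, text in turns)
    prompt = (
        "Summarize the customer's request and current status in two "
        "sentences for the human agent taking over:\n" + transcript
    )
    return llm(prompt)

turns = [("user", "My order never arrived"),
         ("bot", "Let me check the tracking number for you...")]
print(summarize_for_agent(turns))
```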
"In addition, it can also play a role in handling abnormal situations. For example, when users are irritable, angry, etc., AICG may automatically generate soothing words for customer service reference to improve service satisfaction. In the future, with multi-modal As technology and AIGC technology continue to mature, perhaps virtual digital humans can replace part of the labor force and directly serve customers in a human-machine symbiosis, significantly reducing labor costs and improving service efficiency." But he also made it clear that today's AIGC still has It is unable to truly produce content independently and is still at the stage of assisting humans to improve content production efficiency.
Whether it is the astonishing answers given by ChatGPT or the moving voices performed by AI in Tomato Novel, even Musk has marveled that we are not far from dangerously strong artificial intelligence. All of this seems to signal that the era of AIGC is coming.
However, Stephen, who has worked on the front line of AI algorithms for many years, offers a more sober judgment. He pointed out: "The technology behind AIGC may move toward multi-modal fusion in the future, not just single-modal generation tasks, just as humans do not conceive new content from a single form of knowledge when they create. For example, in generating interactive digital humans, faces, expressions, postures, and actions are currently predicted separately; in the future, one generative model may predict all of these features together, improving the coordination among them and reducing the workload of separate recordings. In addition, drawing on representations obtained from multi-modal understanding tasks, the system could read the expressions, tone, and body movements of the user it is talking to and give corresponding feedback in the generated image and sound."
Beyond predictions of technological development, one point that cannot be ignored is that AIGC still faces huge challenges in cost, copyright, and practicality. He believes the cost of AIGC remains high: most obviously, high-quality text, image, and video generation all consume large amounts of hardware resources in both the training and inference stages, which makes it difficult for universities and research institutes to participate and is not conducive to the development of the field.
"In addition, in terms of copyright protection, some of the currently generated content may be used to carry out illegal activities, so it is becoming more and more important to add copyright protection, such as image and audio watermarks, to the content, but after adding During the process, you must also consider not to cause watermark failure due to post-processing methods such as cutting and mixing."
In 2022, although the results of image and video generation improved markedly, a large amount of manual screening is still needed before the content can actually be put to use. And generating context-coherent comics and videos from long chapters of text, while ensuring scene continuity and reflecting how the characters change, still involves many unsolved technical problems. Avoiding so much manual patching that "intelligence" degenerates into the merely "artificial" remains a challenge, so there is still much room to improve practicality.
Perhaps the reason AIGC, as a new method of content production, has attracted so much attention is that it fully illustrates every industry's hunger for content. For Internet platforms especially, the question of how to efficiently understand, create, interact with, and distribute content has indeed brought both opportunities and challenges to today's AI technology.