Table of Contents
Emotional, expressive delivery has become the main challenge for AI voice
Improving content quality and production efficiency is the core value of AIGC
Cost, copyright, and practicality are still bottlenecks in the development of AIGC

AI creations are stunning, but many challenges still need to be overcome

Apr 11, 2023, 1:43 PM
Tags: ChatGPT, OpenAI


In August 2022, a digital painting titled "Space Opera" won first place in an art competition and sparked huge controversy, one in a string of incidents that pushed AIGC (AI-Generated Content) into the public eye. On November 30 of the same year, OpenAI released the chatbot ChatGPT free to the public, arousing widespread interest in AIGC. Users peppered it with all kinds of questions: rewriting code, explaining concepts, asking for life advice. Its wit and erudition left many people impressed and refreshed.

ChatGPT attracted such widespread attention in part because OpenAI had already released three generations of GPT models, each with roughly ten or even a hundred times more parameters than the last. The GPT-3.5 generation also uses RLHF (Reinforcement Learning from Human Feedback), which helps it better understand the intent behind human language. When chatting, writing articles, answering inquiries, or reviewing code, it responds more like a person who has given the question serious thought.

Commenting on the buzz, Stephen, a speech and audio synthesis algorithm researcher at Huoshan Voice, said: "AIGC has become so popular recently largely because the quality of AI-generated content has improved step by step. As a production tool, AI delivers higher efficiency. AIGC spans many directions, including text generation, audio generation, image generation, and video generation, and this will in turn stimulate the rapid development of the underlying artificial intelligence technology and gradually demonstrate great commercial value."

Emotional, expressive delivery has become the main challenge for AI voice

We often marvel at the "imagination" of AI painting, and AI question answering in the style of ChatGPT impresses with its erudition and readable answers. AI voice, by contrast, is tested on whether it can understand content the way a real person does and express it with a timbre that matches the character and a tone that fits the situation. This is exactly what the collaboration between Huoshan Voice (the intelligent speech and audio team of ByteDance AI Lab) and Tomato Novel aims at: voices generated by AI algorithms let you listen to any novel as an audiobook, and they sound "smart". With differentiated timbres and appropriate intonation, the synthetic narrator can turn into a "drama queen", performing joy, anger, sorrow, and happiness as it reads.

It is understood that for AI to be expressive, the first requirement is that nothing in the output is read incorrectly, which calls for a text analysis model. "For Tomato Novel, our text analysis front end uses BERT, a Transformer-architecture model widely used in NLP. We mainly rely on a text normalization (TN) model and a multi-task front-end model that mix neural networks with rules, combined with long-term manual rule correction, to continuously improve the sentence-level accuracy of the front end, and we reduce compute requirements through techniques such as distillation and quantization."
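
As a rough illustration of the "rules plus neural network" front end described above, the sketch below normalizes a few patterns with regular-expression rules and hands anything still ambiguous to a placeholder neural component. It is a toy example in English for readability; the rules and function names are illustrative assumptions, not Huoshan Voice's implementation.

```python
# Toy text-normalization (TN) front end: rules first, then a (hypothetical)
# neural model for tokens the rules cannot resolve unambiguously.
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def spell_number(n: int) -> str:
    """Spell out integers 0-99 (enough for this toy example)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

RULES = [
    # "85%" -> "eighty-five percent"
    (re.compile(r"\b(\d{1,2})%"), lambda m: spell_number(int(m.group(1))) + " percent"),
    # "$7" -> "seven dollars"
    (re.compile(r"\$(\d{1,2})\b"), lambda m: spell_number(int(m.group(1))) + " dollars"),
]

def neural_disambiguate(token: str, context: str) -> str:
    """Placeholder for the neural TN model that would decide how ambiguous tokens
    (e.g. '1984' as a year vs. a cardinal) should be read. Hypothetical stub."""
    return token

def normalize(text: str) -> str:
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    # Anything still containing digits is handed to the neural component.
    return " ".join(neural_disambiguate(tok, text) if any(c.isdigit() for c in tok) else tok
                    for tok in text.split())

print(normalize("Sales rose 85% and the book cost $7."))
# -> "Sales rose eighty-five percent and the book cost seven dollars."
```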

In addition, to make the voice sound better, the team added more functional modules on top of the regular TTS pipeline for role attribution and emotion control. For example, the BERT architecture is also used in role attribution to model two tasks, dialogue detection and reference disambiguation, and a similar structure is used for emotion prediction. "Novels usually contain multi-person conversations, and each speaker goes through a range of emotions. If timbre and emotion can be decoupled, the expressiveness of synthesized speech can be controlled much better. Being able to flexibly combine different timbres with different emotions is very important."
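
Below is a minimal sketch of how a BERT-style model could score which character speaks a given quote. The checkpoint, prompt format, and label set are illustrative assumptions; the team's actual models and training data are not public.

```python
# Minimal sketch of quote attribution: score (quote, candidate-speaker-in-context)
# pairs with a sequence-classification head on top of a BERT encoder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "bert-base-chinese"  # stand-in; a real system would use a fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

context = "林远推开门，笑着说：“你可算来了。”陈默放下行李，没有接话。"
quote = "你可算来了。"
candidates = ["林远", "陈默"]

scores = []
for name in candidates:
    # Pair the quote with "context + candidate" so the model can judge whether
    # this candidate is the speaker of the quote.
    enc = tokenizer(quote, context + "候选说话人：" + name,
                    return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**enc).logits
    scores.append(logits.softmax(-1)[0, 1].item())  # P(candidate is the speaker)

print(dict(zip(candidates, scores)))
# With a fine-tuned checkpoint, the true speaker ("林远") should score highest.
# A head sharing the same encoder can likewise classify the quote's emotion.
```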

Importantly, to enable the AI to understand novels of all kinds, Huoshan Voice also pioneered an "AI text understanding" model, a multi-task AI system for long-text understanding. It automatically identifies which character speaks each line of dialogue, recognizes the emotion the dialogue is meant to express, and predicts reasonable pauses between sentences, which greatly improves the production efficiency of high-quality AI audiobooks and effectively breaks through the bottleneck of manual annotation.

[Figure: the "AI text understanding" model]

Furthermore, on top of clear pronunciation, coherent rhythm, and natural rises and falls in intonation, the Huoshan Voice team developed a semi-supervised, end-to-end style-control acoustic model so that the voice can follow Plutchik's Wheel of Emotions and display emotional colors such as happiness, sadness, surprise, and fear. Using emotion transfer, it gives originally emotionless voices a multi-emotion synthesis capability and better "conveys feeling through sound". The team also carefully models and reproduces the "paralanguage" phenomena common in human speech, so that the accents and pauses, rhetorical questions, laughter and crying, sighs, shouts, and other sounds common in audiobooks bring the text vividly to life.
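
To make the decoupling of timbre and emotion concrete, here is a schematic sketch in which the synthesizer is conditioned on two independent embeddings, so a voice recorded only in a neutral style can borrow an emotion learned from other recordings. It is a stand-in for the idea of emotion transfer, not the actual acoustic model.

```python
# Toy illustration of decoupled timbre/emotion conditioning for emotion transfer.
import torch
import torch.nn as nn

class ConditioningModule(nn.Module):
    def __init__(self, num_speakers, num_emotions, dim=256):
        super().__init__()
        self.speaker_table = nn.Embedding(num_speakers, dim)   # timbre identity
        self.emotion_table = nn.Embedding(num_emotions, dim)   # happy, sad, surprised, ...

    def forward(self, speaker_id, emotion_id):
        # Concatenate the two factors; an acoustic decoder would consume this
        # vector together with the text/phoneme encoding (omitted here).
        return torch.cat([self.speaker_table(speaker_id),
                          self.emotion_table(emotion_id)], dim=-1)

cond = ConditioningModule(num_speakers=10, num_emotions=8)
speaker = torch.tensor([3])            # a narrator recorded only in a neutral style
happy, sad = torch.tensor([1]), torch.tensor([4])

# Same timbre, different emotions: only the emotion half of the vector changes,
# which is what makes "any timbre x any emotion" combinations possible.
print(cond(speaker, happy).shape, cond(speaker, sad).shape)  # torch.Size([1, 512]) each
```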

"The effect of being close to real people's speech, so that the final AI voice can reflect the effects of different characters in different contexts, is the goal we have been pursuing. In the future, we hope to achieve it through text - A large speech joint training model extracts representations from texts in different contexts and improves the success rate of character identification; with a large multi-talker speech synthesis model, attributes such as emotion, style, timbre, and accent are decoupled and can be freely transferred; at the same time Generate matching background sounds based on text descriptions to enhance the sense of immersion when listening to audiobooks.”

Improving content quality and production efficiency is the core value of AIGC

In broader practice, we found that beyond text and images, voice interaction is used in an even wider range of applications. At home, people issue voice commands to control appliances; on the road, they use in-car voice assistants for navigation and restaurant reservations; and the meeting assistants used frequently in office scenarios all depend on intelligent voice solutions to improve content quality and production efficiency.

The Huoshan Voice team has made a number of innovative attempts in this direction. For example, now that short videos have become a national pastime, UGC creators record casually and audio quality is hard to control. The Huoshan Voice intelligent subtitle solution automatically adds subtitles to video creations: it not only recognizes commonly used languages and dialects such as Chinese, English, and Cantonese, but can also recognize songs.

W, product manager for Huoshan Voice's speech and audio understanding direction, added: "In video production, adding subtitles the traditional way requires the creator to transcribe and proofread the video several times and then align the subtitles frame by frame to their start times. A 10-minute video often takes several hours of post-production. In addition, a subtitle team must be proficient in multiple languages and familiar with producing subtitle files. Overall, the cost of video production is very high, which has long put it out of reach for individual creators and ordinary users who simply want to record their lives in today's short-video era."

To lower the threshold of creation and allow every creator to easily produce high-quality video content and record daily life, Huoshan Voice developed and launched its intelligent subtitle solution. It not only recognizes dialects and songs efficiently, but also performs well on scenes where languages are mixed or speech and singing are interleaved. In addition, by analyzing the audio characteristics and domains of user-created content and optimizing the algorithms, it substantially improves recognition in complex scenes such as noisy environments and multi-speaker conversations. Mobile users in particular expect fast responses, that is, subtitles that are both quick and accurate, so Huoshan Voice made extensive engineering optimizations: a one-minute video can be processed in just 2-3 seconds.
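
For a sense of what "automatically adding subtitles" produces, the sketch below formats timestamped recognition segments into a standard .srt file. The segments are made up; in practice they would come from the speech recognition service.

```python
# Minimal sketch of the last step of an auto-subtitling pipeline:
# turning timestamped ASR segments into an .srt subtitle file.

def fmt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp 'HH:MM:SS,mmm'."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """segments: iterable of (start_sec, end_sec, text)."""
    lines = []
    for i, (start, end, text) in enumerate(segments, 1):
        lines += [str(i), f"{fmt_time(start)} --> {fmt_time(end)}", text, ""]
    return "\n".join(lines)

segments = [
    (0.0, 2.4, "大家好，欢迎来到我的频道。"),
    (2.4, 5.1, "今天我们聊聊智能字幕。"),
]
print(to_srt(segments))
```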

As is well known, when faced with the same content, humans absorb audio information far more slowly than text. The key to turning speech into text that can be recorded and reused is speech recognition. For example, Huoshan Voice's real-time subtitle solution, where "a thousand words become text, and a single word is worth a thousand", uses an AI chain of speech recognition plus speech translation to make cross-country, cross-language communication smoother. By automatically generating meeting records and minutes, it can greatly improve participants' work efficiency and significantly reduce the workload of taking notes during meetings and tidying them up afterwards. It is foreseeable that, as the technology develops rapidly, AI voice will add information channels to human-computer interaction and improve the efficiency of information acquisition.
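
A rough sketch of the "speech recognition plus speech translation" chain follows, using two off-the-shelf open-source models as stand-ins for the proprietary ones; the model choices are assumptions for illustration only.

```python
# Speech recognition followed by machine translation for cross-language subtitles.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
translate = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")

def speech_to_english_subtitle(audio_path: str) -> str:
    transcript = asr(audio_path)["text"]                   # source-language text
    return translate(transcript)[0]["translation_text"]    # English subtitle text

# Hypothetical usage (requires an actual audio file):
# print(speech_to_english_subtitle("meeting_clip.wav"))
```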

Facing the same question of how AIGC improves quality and efficiency, Y, product manager for Huoshan Voice's voice interaction products, believes AIGC can indeed be deployed in assistant scenarios for intelligent voice interaction, providing customer-service functions such as conversation summarization, suggested responses, emotional reassurance, and work-order summaries to improve production efficiency. For example, when a conversation is escalated from the bot to a human agent, a summary of the human-machine conversation can be generated automatically to help the agent understand the user's request faster, avoiding an abrupt handoff where the agent has to dig through the chat history. During the conversation with the user, AIGC can interpret what the user says and generate candidate answers for the agent's reference, improving the efficiency of the customer-service dialogue.
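
As a hedged example of the handoff summary described above, the sketch below asks a general-purpose LLM to condense the chat history for the human agent. The API, model name, and prompt are illustrative choices, not the actual customer-service system.

```python
# Generate a conversation summary when a bot-to-human handoff is triggered.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_for_agent(chat_history: list[dict]) -> str:
    dialogue = "\n".join(f"{m['role']}: {m['content']}" for m in chat_history)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Summarize the customer's request and current status "
                        "in two sentences for a human agent."},
            {"role": "user", "content": dialogue},
        ],
    )
    return resp.choices[0].message.content

history = [
    {"role": "user", "content": "我的订单三天了还没发货，能帮我查一下吗？"},
    {"role": "assistant", "content": "请提供订单号，我帮您查询。"},
    {"role": "user", "content": "订单号是 20230411，再不发货我就要退款了。"},
]
# print(summarize_for_agent(history))  # requires a valid API key
```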

"In addition, it can also play a role in handling abnormal situations. For example, when users are irritable, angry, etc., AICG may automatically generate soothing words for customer service reference to improve service satisfaction. In the future, with multi-modal As technology and AIGC technology continue to mature, perhaps virtual digital humans can replace part of the labor force and directly serve customers in a human-machine symbiosis, significantly reducing labor costs and improving service efficiency." But he also made it clear that today's AIGC still has It is unable to truly produce content independently and is still at the stage of assisting humans to improve content production efficiency.

Cost, copyright, and practicality are still bottlenecks in the development of AIGC

Whether it is the impressive answers given by ChatGPT or the moving voices AI performs in Tomato Novel audiobooks, even Musk has marveled that we are not far from dangerously powerful artificial intelligence. All of this seems to indicate that the era of AIGC is coming.

However, Stephen, who has worked on the front line of AI algorithms for many years, offers a more sober judgment. He pointed out: "The technology behind AIGC may move toward multi-modal fusion in the future rather than single-modal generation tasks, just as humans do not create content from only a single form of knowledge. For example, in generating interactive digital humans, faces, expressions, postures, and movements are currently predicted separately. In the future, one generative model may predict all of these features jointly, improving the coordination between them and reducing the workload of recording them separately. In addition, using representations obtained from multi-modal understanding tasks, the system could read the expression, tone, and body language of the user it is talking to and give corresponding feedback in the generated image and sound."

Beyond predictions about the technology, one point that cannot be ignored is that AIGC still faces huge challenges in cost, copyright, and practicality. He believes the cost of AIGC remains high: high-quality text, image, and video generation all consume large amounts of hardware resources in both training and inference, which makes it difficult for universities and research institutions to participate and is not conducive to the development of the industry.

"In addition, in terms of copyright protection, some of the currently generated content may be used to carry out illegal activities, so it is becoming more and more important to add copyright protection, such as image and audio watermarks, to the content, but after adding During the process, you must also consider not to cause watermark failure due to post-processing methods such as cutting and mixing."

In 2022, although image and video generation improved significantly, a large amount of manual screening is still needed before the generated content can actually be used. And generating context-aware comics and videos from long chapters of text, while keeping scenes continuous and reflecting how characters change, still involves many unsolved technical problems. Preventing "artificial intelligence" from degenerating into intelligence propped up by manual labor remains a challenge, so there is still plenty of room to improve practicality.

The attention AIGC has attracted as a new method of content production illustrates how hungry every industry, and Internet platforms in particular, is for content. How to understand, create, interact with, and distribute content efficiently has indeed brought both opportunities and challenges to today's AI technology.
