Table of Contents
1. Multilingual ASR model
2. Rich standardized output objects
3. Large-scale ASR for everyone
4. Human-machine collaboration
5. Responsible ASR

In the next ten years, AI speech recognition will develop in these five directions

Apr 11, 2023, 08:10 PM

Author | Miguel Jetté

Compiled by | bluemin

Editor | Chen Caixian

Over the past two years, Automatic Speech Recognition (ASR) has made important progress in commercial use. One indicator is that multiple enterprise-level ASR models built entirely on neural networks have been successfully launched, such as those behind Alexa, Rev, AssemblyAI, and ASAPP. In 2016, Microsoft Research published an article announcing that their model had reached human-level performance (as measured by word error rate) on the 25-year-old "Switchboard" dataset. ASR accuracy continues to improve, reaching human-level performance on more and more datasets and use cases.


Image source: Awni Hannun's blog post "Speech Recognition is not Solved"

With ASR recognition accuracy greatly improved, application scenarios are becoming more and more widespread. We believe commercial ASR has not yet reached its peak, and research and market applications in this field remain to be explored. We predict that AI speech research and commercial systems will focus on the following five areas in the next ten years:

1. Multilingual ASR model

“Over the next decade, we will deploy truly multilingual models in production, enabling developers to build applications that can understand anyone, in any language, truly unleashing the power of speech recognition to the world.”


Image source: Alexis Conneau et al., "Unsupervised Cross-lingual Representation Learning for Speech Recognition" (2020)

Today's commercial ASR models are trained mainly on English datasets and therefore achieve higher accuracy on English input. English has long attracted the greatest interest in academia and industry because of data availability and market demand. Although recognition accuracy for popular commercial languages such as French, Spanish, Portuguese, and German is also reasonable, there is clearly a long tail of languages with limited training data and relatively low ASR output quality.

In addition, most commercial systems support only a single language, which makes them unsuitable for the multilingual scenarios found in many societies. Multilingualism can take the form of back-to-back languages, such as media programming in bilingual countries. Amazon has made great strides on this problem, recently launching a product that integrates language identification (LID) and ASR. In contrast, translanguaging (also known as code-switching) is a practice in which an individual combines words and grammar from two languages in the same sentence; this is an area where academia continues to make interesting progress.
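One common way to realize the back-to-back scenario (a sketch of the general pattern only, not Amazon's actual implementation, which is not public) is to run LID first and route each audio segment to a per-language recognizer:

```python
# Sketch of LID + ASR routing for back-to-back multilingual audio.
# The LID model and per-language recognizers below are toy placeholders;
# a real system would plug in trained models here.

def identify_language(segment: str) -> str:
    """Toy LID stand-in: pretends French files are tagged in their name."""
    return "fr" if "fr" in segment else "en"

def transcribe_en(segment: str) -> str:
    return f"[english transcript of {segment}]"   # placeholder recognizer

def transcribe_fr(segment: str) -> str:
    return f"[french transcript of {segment}]"    # placeholder recognizer

ASR_MODELS = {"en": transcribe_en, "fr": transcribe_fr}

def transcribe_multilingual(segments: list[str]) -> list[tuple[str, str]]:
    """Run LID on each segment, then route it to that language's recognizer."""
    out = []
    for seg in segments:
        lang = identify_language(seg)
        out.append((lang, ASR_MODELS[lang](seg)))
    return out

print(transcribe_multilingual(["news_en_01.wav", "news_fr_02.wav"]))
```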

Just as the field of natural language processing has adopted a multilingual approach, we will see ASR follow suit in the next decade. As we learn how to leverage emerging end-to-end technologies, we will train large-scale multilingual models that transfer learning across multiple languages. Meta's XLS-R is a good example: in one demo, users could speak any of 21 languages without specifying which, and the model would translate the speech into English. By understanding and exploiting similarities between languages, these smarter ASR systems will deliver high-quality results for low-resource-language and mixed-language use cases and will enable commercial-grade applications.
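As an illustration of how accessible such models already are, the sketch below loads Meta's publicly released 21-languages-to-English XLS-R checkpoint through the HuggingFace transformers ASR pipeline; exact usage may vary across library versions, and the audio path is a placeholder:

```python
# Minimal sketch: speech in any of 21 languages -> English text, using
# Meta's public XLS-R speech-translation checkpoint on the HuggingFace Hub.
# Requires: pip install transformers torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-xls-r-1b-21-to-en",
)

# "speech.wav" is a placeholder path to a 16 kHz mono recording in any
# of the 21 supported source languages.
result = asr("speech.wav")
print(result["text"])  # English translation of the spoken input
```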

2. Rich standardized output objects

“In the next ten years, we believe commercial ASR systems will output richer transcription objects containing much more than simple words. We also anticipate that this richer output will be endorsed by standards bodies such as the W3C, so that all APIs return similarly structured output. This will further unlock the potential of speech applications for everyone in the world.”

Although the National Institute of Standards and Technology (NIST) has a long tradition of exploring "rich transcription," it has only scratched the surface of incorporating it into a standardized, extensible format for ASR output. The concept of rich transcription initially covered capitalization, punctuation, and diarization, and has since expanded somewhat to speaker roles and a range of non-lexical speech events. Anticipated innovations include transcribing overlapping speech from different speakers, varying emotions and other paralinguistic features, a range of non-linguistic and even non-human sound scenes and events, and text-based or linguistic diversity. Tanaka et al. describe a scenario in which a user may wish to choose among transcription options of varying richness; clearly, the amount and nature of this additional information should be specifiable, depending on the downstream application.

Traditional ASR systems generate a lattice of multiple hypotheses while recognizing spoken words, and these lattices have proven highly valuable in human-assisted transcription, spoken dialogue systems, and information retrieval. Including n-best information in a rich output format will attract more users to ASR systems and improve the user experience. While no standard currently exists for structuring or storing the additional information that is, or could be, generated during speech decoding, CallMiner's Open Voice Transcription Standard (OVTS) is a solid step in this direction, making it easy for enterprises to explore and choose among multiple ASR vendors.

We predict that in the future, ASR systems will produce richer output in standard formats, supporting more powerful downstream applications. For example, an ASR system might output the full lattice of hypotheses, and an application could use this additional data for intelligent auto-completion when a human edits the transcript. Similarly, ASR transcriptions that include additional metadata, such as detected regional dialects, accents, ambient noise, or emotion, can enable more powerful search applications.
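Since no such standard exists yet, the following is a purely hypothetical illustration of what a rich, standardized transcription object might contain; every field name here is invented for illustration, not taken from the W3C, OVTS, or any published specification:

```python
# Hypothetical sketch of a rich transcription object; field names are
# illustrative only, not from any published standard.
from dataclasses import dataclass, field

@dataclass
class WordHypothesis:
    text: str
    confidence: float          # 0.0-1.0 decoder confidence

@dataclass
class RichWord:
    best: str                  # 1-best word
    start: float               # start time in seconds
    end: float                 # end time in seconds
    alternatives: list[WordHypothesis] = field(default_factory=list)

@dataclass
class RichTranscript:
    words: list[RichWord]
    speaker: str | None = None         # diarization output
    emotion: str | None = None         # paralinguistic metadata
    dialect: str | None = None         # detected regional dialect
    noise_events: list[str] = field(default_factory=list)  # e.g. "door slam"

# Example: "$5" with an n-best alternative preserved for downstream editors
segment = RichTranscript(
    words=[RichWord("$5", 1.20, 1.85,
                    alternatives=[WordHypothesis("five dollars", 0.31)])],
    speaker="spk_1",
    emotion="neutral",
)
```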

3. Large-scale ASR for everyone

“In this decade, large-scale ASR (i.e., private, affordable, reliable, and fast) will become part of everyone's daily life. These systems will make videos searchable, index all the media content we engage with, and make every video accessible to hearing-impaired consumers around the world. ASR will be the key to making every piece of audio and video accessible and actionable.”


We probably all consume audio and video heavily: podcasts, social media streams, online videos, live group chats, Zoom meetings, and more. Yet very little of the underlying content is actually transcribed. Content transcription has already become one of the largest markets for ASR APIs and will grow exponentially over the next decade, especially as accuracy improves and prices fall. That said, ASR transcription is currently applied only to specific content (broadcast video, certain conferences and podcasts, etc.). As a result, many people cannot access this media content, and it is difficult to find relevant information after a broadcast or event.

In the future, this will change. As Matt Thompson predicted in 2010, ASR will at some point become cheap and ubiquitous enough that we will experience what he called the "Speakularity". We predict that nearly all audio and video content will be transcribed and made instantly accessible, storable, and searchable at scale. But ASR's development will not stop there; we also expect this content to become actionable. Every piece of audio or video we consume or engage with should offer additional context, such as automatically generated insights from a podcast or meeting, or automatic summaries of key moments in a video. We expect NLP systems to make such processing routine.
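As a toy illustration of "transcribe everything, then search it," the sketch below builds an inverted index from word-level timestamps back to media files; the transcripts are hard-coded stand-ins for real ASR output, and a production system would use a proper search engine:

```python
# Toy sketch: make transcribed media searchable with an inverted index.
# Transcripts here are stand-ins for real ASR output with timestamps.
from collections import defaultdict

# media_id -> [(word, start_seconds), ...] -- hypothetical ASR output
transcripts = {
    "podcast_042": [("speech", 12.4), ("recognition", 12.9), ("models", 13.5)],
    "meeting_007": [("quarterly", 60.2), ("recognition", 61.0)],
}

index = defaultdict(list)
for media_id, words in transcripts.items():
    for word, start in words:
        index[word.lower()].append((media_id, start))

def search(query: str) -> list[tuple[str, float]]:
    """Return (media_id, timestamp) hits so playback can jump to the word."""
    return index.get(query.lower(), [])

print(search("recognition"))
# [('podcast_042', 12.9), ('meeting_007', 61.0)]
```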

4. Human-machine collaboration

“By the end of this decade, ASR systems will evolve like living organisms, learning continuously with human help or through self-supervision. These systems will learn from diverse real-world sources, pick up new words and language variants in real time rather than asynchronously, self-debug, and automatically monitor different usages.”


As ASR becomes mainstream and covers an increasing number of use cases, human-machine collaboration will play a key role. The training of ASR models illustrates this well. Today, open-source datasets and pre-trained models lower the barrier to entry for ASR vendors. However, the training process is still fairly rudimentary: collect data, annotate data, train the model, evaluate results, improve the model. This process is slow and, in many cases, error-prone due to difficulty in tuning or insufficient data. Garnerin et al. observed that missing metadata and inconsistent representation across corpora make it difficult to guarantee equal accuracy in ASR performance, a problem that Reid and Walker also tried to address when developing their metadata standard.

In the future, humans will supervise ASR training efficiently through intelligent tooling and play an increasingly important role in accelerating machine learning. Human-in-the-loop approaches place human reviewers inside the machine learning feedback loop, allowing continuous review and adjustment of model results. This will make machine learning faster and more efficient, leading to higher-quality output. Earlier this year, we discussed how improvements in ASR allow Rev's human transcriptionists (called "Revvers") to post-edit ASR drafts, making them more productive. Revvers' transcriptions can feed directly back into improving the ASR model, forming a virtuous cycle.
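A schematic sketch of that virtuous cycle follows; the draft model, human correction, and fine-tuning step are all placeholder functions, not Rev's actual pipeline:

```python
# Schematic human-in-the-loop cycle: ASR drafts -> human post-editing ->
# corrected pairs feed the next training round. All components are
# placeholders, not Rev's actual pipeline.

def asr_draft(audio: str) -> str:
    return "recognize speech"        # stand-in for a model's 1-best output

def human_post_edit(audio: str, draft: str) -> str:
    return "wreck a nice beach"      # stand-in for a transcriber's correction

def fine_tune(pairs: list[tuple[str, str]]) -> None:
    print(f"fine-tuning on {len(pairs)} corrected (audio, text) pairs")

training_pairs = []
for audio in ["clip_001.wav", "clip_002.wav"]:
    draft = asr_draft(audio)
    corrected = human_post_edit(audio, draft)
    if corrected != draft:           # human feedback caught an error
        training_pairs.append((audio, corrected))

fine_tune(training_pairs)            # close the loop
```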

One area where human language experts remain integral to ASR is inverse text normalization (ITN), which converts recognized strings (like "five dollars") into their expected written form (like "$5"). Pusateri et al. proposed a hybrid approach using "hand-crafted grammars and statistical models," and Zhang et al. continued along these lines by constraining RNNs with hand-crafted FSTs.
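For intuition, here is a minimal rule-based ITN sketch for the dollar example above; production ITN systems use the weighted FST grammars and statistical models cited here, not a single regular expression:

```python
# Minimal rule-based ITN sketch: "five dollars" -> "$5".
# Real ITN uses weighted FST grammars plus statistical models
# (Pusateri et al.; Zhang et al.), not a regex over number words.
import re

NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}

def itn_dollars(spoken: str) -> str:
    """Rewrite '<number-word> dollars' to '$<digits>' in spoken-form text."""
    pattern = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\s+dollars\b")
    return pattern.sub(lambda m: "$" + NUMBER_WORDS[m.group(1)], spoken)

print(itn_dollars("it costs five dollars"))  # -> "it costs $5"
```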

5. Responsible ASR

“As with all AI systems, future ASR systems will adhere to stricter AI ethics principles: treating everyone equally, offering greater explainability, being accountable for their decisions, and respecting the privacy of users and their data.”


Future ASR systems will follow four principles of AI ethics: fairness, explainability, respect for privacy, and accountability.

Fairness: A fair ASR system recognizes speech regardless of the speaker's background, socioeconomic status, or other characteristics. Notably, building such a system requires identifying and reducing bias in our models and training data. Fortunately, governments, NGOs, and businesses are already working to create infrastructure for identifying and mitigating bias.
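One concrete way to surface such bias is to measure word error rate separately per speaker group. The sketch below uses the open-source jiwer package for WER; the groups and transcripts are fabricated placeholders, and a real audit would need representative labeled data:

```python
# Sketch: measure WER per speaker group to surface accuracy gaps.
# Requires: pip install jiwer. Groups and transcripts are fabricated
# placeholders for illustration only.
import jiwer

# (group, reference transcript, ASR hypothesis)
samples = [
    ("group_a", "turn the lights on", "turn the lights on"),
    ("group_a", "call my sister",     "call my sister"),
    ("group_b", "turn the lights on", "turn the light arm"),
    ("group_b", "call my sister",     "fall my mister"),
]

for group in sorted({g for g, _, _ in samples}):
    refs = [r for g, r, _ in samples if g == group]
    hyps = [h for g, _, h in samples if g == group]
    print(group, "WER:", round(jiwer.wer(refs, hyps), 2))
# A large WER gap between groups indicates bias worth investigating.
```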

Explainability: ASR systems will no longer be "black boxes": they will explain, as required, how data was collected and analyzed and how the model produced its output. This additional transparency requirement allows better human oversight of model training and performance. Like Gerlings et al., we view explainability from the perspective of a range of stakeholders (including researchers, developers, customers, and, in Rev's case, transcriptionists). Researchers may want to know why erroneous text was output in order to mitigate the problem, while transcriptionists may want evidence of why the ASR "thinks" what it does, to help them evaluate its validity, especially in noisy conditions where an ASR system may "hear" better than humans. Weitz et al. took important first steps toward end-user explainability in the context of audio keyword recognition. Laguarta and Subirana have incorporated clinician-guided explanations into a speech biomarker system for Alzheimer's disease detection.

Respect for privacy: A "voice" is considered "personal data" under various U.S. and international laws, and therefore the collection and processing of voice recordings are subject to strict personal privacy protections. At Rev, we already provide data security and access controls, and future ASR systems will go further in respecting the privacy of both user data and models. In many cases, this will likely mean pushing the ASR model to the edge (onto the device or into the browser). Voice privacy challenges are driving research in this area, and many jurisdictions, such as the European Union, have begun legislative efforts. The field of privacy-preserving machine learning promises to bring attention to this critical aspect of the technology so that it can be widely accepted and trusted by the public.

Accountability: We will monitor ASR systems to ensure they adhere to the first three principles. Doing so requires investing resources and infrastructure to design and develop the necessary monitoring systems and to act on their findings. Companies deploying ASR systems will be responsible for their use of the technology and will make deliberate efforts to adhere to ASR ethics principles. It is worth noting that humans, as the designers, maintainers, and consumers of ASR systems, will be responsible for implementing and enforcing these principles, which is yet another example of human-machine collaboration.

Reference links:

https://thegradient.pub/the-future-of-speech-recognition/

https://awni.github.io/speech-recognition/
