Author|Miguel Jetté
Compiler|bluemin
Editor|Chen Caixian
Over the past two years, Automatic Speech Recognition (ASR) has made important strides in commercial use. One indicator is the successful launch of several enterprise-grade ASR models built entirely on neural networks, such as Alexa, Rev, AssemblyAI, and ASAPP. In 2016, Microsoft Research published a paper announcing that its model had reached human-level performance (as measured by word error rate) on the 25-year-old Switchboard dataset. ASR accuracy continues to improve, reaching human parity on more datasets and use cases.
Image source: Awni Hannun's blog post "Speech Recognition is not Solved"
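Since word error rate (WER) is the metric behind claims like "human-level performance," a minimal sketch of how it is typically computed may help; this is a generic word-level edit-distance implementation, not any particular vendor's scoring tool:

```python
# Minimal sketch of word error rate (WER):
# WER = (substitutions + deletions + insertions) / number of reference words,
# computed here as a word-level edit distance.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```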
With ASR accuracy greatly improved, its application scenarios are multiplying. We believe commercial ASR has not yet peaked, and that much of the research and market opportunity in this field remains unexplored. We predict that AI speech research and commercial systems will focus on the following five areas over the next decade:
“Over the next decade, we will deploy truly multilingual models in production, enabling developers to build applications that understand anyone, in any language, truly unleashing the power of speech recognition for the world.”
Image source: "Unsupervised Cross-lingual Representation Learning for Speech Recognition," Alexis Conneau et al., 2020
Today's commercial ASR models are trained mainly on English datasets and therefore achieve higher accuracy on English input. English has enjoyed sustained interest in academia and industry because of data availability and market demand. While recognition accuracy for other commercially popular languages such as French, Spanish, Portuguese, and German is also reasonable, there is clearly a long tail of languages with limited training data and correspondingly lower ASR output quality.
In addition, most commercial systems are monolingual and cannot handle the multilingual scenarios common in many societies. Multilingualism can take the form of back-to-back languages, as in media programming in bilingual countries; Amazon has made good progress here, recently launching a product that integrates language identification (LID) with ASR. Code-switching (also known as translanguaging), by contrast, is when an individual combines words and grammar from two languages within the same sentence; this is an area where academia continues to make interesting progress.
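As a rough illustration of the LID-plus-ASR pattern described above (the functions here are hypothetical placeholders, not Amazon's or any vendor's actual API), a pipeline might first identify the language of an utterance and then route it to a language-specific recognizer:

```python
# Illustrative sketch of language identification (LID) followed by language-specific ASR.
# All functions are hypothetical placeholders.

def identify_language(audio_path: str) -> str:
    # Placeholder: a real system would run a language-identification model here.
    return "es"

def transcribe_with_model(audio_path: str, language: str) -> str:
    # Placeholder: a real system would dispatch to a recognizer trained for `language`.
    return f"<transcript of {audio_path} using the {language} model>"

def transcribe(audio_path: str) -> str:
    language = identify_language(audio_path)             # step 1: detect the spoken language
    return transcribe_with_model(audio_path, language)   # step 2: route to that language's ASR

print(transcribe("news_clip.wav"))
```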
Just as natural language processing has gone multilingual, ASR will follow suit in the next decade. As we learn to leverage emerging end-to-end architectures, we will train large-scale multilingual models that transfer learning across languages. Meta's XLS-R is a good example: in one demo, users could speak any of 21 languages without specifying which, and the model would translate the speech into English. By understanding and exploiting similarities between languages, these smarter ASR systems will deliver high-quality recognition for low-resource and mixed-language use cases and enable commercial-grade applications.
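A minimal sketch of what using such a multilingual checkpoint can look like today, assuming the open-source transformers library and the publicly released XLS-R 21-to-English speech-translation checkpoint (the model name below is an assumption based on Meta's published release, not a recommendation):

```python
# Minimal sketch: one multilingual model, no language flag required.
# Assumes the `transformers` library and an XLS-R 21-to-English checkpoint
# from the Hugging Face Hub; swap in whichever multilingual model you have access to.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-xls-r-1b-21-to-en",  # assumed checkpoint name
)

# The same call handles speech in any of the 21 supported languages.
result = asr("spanish_or_french_or_german_clip.wav")
print(result["text"])  # English output, regardless of the input language
```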
“In the next ten years, we believe commercial ASR systems will output richer transcription objects containing far more than simple words. Furthermore, we anticipate that this richer output will be endorsed by standards bodies such as the W3C, so that all APIs return similarly structured output. This will further unlock the potential of speech applications for everyone in the world.”
Although the National Institute of Standards and Technology (NIST) has a long tradition of exploring "rich transcription," it has only scratched the surface of incorporating it into a standardized, extensible format for ASR output. The concept of rich transcription initially covered capitalization, punctuation, and diarization, and has to some extent expanded to speaker roles and a range of non-verbal speech events. Anticipated innovations include transcribing overlapping speech from different speakers, capturing emotions and other paralinguistic features, and annotating non-linguistic and even non-human sounds, scenes, and events, as well as textual or linguistic variety in the transcript itself. Tanaka et al. describe a scenario in which a user may wish to choose among transcription options of varying richness; clearly, the amount and nature of this additional information should be specifiable, depending on the downstream application.
Traditional ASR systems generate a lattice of multiple hypotheses while recognizing speech, and these hypotheses have proven highly valuable in human-assisted transcription, spoken dialogue systems, and information retrieval. Including n-best information in a rich output format would encourage more users to adopt ASR systems and improve the user experience. While no standard yet exists for structuring or storing the additional information that is, or could be, generated during speech decoding, CallMiner's Open Voice Transcription Standard (OVTS) is a solid step in this direction, making it easier for enterprises to explore and choose among multiple ASR vendors.
We predict that future ASR systems will produce richer output in standard formats, supporting more powerful downstream applications. For example, an ASR system might output the full lattice of possible hypotheses, and an application could use this additional data to make intelligent, automated suggestions while a transcript is being edited. Similarly, ASR transcriptions that carry additional metadata, such as detected regional dialects, accents, ambient noise, or emotion, can enable more powerful search applications.
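To make the idea concrete, a richer transcription object in a standardized, JSON-like format might look something like the following sketch (the field names are illustrative and are not taken from OVTS or any W3C specification):

```python
# Illustrative sketch of a "rich transcription" result object.
# Field names are hypothetical; they do not follow OVTS or any W3C standard.
import json

rich_transcript = {
    "segments": [
        {
            "start": 12.4,                 # seconds into the recording
            "end": 14.1,
            "speaker": "spk_1",            # diarization output
            "text": "let's move to the budget",
            "n_best": [                    # alternative hypotheses with confidences
                {"text": "let's move to the budget", "confidence": 0.91},
                {"text": "let's move the budget", "confidence": 0.06},
            ],
            "metadata": {
                "emotion": "neutral",      # paralinguistic annotation
                "accent": "en-IE",         # detected regional accent
                "background_noise": "low",
            },
        }
    ]
}

print(json.dumps(rich_transcript, indent=2))
```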
“In this decade, ASR at scale (that is, private, affordable, reliable, and fast) will become part of everyone's daily life. These systems will let us search videos, index all the media content we engage with, and make every video accessible to hearing-impaired consumers around the world. ASR will be the key to making every piece of audio and video accessible and actionable.”
We probably all consume audio and video heavily: podcasts, social media streams, online videos, live group chats, Zoom meetings, and more. Yet very little of this content is actually transcribed. Content transcription is already one of the largest markets for ASR APIs and will grow exponentially over the next decade, especially as accuracy improves and prices fall. That said, ASR transcription is currently used only for specific applications (broadcast video, certain conferences and podcasts, and so on). As a result, many people cannot access this media content at all, and others find it hard to locate relevant information after a broadcast or event.
In the future, this will change. As Matt Thompson predicted in 2010, ASR will at some point become cheap and widespread enough that we experience what he called the "Speakularity." We predict that nearly all audio and video content will eventually be transcribed and made instantly accessible, storable, and searchable at scale. But ASR will not stop there; we also expect this content to become actionable. Every piece of audio or video we consume or engage with should provide additional context, such as automatically generated insights from a podcast or meeting, or automatic summaries of a video's key moments, and we expect NLP systems to perform this processing routinely.
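As a rough sketch of the transcribe-index-search loop described above (every component here is a hypothetical placeholder, not a specific product or API):

```python
# Hypothetical sketch of a transcribe -> index -> search pipeline for media content.

class MediaIndex:
    def __init__(self):
        self._docs = {}  # media_id -> transcript text

    def add(self, media_id: str, transcript: str) -> None:
        self._docs[media_id] = transcript

    def search(self, query: str) -> list[str]:
        # Naive keyword match; a real system would use a proper search engine.
        return [mid for mid, text in self._docs.items() if query.lower() in text.lower()]

def transcribe(media_path: str) -> str:
    # Placeholder for an ASR call (e.g., a commercial ASR API).
    return f"<transcript of {media_path}>"

index = MediaIndex()
for path in ["team_meeting.mp4", "podcast_ep12.mp3"]:
    index.add(path, transcribe(path))

print(index.search("transcript"))  # every indexed item becomes searchable
```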
“By the end of this decade, ASR systems will be continuously evolving, like living organisms, learning either with human help or through self-supervision. These systems will learn from diverse real-world sources, pick up new words and language variants in real time rather than asynchronously, debug themselves, and automatically monitor how they are being used.”
As ASR goes mainstream and covers an ever-growing range of use cases, human-machine collaboration will play a key role. The way ASR models are trained illustrates this well. Today, open-source datasets and pre-trained models lower the barrier to entry for ASR vendors, but the training process remains fairly basic: collect data, annotate it, train a model, evaluate the results, improve the model. It is a slow process and, in many cases, error-prone because of tuning difficulties or insufficient data. Garnerin et al. observed that missing metadata and inconsistent representation across corpora make it difficult to guarantee comparable ASR accuracy across groups, the same problem Reid and Walker tried to address when developing a metadata standard.
In the future, humans will supervise ASR training more intelligently and efficiently, playing an increasingly important role in accelerating machine learning. Human-in-the-loop approaches put human reviewers inside the machine learning feedback loop, allowing continuous review and adjustment of model output. This makes machine learning faster and more effective, yielding higher-quality results. Earlier this year, we discussed how improvements in ASR let Rev's human transcriptionists (called "Revvers") post-edit ASR drafts, making them more productive. Revvers' corrected transcripts can then feed directly back into the ASR model, forming a virtuous cycle.
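A highly simplified sketch of that feedback loop might look like this (each function is a hypothetical stand-in for the real component: the ASR engine, the review tool, the training job):

```python
# Hypothetical sketch of a human-in-the-loop ASR improvement cycle.

def asr_draft(audio_path: str) -> str:
    return f"<machine draft for {audio_path}>"            # placeholder ASR output

def human_post_edit(draft: str) -> str:
    # Placeholder review step: a transcriptionist corrects the machine draft.
    return draft.replace("<machine draft", "<corrected transcript")

def retrain(model_version: int, corrected_pairs: list[tuple[str, str]]) -> int:
    # Placeholder: fine-tune the model on (audio, corrected transcript) pairs.
    return model_version + 1

model_version = 1
corrections = []
for audio in ["call_001.wav", "call_002.wav"]:
    draft = asr_draft(audio)              # 1. model produces a draft
    final = human_post_edit(draft)        # 2. a human post-edits it
    corrections.append((audio, final))    # 3. corrections become new training data

model_version = retrain(model_version, corrections)  # 4. model improves; the loop repeats
print(f"now serving model v{model_version}")
```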
One area where human language experts remain integral to ASR is inverse text normalization (ITN), which converts recognized strings (like "five dollars") into their expected written form (like "$5"). Pusateri et al. proposed a hybrid approach using "hand-crafted grammars and statistical models," and Zhang et al. continued along these lines by constraining RNNs with hand-crafted FSTs.
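To illustrate the kind of rule a language expert might hand-craft, here is a deliberately tiny ITN sketch covering a few spoken dollar amounts; real systems use weighted FSTs and far broader coverage:

```python
# Toy inverse text normalization (ITN) sketch: spoken form -> written form.
# Only handles simple "<number word> dollars" phrases, purely for illustration.
import re

NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

def itn(text: str) -> str:
    pattern = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\s+dollars?\b")
    return pattern.sub(lambda m: f"${NUMBER_WORDS[m.group(1)]}", text)

print(itn("it costs five dollars and takes two days"))
# -> "it costs $5 and takes two days"
```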
“As with all AI systems, future ASR systems will adhere to stricter AI ethics principles so that the system treats everyone equally, has a higher degree of explainability, is accountable for its decisions, and respects the privacy of users and their data.”
Future ASR systems will follow four AI ethics principles: fairness, explainability, respect for privacy, and accountability.
Fairness: A fair ASR system recognizes speech equally well regardless of the speaker's background, socioeconomic status, or other characteristics. Building such a system requires identifying and reducing biases in our models and training data. Fortunately, governments, NGOs, and businesses are already working to create the infrastructure needed to identify and mitigate bias.
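One simple way to begin surfacing such bias is to measure error rates separately for each speaker group. A minimal sketch, assuming the open-source jiwer package and a small evaluation set labeled with (self-reported) group metadata; the data below is made up:

```python
# Minimal sketch of a per-group fairness check: compare WER across speaker groups.
# Assumes the open-source `jiwer` package; the evaluation data is illustrative only.
from collections import defaultdict
from jiwer import wer

# (speaker_group, reference transcript, ASR hypothesis)
eval_set = [
    ("group_a", "turn the lights off please", "turn the lights off please"),
    ("group_a", "what is the weather tomorrow", "what is the weather tomorrow"),
    ("group_b", "turn the lights off please", "turn the light of please"),
    ("group_b", "what is the weather tomorrow", "what is whether tomorrow"),
]

by_group = defaultdict(lambda: {"refs": [], "hyps": []})
for group, ref, hyp in eval_set:
    by_group[group]["refs"].append(ref)
    by_group[group]["hyps"].append(hyp)

for group, data in by_group.items():
    print(group, round(wer(data["refs"], data["hyps"]), 3))
# A large gap between groups signals a bias worth investigating.
```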
Explainability: ASR systems will no longer be "black boxes": they will explain, as needed, how data is collected and analyzed, how models perform, and how outputs are produced. This added transparency allows better human oversight of model training and performance. Like Gerlings et al., we view explainability from the perspective of a range of stakeholders (including researchers, developers, customers, and, in Rev's case, transcriptionists). Researchers may want to know why erroneous text was output so they can mitigate the problem, while transcriptionists may want some evidence of why the ASR system made a particular choice to help them judge its reliability, especially in noisy conditions where ASR may "hear" better than people do. Weitz et al. took important first steps toward explainability for end users in the context of audio keyword recognition, and Laguarta and Subirana have incorporated clinician-guided interpretation into a speech biomarker system for Alzheimer's disease detection.
Respect for privacy: A person's voice is considered "personal data" under various U.S. and international laws, so collecting and processing voice recordings is subject to strict privacy protections. At Rev, we already provide data security and control capabilities, and future ASR systems will go further in protecting both user data and the models themselves. In many cases, this will likely mean pushing ASR models to the edge (onto the device or into the browser). Voice privacy challenges are driving research in this area, and many jurisdictions, such as the European Union, have begun legislative efforts. The field of privacy-preserving machine learning promises to bring attention to this critical aspect of the technology so that it can be widely accepted and trusted by the public.
Accountability: We will monitor ASR systems to ensure they adhere to the first three principles. This in turn requires investing resources and infrastructure to build the necessary monitoring systems and to act on what they find. Companies deploying ASR will be responsible for how they use the technology and will make deliberate efforts to uphold these ethical principles. Notably, humans, as the designers, maintainers, and consumers of ASR systems, will be responsible for implementing and enforcing these principles, which is yet another example of human-machine collaboration.
Reference links:
https://thegradient.pub/the-future-of-speech-recognition/
https://awni.github.io/speech-recognition/