


Wang Wenbing, Head of Algorithms at Rokid: The "Wonderful" Realm of Sound in AR
Sound is ubiquitous and indispensable in our daily lives, and the same holds in the metaverse. Achieving full immersion in metaverse scenes requires continuous advances across a range of sound technologies. At the recent "AISummit Global Artificial Intelligence Technology Conference" hosted by 51CTO, Wang Wenbing, head of algorithms at Rokid, delivered a keynote speech titled "Sound in AR: A 'Wonderful' Realm". He introduced the concept behind Rokid's self-developed 6DoF spatial sound field, its main technical modules and difficulties, its development trends in combination with AR, and the original motivation for building the technology, showing how spatial sound field technology manifests in the metaverse.
The content of the speech is organized as follows.
What is the 6DoF spatial sound field?
To approach this question, set aside technical constraints for a moment and imagine how sound on AR should be presented. Most of the TVs and phones we use today are two-channel stereo; home theaters already use multi-channel setups; and professional venues such as movie theaters arrange speakers throughout the space.
So how should sound be presented on AR? Imagine a now-common scene such as an online meeting or online class: if the digital person you see speaking is on your right in the metaverse, but the voice comes from your left, doesn't that feel strange?
Or consider an AR game. With a traditional 2D view, sound can simply follow the visual focus, but in a 360-degree 3D scene the eyes cannot cover every point of visual focus, while sound provides global focus. This is why players in many games switch viewpoints based on what they hear. From this we can see the characteristics that sound on AR needs: it must satisfy people's high sensitivity to sound, provide the global focus of sound, and meet the demand for realism.
Next, let's look at the development of sound presentation along three dimensions.
First, the spatial expression dimension. Sound expression has evolved from mono/stereo to planar multi-channel layouts (5.1/7.1/9.1/...), and then to spatial multi-channel layouts (5.1.x/7.1.x, etc.). Speakers have grown in number, and their placement has moved from a plane into space.
Second, the encoding dimension. Encoding began channel-based: each channel mixes many sounds, as in the familiar left and right stereo channels. It then moved to object-based encoding, in which individual sound events are encoded separately. In a Dolby Atmos film source, for example, a falling cannonball is encoded as its own object, its trajectory recorded in metadata and played back through whichever speakers match its position. The ultimate goal is fully scene-based encoding, similar to panoramic-sound methods such as HOA, where not just the cannonball but every flower, blade of grass, and falling leaf carries a sense of space.
Third, the XR experience dimension. In the past, virtual sound was separate from the real world; in XR, and especially in AR, what we have been pursuing is the fusion of the virtual and the real.
People can distinguish sounds in such fine detail because of binaural hearing: technically, ITD and ILD, the time difference and sound intensity difference between the two ears. These two cues let us quickly localize the direction a sound comes from.
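As a small illustration of the ITD cue, Woodworth's classic spherical-head formula approximates the extra path a sound travels to the far ear. The head radius and speed of sound below are typical textbook values, not Rokid parameters:

```python
import math

def itd_seconds(azimuth_deg, head_radius_m=0.0875, c=343.0):
    """Woodworth's spherical-head approximation of the interaural
    time difference (ITD) for a source at the given azimuth
    (0 degrees = straight ahead, 90 = fully to one side)."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / c) * (theta + math.sin(theta))

# A source directly ahead produces no ITD; a source at 90 degrees
# arrives roughly 0.66 ms earlier at the near ear.
print(itd_seconds(0.0), itd_seconds(90.0))
```

The brain resolves differences down to tens of microseconds, which is why even small localization errors in rendered audio feel "off".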
So how can 3D sound be made widely available? How do we break through venue limitations, reduce the cost to users, and let everyone enjoy the technology? Rokid's self-developed 6DoF spatial sound field helps answer these questions.
The name "6DoF spatial sound field" breaks into two parts: 6DoF and spatial sound field. 6DoF denotes six degrees of freedom: a gyroscope provides rotation about the X, Y, and Z axes, and an accelerometer provides acceleration along the X, Y, and Z directions.
The 6DoF spatial sound field covers the generation, propagation, rendering, and encoding/decoding of sound, as well as the fusion and interaction of virtual and real sounds throughout the process.
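A minimal sketch of how a 6DoF pose feeds into spatial audio: each source is transformed from world coordinates into the listener's head frame before rendering. The pose here is simplified to three translations plus a single yaw rotation; a real pipeline would apply the full gyroscope-derived orientation (yaw, pitch, roll):

```python
import math

def source_in_head_frame(source_xyz, head_xyz, yaw_deg):
    """Express a world-space sound source in the listener's head frame.
    Simplified to translation plus yaw only, for illustration."""
    dx = source_xyz[0] - head_xyz[0]
    dy = source_xyz[1] - head_xyz[1]
    dz = source_xyz[2] - head_xyz[2]
    yaw = math.radians(yaw_deg)
    # Rotate the world-space offset by -yaw into head coordinates.
    rx = dx * math.cos(yaw) + dy * math.sin(yaw)
    ry = -dx * math.sin(yaw) + dy * math.cos(yaw)
    return (rx, ry, dz)

# A listener at the origin rotated 90 degrees: a source that was
# along +y in world space now lies along +x in the head frame.
x, y, z = source_in_head_frame((0.0, 2.0, 0.0), (0.0, 0.0, 0.0), 90.0)
```

Running this transform every frame, driven by the gyroscope and accelerometer, is what keeps virtual sounds anchored in the room as the listener turns and walks.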
The main technologies of the 6DoF spatial sound field
The main technical modules of the 6DoF spatial sound field are HRTFs, sound field rendering, and sound effects. HRTFs (head-related transfer functions) describe the response from a sound source in the free field to the eardrum, simulating how sound from every direction reaches the human ear in an anechoic-chamber environment. Sound field rendering lets people identify the position of a sound by listening alone, and blends virtual and real objects so that real objects correctly affect virtual sound sources. Sound effects enrich the sound quality, using open speakers designed for privacy that reduce sound leakage while maintaining volume.
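The HRTF step can be sketched as filtering a mono source with a pair of head-related impulse responses (HRIRs), one per ear. The HRIRs below are toy placeholders, not measured data; a real renderer would convolve against measured responses for the source's direction:

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution (a real system would use
    numpy/FFT-based convolution for speed)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def binauralize(mono, hrir_left, hrir_right):
    """Render a mono source to two ears by filtering it with the
    HRIR pair measured for the source's direction."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Toy HRIRs: the right ear gets the sound one sample later and
# quieter, mimicking the ITD/ILD cues a real HRTF encodes.
left, right = binauralize([1.0, 0.5], [1.0, 0.0], [0.0, 0.6])
```

Because the two filtered signals carry the interaural time and level cues, the listener hears the mono source as coming from the direction the HRIRs were measured at.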
The SDK at the top of the architecture diagram exposes two external modules: the spatial engine and the speech engine. Spatial information can be acquired and modeled, helping fuse the digital and physical worlds.
We have also modified the Room Effect module. Its overall framework resembles a classic network structure: first a theoretically lossless network is constructed, and then various attenuation and loss settings are applied on top of it, including absorption, occlusion, and reflection. Our goal is not to produce every possible sound effect; we provide effects matched to the product's usage scenarios, such as theater or music, so users get a good audio-visual experience. All of this can be experienced on the next-generation AR glasses, Rokid Max.
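A toy illustration of the attenuation-and-loss idea: start from a lossless path and multiply in each loss term. The specific gain factors are invented for illustration and are not Rokid's Room Effect parameters; a real engine applies them per frequency band rather than as one broadband gain:

```python
def source_gain(distance_m, occluded=False, absorption=0.0):
    """Toy room-effect gain: inverse-distance rolloff, an occlusion
    penalty when a real object blocks the path, and a surface
    absorption factor in [0, 1)."""
    gain = 1.0 / max(distance_m, 1.0)   # free-field distance rolloff
    if occluded:
        gain *= 0.3                     # muffled behind an obstacle
    return gain * (1.0 - absorption)    # energy absorbed by surfaces
```

Stacking simple multiplicative losses like this onto a lossless base is what lets one framework cover absorption, occlusion, and reflection with scenario-specific presets.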
A comparison of 6DoF spatial sound fields: on the left is a third-party SDK. When rotating from 0 to 90 degrees, the change in each frequency band is not smooth, dropping sharply at first and then barely changing. Rokid's 6DoF spatial sound field, on the right, shows clear changes across frequency bands as the position changes. The figure shows the behavior at different angles, frequency bands, and amplitudes.
Development trends of the 6DoF spatial sound field
With the arrival of the metaverse era and the rise of AR and VR technologies, the development of spatial sound fields has ushered in new opportunities.
The development trend of spatial sound fields is mainly reflected in three aspects:
First, immersion. Spatial sound should respond to the real world, so that virtual and real content integrate and interact, achieving a truly immersive experience. Sounds in the virtual world should not be immune to the influence of objects in the real world; otherwise the two still feel separate. Beyond integration, interaction is also needed: in the virtual world you can act on the augmented sound on an AR device through voice, gestures, and other methods, choosing to pause or play, switching between windows at different levels and perspectives, or tuning in to the sounds that interest you.
Second, refinement, involving refined exploration and practice in HRTFs, resolution, test methods, and customization. The hardest part to refine is the HRTF itself, because generating one is time-consuming and labor-intensive: a sound must be played at every point, at multiple distances, over an entire spherical space, then sampled at the ear canal. Researchers are currently studying how to reach the same level of refinement with fewer sampling points, and how to achieve higher accuracy through interpolation and other techniques. In the longer term, the limit of refinement is per-user customization.
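The interpolation idea mentioned above can be sketched in one dimension: blend the two nearest measured HRIRs by azimuth. Real systems interpolate over the full sphere (e.g. with spherical-harmonic or panning-based methods); this 1-D version just illustrates how a sparse measurement grid can serve arbitrary directions:

```python
def interpolate_hrir(az_deg, measured):
    """Linearly interpolate an HRIR between the two nearest measured
    azimuths. `measured` maps azimuth in degrees -> impulse response
    (a list of samples); az_deg must lie within the measured range."""
    angles = sorted(measured)
    lo = max(a for a in angles if a <= az_deg)
    hi = min(a for a in angles if a >= az_deg)
    if lo == hi:                      # exact hit on a measured point
        return list(measured[lo])
    w = (az_deg - lo) / (hi - lo)     # blend weight toward `hi`
    return [(1 - w) * x + w * y
            for x, y in zip(measured[lo], measured[hi])]

# Two toy measured responses, 30 degrees apart; 15 degrees falls
# exactly halfway between them.
grid = {0: [1.0, 0.0], 30: [0.0, 1.0]}
midpoint = interpolate_hrir(15, grid)
```

The research question is how coarse the grid can be before interpolation errors become audible, which is exactly the trade-off between measurement effort and refinement.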
Third, privacy and sound effects, experiencing the auditory feast brought by sound in different frequency bands. Different harmonics and bands feel different: heavy reverberation impairs hearing, while appropriate reverberation enriches the listening experience; early reflections in particular are often used to judge timbre. Reverberation and lateral reflections below 3 kHz help create a sense of space and depth, while high-frequency components contribute to the sense of envelopment.
The original intention of exploring spatial sound fields
Why does Rokid create spatial sound fields? There are three main reasons:
First, immersion. We have been pursuing the fusion of the digital and physical worlds, such as the vividness of games and the realism of online meetings and online classes.
Second, virtual-real interaction. We believe the future world will be a fusion of the virtual and the real, and on that basis many interactions become possible, including spatial perception and subjective interaction. Spatial perception covers properties of the world such as object size, room size, and materials, which in turn shape virtual sounds; subjective interaction is humans intervening in, selecting, and communicating with sounds in the digital world.
Third, ultimate quality. AR glasses differ from phones, tablets, TVs, and other products. A dropped connection or a lag is tolerable on a phone, but a device worn on your eyes demands very high real-time performance. Meeting that demand requires end-to-end optimization across algorithms, engineering, systems, hardware, and applications.
These are the missions we pursue. Rokid hopes to bring these capabilities directly to the public through its AR glasses, and at the same time to release these technologies as basic capabilities of our Yoda OS, indirectly benefiting users and empowering industries through developers.
The replay of the conference speech and the slides are now online; visit the official website to view them (https://www.php.cn/link/53253027fef2ab5162a602f2acfed431).