Advances in artificial intelligence and machine learning have significantly expanded what is possible in the browser. Running text-to-speech (TTS) models directly in the browser opens new opportunities for privacy, speed, and convenience. In this blog post, we will explore how to run the Kokoro-82M ONNX TTS model in the browser with a JavaScript implementation. If you're curious, you can try it out in my demo: Kitt AI Text-to-Speech.
Traditionally, TTS models are executed on a server, requiring an internet connection to send input text and receive synthesized speech. However, with improvements to WebGPU and ONNX Runtime Web (used under the hood by Transformers.js), you can now run capable models like Kokoro-82M ONNX directly in the browser. This brings several advantages:

- Privacy: the text never leaves the user's device.
- Speed: no network round trip to a server for each request.
- Convenience: once the model files are cached, synthesis can work offline.
The Kokoro-82M ONNX model is a lightweight yet effective TTS model optimized for on-device inference. It provides high-quality speech synthesis while maintaining a small footprint, making it suitable for browser environments.
To run Kokoro-82M ONNX in your browser you need:

- A modern browser with WebAssembly support (WebGPU is optional but faster where available).
- Node.js and a bundler to set up the JavaScript project.
- The @huggingface/transformers package and the Kokoro.js script described below.
You can set up your project by including the necessary dependencies in package.json:
<code>{
  "dependencies": {
    "@huggingface/transformers": "^3.3.1"
  }
}</code>
Next, make sure you have the Kokoro.js script, which is available from this repository.
To load and use the Kokoro-82M ONNX model in your browser, follow these steps:
<code class="language-javascript">// Load the model weights and tokenizer; both return promises and must be awaited.
this.model_instance = await StyleTextToSpeech2Model.from_pretrained(this.modelId, {
  device: "wasm",
  progress_callback,
});
this.tokenizer = await AutoTokenizer.from_pretrained(this.modelId, {
  progress_callback,
});</code>
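The snippet above pins the device to "wasm". In practice you may prefer WebGPU when the browser exposes it and fall back to WASM otherwise. A minimal feature-detection sketch follows; the `pickDevice` helper is hypothetical, not part of Kokoro.js:

```javascript
// Hypothetical helper: choose the Transformers.js device string by
// feature-detecting WebGPU, falling back to WASM everywhere else.
function pickDevice(nav = globalThis.navigator) {
  // `navigator.gpu` is only defined in WebGPU-capable browsers.
  return nav && "gpu" in nav ? "webgpu" : "wasm";
}
```

You could then pass `{ device: pickDevice(), progress_callback }` to `from_pretrained`. Since WebGPU support still varies across browsers, it is worth testing the WASM fallback path as well.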
After loading the model and processing the text, you can run inference to generate speech:
<code class="language-javascript">const language = speakerId.at(0); // "a" or "b"
const phonemes = await phonemize(text, language);
const { input_ids } = await tokenizer(phonemes, { truncation: true });

// Exclude the start/end tokens (no padding); never go below 0.
const num_tokens = Math.max(input_ids.dims.at(-1) - 2, 0);

// Each token count selects a STYLE_DIM-length style vector
// from the packed voice embedding data.
const offset = num_tokens * STYLE_DIM;
const data = await getVoiceData(speakerId as keyof typeof VOICES);
const voiceData = data.slice(offset, offset + STYLE_DIM);

const inputs = {
  input_ids,
  style: new Tensor("float32", voiceData, [1, STYLE_DIM]),
  speed: new Tensor("float32", [speed], [1]),
};
const { waveform } = await model(inputs);
const audio = new RawAudio(waveform.data, SAMPLE_RATE).toBlob();</code>
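The final `audio` value is a Blob. To actually hear it in the browser, you can hand the Blob to an audio player through an object URL. Here is a small sketch; the `playAudioBlob` helper is an assumption for illustration, not part of the demo code:

```javascript
// Hypothetical helper: play a synthesized audio Blob in the browser.
// The Audio constructor is injectable so the logic can be exercised
// outside a browser environment.
function playAudioBlob(blob, AudioCtor = globalThis.Audio) {
  const url = URL.createObjectURL(blob);
  const player = new AudioCtor(url);
  // Revoke the object URL after playback to avoid leaking memory.
  player.addEventListener("ended", () => URL.revokeObjectURL(url));
  player.play();
  return player;
}
```

In the browser you would simply call `playAudioBlob(audio)` after inference completes.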
You can see this in my live demo: Kitt AI Text-to-Speech. The demo showcases real-time text-to-speech synthesis powered by Kokoro-82M ONNX.
Running TTS models like Kokoro-82M ONNX in the browser represents a leap forward for privacy-preserving, low-latency applications. With just a few lines of JavaScript and the power of ONNX Runtime Web, you can create high-quality, responsive TTS applications that delight your users. Whether you're building accessibility tools, voice assistants, or interactive applications, in-browser TTS could be a game-changer.
Try the Kitt AI text-to-speech demo now and see for yourself!