
Building a Local Voice Assistant with LLMs and Neural Networks on Your CPU Laptop


Unlock the Power of Local Voice Assistants: A Step-by-Step Guide

The rise of multimodal Large Language Models (LLMs) has revolutionized how we interact with AI, enabling voice-based interactions. While OpenAI's voice-enabled ChatGPT offers a convenient solution, building a local voice assistant provides enhanced data privacy, unlimited API calls, and the ability to fine-tune models for specific needs. This guide details the construction of such an assistant on a standard CPU-based machine.

Why Choose a Local Voice Assistant?

Three key advantages drive the appeal of local voice assistants:

  1. Data Privacy: Avoid transmitting sensitive information to external servers.
  2. Unrestricted API Calls: Bypass limitations imposed by proprietary APIs.
  3. Customizable Models: Fine-tune LLMs for optimal performance within your specific domain.

Building Your Local Voice Assistant

This project comprises four core components:

  1. Voice Recording: Capture audio input from your device's microphone. The sounddevice library handles the capture, and Python's wave module saves the recording as a WAV file. The code snippet below demonstrates this:
import sounddevice as sd
import wave
import numpy as np

sampling_rate = 16000  # 16 kHz matches the Whisper.cpp model's expected input
duration = 5           # recording length in seconds (example value; adjust as needed)

# Record mono 16-bit audio from the default microphone
recorded_audio = sd.rec(int(duration * sampling_rate), samplerate=sampling_rate, channels=1, dtype=np.int16)
sd.wait()  # block until the recording is finished

# Save the recording as a WAV file
audio_file = "<path>/recorded_audio.wav"
with wave.open(audio_file, "wb") as wf:
    wf.setnchannels(1)        # mono
    wf.setsampwidth(2)        # 16-bit samples
    wf.setframerate(sampling_rate)
    wf.writeframes(recorded_audio.tobytes())
  2. Speech-to-Text Conversion: Transcribe the recorded audio into text. OpenAI's Whisper model, run locally through whisper.cpp (the ggml-base.en.bin checkpoint), handles this step.
import subprocess

WHISPER_BINARY_PATH = "/<path>/whisper.cpp/main"
MODEL_PATH = "/<path>/whisper.cpp/models/ggml-base.en.bin"

# Invoke the whisper.cpp CLI on the recorded WAV file and capture the transcription
try:
    result = subprocess.run(
        [WHISPER_BINARY_PATH, "-m", MODEL_PATH, "-f", audio_file, "-l", "en", "-otxt"],
        capture_output=True, text=True,
    )
    transcription = result.stdout.strip()
except FileNotFoundError:
    print("Whisper.cpp binary not found. Check the path.")
    transcription = ""
  3. Text-Based Response Generation: Employ a lightweight LLM (e.g., qwen:0.5b served through Ollama) to generate a textual response to the transcribed input. A utility function, run_ollama_command, handles the LLM interaction.
import subprocess
import re

def run_ollama_command(model, prompt):
    """Send a prompt to a locally running Ollama model and return its text output."""
    try:
        result = subprocess.run(["ollama", "run", model], input=prompt, text=True, capture_output=True, check=True)
        return result.stdout
    except subprocess.CalledProcessError as e:
        print(f"Ollama error: {e.stderr}")
        return None

# whisper.cpp prefixes each transcribed line with a timestamp such as
# "[00:00:00.000 --> 00:00:02.000]"; keep only the text after the closing bracket.
matches = re.findall(r"] *(.*)", transcription)
concatenated_text = " ".join(matches)

prompt = f"""Please ignore [BLANK_AUDIO]. Given: "{concatenated_text}", answer in under 15 words."""
answer = run_ollama_command(model="qwen:0.5b", prompt=prompt)
  4. Text-to-Speech Conversion: Convert the generated text response back into audio using NVIDIA's NeMo toolkit (a FastPitch spectrogram generator paired with a HiFi-GAN vocoder).
import nemo.collections.tts as nemo_tts
import torchaudio
from io import BytesIO

try:
    # Load the pretrained spectrogram generator (FastPitch) and vocoder (HiFi-GAN)
    fastpitch_model = nemo_tts.models.FastPitchModel.from_pretrained("tts_en_fastpitch")
    hifigan_model = nemo_tts.models.HifiGanModel.from_pretrained("tts_en_lj_hifigan_ft_mixerttsx")
    fastpitch_model.eval()
    hifigan_model.eval()

    # Text -> tokens -> mel spectrogram -> waveform
    parsed_text = fastpitch_model.parse(answer)
    spectrogram = fastpitch_model.generate_spectrogram(tokens=parsed_text)
    audio = hifigan_model.convert_spectrogram_to_audio(spec=spectrogram)

    # Write the waveform to an in-memory WAV buffer for playback or download
    audio_buffer = BytesIO()
    torchaudio.save(audio_buffer, audio.cpu(), sample_rate=22050, format="wav")
    audio_buffer.seek(0)
except Exception as e:
    print(f"TTS error: {e}")
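To hear the reply immediately on the machine itself, the waveform can also be played back with the same sounddevice library used for recording. This is a minimal sketch, assuming audio is the waveform tensor produced by HiFi-GAN above and the 22.05 kHz sample rate used in the save step:

import sounddevice as sd

# Play the synthesized reply through the default output device.
# `audio` is assumed to be the 1 x N waveform tensor from convert_spectrogram_to_audio.
waveform = audio.cpu().numpy().squeeze()
sd.play(waveform, samplerate=22050)
sd.wait()  # block until playback finishes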

System Integration and Future Improvements

A Streamlit application integrates these components, providing a user-friendly interface. Further enhancements could include conversation history management, multilingual support, and source attribution for responses. Consider exploring Open WebUI for additional audio model integration capabilities. Remember to always critically evaluate AI-generated responses.
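As a rough illustration of how the pieces fit together, the sketch below wires the four steps into a single Streamlit page. The helper functions record_audio, transcribe, generate_answer, and synthesize are hypothetical wrappers around the code shown earlier, not part of any library:

import streamlit as st

st.title("Local Voice Assistant")

if st.button("Ask a question"):
    # Hypothetical wrappers around the four steps described above
    audio_file = record_audio(duration=5)        # 1. record from the microphone
    transcription = transcribe(audio_file)       # 2. Whisper.cpp speech-to-text
    st.write(f"You said: {transcription}")

    answer = generate_answer(transcription)      # 3. qwen:0.5b response via Ollama
    st.write(f"Assistant: {answer}")

    audio_buffer = synthesize(answer)            # 4. NeMo FastPitch + HiFi-GAN
    st.audio(audio_buffer, format="audio/wav")   # play the spoken reply in the browser

Conversation history, mentioned above as a possible enhancement, could be kept in st.session_state so that previous turns are fed back into the prompt on each new question.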
