Local Assistant on a Budget: Build a Siri-like Voice Agent on a Pi and Open Models

2026-01-28
9 min read

Build a privacy-first Siri alternative on a Raspberry Pi 5 using local LLMs, whisper.cpp, and open speech stacks — step-by-step and budget-friendly.

Frustrated by cloud-only voice assistants that leak data and cost money? In 2026 you can build a private, Siri-like voice agent on a Raspberry Pi 5 using small local LLMs and open speech stacks, for under $300 and with developer-grade control.

Why this matters in 2026

Major consumer assistants now lean heavily on cloud models (Apple’s Siri partnership with Google’s Gemini is the highest-profile example). That increases latency, cost, and privacy risk for sensitive uses. At the same time, hardware and model tooling improved drastically in late 2024–2025: single-board computers now support NPUs, quantized LLM runtimes (ggml/llama.cpp family) have matured, and efficient local speech models (whisper.cpp, Coqui, Silero) run with acceptable latency. That makes a practical, private, on-device assistant viable for everyday tasks. For hands-on tips about small edge models see the AuroraLite discussions on tiny multimodal models and edge tradeoffs.

What you’ll build (high level)

We’ll create a small on-device voice assistant with the following pipeline:

  • Wake word detection (always-on, low CPU).
  • Voice activity detection (VAD) to segment speech and reduce processing.
  • Speech-to-text (STT) using a local inference engine.
  • Local LLM for intent parsing and conversational responses.
  • Text-to-speech (TTS) generated locally and played back.

Target outcomes

  • Privacy-first assistant — no cloud calls unless you opt in. See the safety & consent guidance for voice listings and micro-gigs to model your consent UX.
  • Real-time conversational responses for local info, reminders, and automations.
  • Modular stack so you can swap models, tune quantization, and add integrations.

Hardware & software checklist

Minimum viable parts for a responsive, local assistant:

  • Raspberry Pi 5 (officially recommended) or similar SBC with PCIe/USB-C power delivery.
  • Optional AI HAT+ 2 or other NPU accelerator for faster model inference (recommended if you want sub-second LLM latency).
  • Microphone: a USB mic or a sound card + electret/stereo mic array (for beamforming if needed).
  • Speaker or USB audio DAC for TTS playback — see the best Bluetooth micro speakers guide for compact speaker options.
  • 32–64 GB microSD (or NVMe via an adapter for faster storage and swap); add an external SSD as a model store if you plan to run larger models.
  • Recommended RAM: 8–16 GB to comfortably run quantized 4–8-bit LLMs.

Software building blocks (open stacks)

Pick robust, community-maintained components that run on-device:

  • Wake word: Mycroft Precise (open) or Porcupine (commercial) for low-power always-on detection.
  • VAD: webrtcvad (Python binding) for short, reliable speech segmentation.
  • STT: whisper.cpp or a Coqui/ONNX-compiled model for local transcription. whisper.cpp is highly optimized for low-memory inference.
  • LLM runtime: llama.cpp / ggml-based runtimes or llama-cpp-python for Python orchestration. Use quantized ggml models (int8/int4) for Pi CPU, or offload to NPU if available.
  • TTS: Coqui TTS, eSpeak NG (fast, small), or lightweight neural TTS compiled with ONNX for better naturalness.
  • Orchestration: a small Python service that ties these pieces together, e.g., FastAPI + systemd for production mode.
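
As a sketch of what that production mode can look like, here is a minimal FastAPI wrapper that exposes the pipeline as a single endpoint. The module name assistant and its helpers (transcribe, query_llm, speak) are assumptions standing in for the glue code built in step 6 below.

# Minimal FastAPI orchestration sketch (assumed module layout, not a fixed API)
from assistant import transcribe, query_llm, speak  # hypothetical module from step 6
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Utterance(BaseModel):
    audio_path: str  # WAV segment produced by the VAD stage

@app.post("/ask")
def ask(utterance: Utterance):
    transcript = transcribe(utterance.audio_path)
    reply = query_llm(f"User: {transcript}\nAssistant:")
    speak(reply)
    return {"transcript": transcript, "reply": reply}

# Run bound to loopback (or a LAN-only interface), e.g.:
#   uvicorn assistant_api:app --host 127.0.0.1 --port 8080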

Design considerations and tradeoffs

Latency vs accuracy: Smaller models reduce latency but may hallucinate more. Use a two-tier approach: small local LLM for most queries and a cloud fallback for complex tasks if the user consents. If you use a cloud fallback consider patterns from hybrid avatar agents — send structured queries rather than raw audio.
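
A minimal sketch of that consent-gated fallback is below; the endpoint URL, the consent flag, and the shape of the structured query are all placeholders rather than a real API.

# Consent-gated cloud fallback sketch: only structured fields leave the device
import json
import urllib.request

CLOUD_ENDPOINT = "https://example.com/v1/assist"   # placeholder endpoint
CLOUD_FALLBACK_ENABLED = False                     # flip only after explicit opt-in

def cloud_fallback(structured_query: dict) -> str:
    if not CLOUD_FALLBACK_ENABLED:
        return "That needs the cloud, and cloud fallback is switched off."
    payload = json.dumps({
        "intent": structured_query["intent"],
        "slots": structured_query.get("slots", {}),   # never raw audio or full transcripts
    }).encode()
    req = urllib.request.Request(CLOUD_ENDPOINT, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())["reply"]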

Privacy: Keep models and audio local. Encrypt the storage where you keep model files and logs. If using an optional cloud fallback, explicitly show which data is sent.

Power/heat: Continuous wake-word listening is cheap; heavy LLM inference spikes CPU/NPU and can heat the Pi. Use thermal throttling and underclocking if needed — consider portable UPS or batteries when running outside a controlled power environment (see portable power comparisons).
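
The stock Raspberry Pi tooling is enough to watch for thermal trouble during heavy inference runs:

# Check temperature and throttling state before/after a long inference session
vcgencmd measure_temp          # e.g. temp=62.3'C
vcgencmd get_throttled         # throttled=0x0 means no throttling has occurred
# Poll every 10 seconds while benchmarking
watch -n 10 vcgencmd measure_temp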

Extensibility: Keep the orchestration modular so you can replace STT/LLM/TTS independently. Containerize each service or run them as separate systemd units.

Step-by-step: Build the assistant

1) Prep the Pi 5

  1. Install Raspberry Pi OS (64-bit) and updates.
    sudo apt update && sudo apt upgrade -y
    sudo apt install git build-essential python3-pip python3-venv -y
  2. Optional: Install drivers for AI HAT+ 2 or other NPU. Follow vendor instructions to enable kernel modules and verify available accelerators.
  3. Set up audio: confirm mic and speaker devices with aplay/arecord.
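
A quick way to confirm ALSA sees both devices is shown below; the plughw:1,0 index is an assumption, so adjust it to match the card and device numbers reported by the -l listings.

# List devices, then record and play back a 3-second test clip
arecord -l                                               # capture devices (USB mic should appear)
aplay -l                                                 # playback devices
arecord -D plughw:1,0 -f S16_LE -r 16000 -d 3 test.wav   # adjust card,device to your mic
aplay test.wav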

2) Wake-word and VAD

We recommend Mycroft Precise for an open option. Use webrtcvad to trim silence and short noises so STT only runs when necessary.

# Install webrtcvad and audio dependencies
python3 -m pip install webrtcvad sounddevice numpy

# Simple VAD loop (sketch)
import webrtcvad
import sounddevice as sd

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; 2 filters most non-speech
samplerate = 16000
frame_ms = 30                                  # webrtcvad accepts 10, 20 or 30 ms frames
blocksize = samplerate * frame_ms // 1000      # 480 samples per frame at 16 kHz

def callback(indata, frames, time, status):
    pcm = indata.tobytes()                     # 16-bit little-endian PCM
    if vad.is_speech(pcm, samplerate):
        # Speech detected: hand this frame to the STT pipeline
        pass

with sd.InputStream(channels=1, samplerate=samplerate, dtype='int16',
                    blocksize=blocksize, callback=callback):
    sd.sleep(10_000_000)                       # keep the stream open (ms)

3) Local STT: whisper.cpp

whisper.cpp is optimized to run quantized Whisper models on CPUs; it’s a reliable path for offline transcription.

  1. Clone and build:
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
bash ./models/download-ggml-model.sh base.en   # fetch the small English model used below

Transcribe:

./main -m models/ggml-base.en.bin -f audio_chunk.wav

4) Local LLM: llama.cpp / quantized model

Use llama.cpp or similar GGML-based runtimes that support quantized files. For constrained devices, prefer 4-bit or 8-bit quantized models. If you have an NPU/HAT, use the vendor's runtime and an ONNX/TFLite export.

# Example: running a prompt with llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
./main -m models/ggml-model.bin -p "You are a helpful assistant. User said: '{transcript}'" -n 128

For Python orchestration, use llama-cpp-python:

python3 -m pip install llama-cpp-python

from llama_cpp import Llama
llm = Llama(model_path="models/ggml-model.bin")
resp = llm.create_completion(prompt=f"User: {transcript}\nAssistant:", max_tokens=128)
print(resp['choices'][0]['text'])

5) TTS: Coqui or eSpeak NG

Coqui TTS gives natural voices but is heavier; eSpeak NG is tiny and fast. Install the one that matches your resource budget.

# Coqui TTS quick run (if supported)
python3 -m pip install TTS
tts --text "{response_text}" --out_path output.wav
aplay output.wav
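
If Coqui is too heavy for your board, the eSpeak NG route is nearly instant, at the cost of a more robotic voice:

# eSpeak NG fallback: tiny, fast, fully offline
sudo apt install espeak-ng -y
espeak-ng -w output.wav "Your timer is set for ten minutes."
aplay output.wav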

6) Glue code: put it all together

Outline of a minimal Python orchestrator:

#!/usr/bin/env python3
import subprocess

from llama_cpp import Llama

# Load the model once at startup; reloading it for every query is far too slow on a Pi.
llm = Llama(model_path="models/ggml-model.bin")

def transcribe(audio_file):
    # -nt suppresses timestamps so stdout is just the transcript text
    out = subprocess.check_output(
        ["./whisper.cpp/main", "-m", "models/ggml-base.en.bin", "-nt", "-f", audio_file])
    return out.decode("utf-8").strip()

def query_llm(prompt):
    r = llm.create_completion(prompt=prompt, max_tokens=128)
    return r["choices"][0]["text"]

def speak(text):
    subprocess.run(["tts", "--text", text, "--out_path", "out.wav"])
    subprocess.run(["aplay", "out.wav"])

# Main event loop: wakeword -> record -> VAD/transcribe -> LLM -> TTS

Performance optimizations

  • Quantization: Use ggml quantized models (Q4_0/Q4_1) — big wins for memory and CPU.
  • Batch smaller requests: For multi-turn prompts, keep a local context window and compress older turns (summarize) to limit tokens.
  • Use NPU: Offload heavy ops to the AI HAT+ 2 or equivalent when available to drop latency and power usage. See tiny-edge model reviews for tradeoffs.
  • Cache frequent responses: Local knowledge (e.g., weather format, calendar retrieval) can be cached to avoid repeated LLM calls.
  • Adaptive sampling: Run a small intent model first; for simple commands, skip full LLM inference and run deterministic command handlers.
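
To make the last point concrete, here is a minimal router sketch: regex rules catch deterministic commands and everything else falls through to the LLM. The intents and patterns are illustrative only.

# Tiny intent router: handle simple commands without touching the LLM
import re

COMMAND_ROUTES = [
    (re.compile(r"\bset (?:a )?timer for (\d+) minutes?\b", re.I), "timer"),
    (re.compile(r"\bturn (on|off) the (\w+) lights?\b", re.I), "lights"),
]

def route(transcript: str):
    for pattern, intent in COMMAND_ROUTES:
        match = pattern.search(transcript)
        if match:
            return intent, match.groups()   # deterministic handler, no LLM call
    return "llm", (transcript,)             # fall through to full LLM inference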

Security and privacy best practices

Keep your assistant trustworthy and secure:

  • Store models and logs in an encrypted partition; use LUKS on external SSDs (example commands after this list).
  • Run the orchestrator as an unprivileged system user; avoid exposing open ports to the internet.
  • Build a clear local consent mechanism for any cloud fallback — show which audio/text will be sent and require opt-in. See safety & consent guidance.
  • Rotate API keys locally and keep them in an encrypted store (e.g., pass or an HSM if available). Consider identity and zero-trust patterns from identity as the center of zero trust.
  • Implement rate limits and watchdogs to prevent runaway inference loops that can heat the device or use excessive power.
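
For the LUKS suggestion above, a typical setup for an external SSD used as the model store looks like this. The /dev/sda1 path is an assumption: verify it with lsblk first, and note that luksFormat destroys any existing data on the partition.

# Encrypt an external SSD and mount it as the model store
sudo cryptsetup luksFormat /dev/sda1          # confirm the device with lsblk before running
sudo cryptsetup open /dev/sda1 modelstore
sudo mkfs.ext4 /dev/mapper/modelstore
sudo mkdir -p /mnt/models
sudo mount /dev/mapper/modelstore /mnt/models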

Integration examples

Real-world use cases you can add after the basic assistant works:

  • Home automation: Integrate with Home Assistant via local MQTT to control lights and routines without cloud.
  • Private calendar: Keep a local calendar backend (CalDAV) or sync read-only cloud calendars but keep contents encrypted locally.
  • Local knowledge base: Add a vector store on-device (FAISS or Chroma) with embeddings to serve private docs and notes (see the sketch after this list).
  • Debugging/logging: Provide a developer mode that streams compact logs to an admin UI on the LAN.
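
As a sketch of the local knowledge base, the snippet below uses FAISS with a 384-dimension index (typical for MiniLM-class sentence encoders); embed() is a placeholder for whatever embedding model you run on-device.

# On-device semantic search sketch with FAISS
import numpy as np
import faiss

dim = 384
index = faiss.IndexFlatL2(dim)

def embed(texts):
    # Placeholder: swap in a real local embedding model
    return np.random.rand(len(texts), dim).astype("float32")

notes = ["Bin collection is Tuesday morning", "Router admin password is in the safe"]
index.add(embed(notes))

distances, ids = index.search(embed(["when do the bins go out?"]), 1)
print(notes[ids[0][0]])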

Troubleshooting & tips from real builds

  • If the Pi overheats during long sessions, throttle the LLM or relocate heavy work to a scheduled batch process.
  • Wake-word false positives: retrain the wake model with negative samples from your environment.
  • Transcription errors on accents: try a small domain-adapted STT model or fine-tune a tiny model on sample audio.
  • If TTS sounds robotic, pre/post-process text with SSML-like rules (pause tokens, punctuation) before feeding to TTS.
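
A small normalization pass like the one below covers that last tip; the rules are illustrative and should be tuned to your TTS engine.

# Normalize text before TTS so the voice pauses and expands units sensibly
import re

def prep_for_tts(text: str) -> str:
    text = re.sub(r"\s+", " ", text).strip()
    text = text.replace("&", " and ")
    text = re.sub(r"(\d+)\s*C\b", r"\1 degrees Celsius", text)   # "21C" -> spoken form
    if not text.endswith((".", "!", "?")):
        text += "."                                              # final pause
    return text

print(prep_for_tts("Tomorrow:  21C & windy"))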

Recent (late 2025 to early 2026) improvements make local assistants more compelling:

  • Wider availability of small, high-quality LLMs designed for on-device use with permissive licensing.
  • SBC vendors standardizing NPU acceleration and drivers — expect easier hardware offload in 2026.
  • Model toolchains (quantization, pruning, compilation to ONNX/TFLite/TVM) matured, giving reproducible performance gains. For tooling and continual learning patterns see continual-learning tooling.
  • Growing demand for privacy-first assistants as OEMs move some services back on-device — opening up more integration possibilities.

Apple’s shift to cloud models (e.g., Gemini partnership) highlights two paths: centralized convenience versus local privacy. This guide focuses on the latter — practical for developers and privacy-conscious users in 2026.

Advanced strategies (for power users)

  • Hybrid pipelines: Run intent classification locally and only send structured queries (not raw audio) to a cloud LLM when needed. This minimizes leakage and cost — patterns discussed in hybrid agent design notes.
  • Federated personalization: Keep personalization vectors locally and train small adapters on-device; occasionally aggregate anonymized gradients if you run a fleet. Continual learning playbooks show how to operationalize small-model updates.
  • Model distillation: Distill a big cloud teacher into a tiny on-device student for frequently used intents.
  • Edge orchestration: Use container orchestration (balena, k3s-lite) if you manage multiple Pis for load balancing and updates. See cluster and Pi-fleet tips for networking and storage.

Costs and time estimate

Budget estimate (2026 prices):

  • Raspberry Pi 5: $100–$150
  • AI HAT+ 2 (optional): $120–$180
  • Microphone + speaker: $20–$60
  • External SSD (optional): $30–$80

Developer time estimate: 1–3 days to a basic prototype; 1–3 weeks to a polished personal assistant with security and integration work.

Actionable checklist (start now)

  1. Order Pi 5 + AI HAT+ 2 (if you want low-latency LLMs).
  2. Flash a 64-bit OS image and enable SSH.
  3. Install whisper.cpp and llama.cpp, pull small quantized models, and run sample inferences.
  4. Wire up Mic/Speaker and test webrtcvad + wake-word detection.
  5. Build the orchestrator: VAD → STT → LLM → TTS. Keep it modular.
  6. Harden the device: encrypt storage, run as unprivileged user, and configure a local firewall.
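
For step 6, a minimal ufw configuration looks like the following; the 192.168.1.0/24 subnet is an assumption about your home LAN.

# Allow SSH from the LAN only and deny everything else inbound
sudo apt install ufw -y
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from 192.168.1.0/24 to any port 22 proto tcp
sudo ufw enable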

Final thoughts

Building a Siri-like assistant on a Raspberry Pi 5 is no longer a novelty — it’s a practical solution for developers who value privacy, offline capability, and control. The combination of optimized runtimes (whisper.cpp, llama.cpp), improved hardware (NPUs on HATs), and mature orchestration patterns means you can ship a reliable local assistant with clear boundaries about what stays on-device.

Get the starter kit

Ready to move from concept to prototype? Clone a starter repository that contains the orchestrator template, sample configs for wake-word/VAD, and scripts to download quantized models. Use the modular design to iterate quickly: start with local-only models, then add hybrid cloud fallbacks and Home Assistant integrations as needed. Check a Pi micro-app example for deployment and storage patterns.

Call to action: Clone the starter repo, flash your Pi, and run the 30-minute quickstart. Share your builds and benchmarks with our community — privacy-first assistants are gaining momentum in 2026, and your feedback will help improve on-device AI for everyone.


Related Topics

#voice, #edge AI, #privacy