Making Jarvis: How We Build a Hands-Free Voice Assistant End to End
If you want a serious voice assistant that feels like “Hey Siri” for your own product, the right answer is not to bolt a chat box onto a web page and call it done. The right answer is to separate the listening layer from the intelligence layer—and to treat mobile OS rules (foreground services, widgets, background audio) as first-class design constraints.
The short answer
Use Flutter for a single mobile codebase, on-device speech for wake-word detection, an Android foreground service (and iOS audio session policy) so listening survives backgrounding, a home screen widget for at-a-glance status, and a small Node.js WebSocket backend that pipes audio to Whisper, your AI gateway (e.g. OpenClaw), and TTS—with sentence-level TTS pipelining so users hear the first sentence while the rest of the answer is still generating.
Why this architecture wins
- You get real mobile behaviour—wake word, notification, widget—instead of a web view that stops when the user switches apps.
- You can swap the brain (cloud model, OpenClaw, or future local inference) without rewriting the client.
- You keep the hot path clear: mic → WebSocket → STT → LLM → streamed text → TTS → playback queue.
- You add side services (Redis, queues, object storage for audio URLs) only when latency or scale demands it.
The clean split: device, transport, intelligence, speech
Jarvis topology — same spirit as the Local AI Factory article: proven surfaces first, swappable engines second.
| Layer | What | Job |
|---|---|---|
| Device | Flutter + platform APIs | Wake word, recording, playback, widget UI, permissions |
| Transport | WebSocket (JSON + base64 PCM) | Streaming audio up; STT, status, deltas, TTS down |
| Intelligence | OpenClaw gateway or direct LLM | Tool use, RAG, agent flows, streaming tokens |
| Speech | Whisper + TTS API | Transcribe utterances; speak replies (optionally per sentence) |
Reference codebase: our Jarvis repo
Everything in this article maps to a real DestinPQ workspace we iterate on locally: a Flutter client under jarvis_app/ and a Node backend under backend/ (Express + ws). On a dev machine the project root is typically Jarvis/, and that layout is the source of truth for protocol names, state fields, and notification copy we ship.
- Wake + Android survival: wake_word_service.dart uses speech_to_text for continuous “Hey Jarvis”-style listening and flutter_foreground_task so a foreground service keeps the process eligible while backgrounded.
- Notification UX: task_handler.dart runs in the service isolate and updates titles like “Say ‘Hey Jarvis’ to activate” vs “Recording your voice…” when the main isolate sends events.
- Home widget: home_widget with app group group.com.destinpq.jarvis and native classes JarvisWidget on Android/iOS; status strings are saved from jarvis_provider.dart so the launcher tile stays in sync.
- Wire protocol: websocket_service.dart documents outbound audio.start, audio.chunk (base64 PCM16), audio.stop and inbound stt.partial / stt.final, assistant.status, assistant.delta, assistant.audio (data URLs), and completion signals.
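To make the wire protocol concrete, here is a minimal sketch of how frames of those types could be built and routed. The message type names come from the list above; the field names (sessionId, sampleRate, data) and the helper functions are our illustrative assumptions, not the repo's actual schema.

```javascript
// Hypothetical frame builders for the outbound message types named above.
// Field names (sessionId, sampleRate, data) are assumptions for illustration.
function audioStart(sessionId) {
  return JSON.stringify({ type: 'audio.start', sessionId, sampleRate: 16000 });
}

function audioChunk(sessionId, pcm16) {
  // pcm16 is a Buffer of little-endian 16-bit samples; it travels as base64 text.
  return JSON.stringify({ type: 'audio.chunk', sessionId, data: pcm16.toString('base64') });
}

function audioStop(sessionId) {
  return JSON.stringify({ type: 'audio.stop', sessionId });
}

// Routes one raw inbound frame to a handler keyed by its type.
// Returns false for malformed JSON or for types with no registered handler.
function dispatch(raw, handlers) {
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch {
    return false;
  }
  if (!msg || typeof msg.type !== 'string') return false;
  if (!Object.prototype.hasOwnProperty.call(handlers, msg.type)) return false;
  handlers[msg.type](msg);
  return true;
}
```

A receiver would register one handler per inbound type (stt.partial, stt.final, assistant.delta, and so on) and ignore anything it does not recognise, which keeps older clients tolerant of new message types.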
Mobile path: Android
Android is strict about microphone work in the background. A foreground service with type microphone keeps the process eligible to keep running while the user is elsewhere—paired with a visible notification users can trust.
- Wake word: OS speech APIs watch for your phrase from the app's main Flutter engine; the foreground service keeps Android from killing the app aggressively.
- Home widget: Shows status and deep-links; tap opens the app into chat (handled via HomeWidget.widgetClicked).
- After playback: Resume wake-word listening so the loop feels continuous.
Mobile path: iOS
iOS does not offer the same always-on mic story as a custom Android foreground service. UIBackgroundModes: audio and a careful audio session (play-and-record, spoken audio) help while the app is active or in permitted background audio contexts—but true “app killed, still listening” is not something Apple generally allows for arbitrary third-party assistants. Plan the product narrative accordingly: excellent in-session and background-within-policy behaviour first.
Backend path: small server, big leverage
In our backend/src/index.js, a single Node process owns HTTP health checks and WebSocket sessions. Per connection we track a session id, an array of base64 PCM chunks, streaming text buffers, a TTS queue with a drain flag, and an isProcessing flag, so two overlapping audio.stop events never spawn duplicate Whisper → LLM → TTS pipelines.
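The per-connection state and the duplicate-pipeline guard can be sketched as follows. Only isProcessing is a name taken from the text; the other property names, createSession, and handleAudioStop are hypothetical stand-ins for whatever index.js actually uses.

```javascript
const { randomUUID } = require('crypto');

// Hypothetical per-connection state; property names other than isProcessing are assumptions.
function createSession() {
  return {
    id: randomUUID(),
    audioChunks: [],    // base64 PCM chunks for the current utterance
    textBuffer: '',     // streamed assistant text assembled so far
    ttsQueue: [],       // sentences waiting for synthesis
    ttsDraining: false, // true while the TTS queue is being worked through
    isProcessing: false // true while an STT -> LLM -> TTS pipeline is running
  };
}

// Runs the pipeline at most once at a time per session.
// Resolves to false when a second audio.stop arrives while one is still in flight.
async function handleAudioStop(session, runPipeline) {
  if (session.isProcessing) return false;
  session.isProcessing = true;
  const chunks = session.audioChunks;
  session.audioChunks = []; // the next utterance starts from a clean buffer
  try {
    await runPipeline(chunks);
    return true;
  } finally {
    session.isProcessing = false; // always release the guard, even on errors
  }
}
```

Because the flag is checked and set synchronously before the first await, two audio.stop frames arriving back to back cannot both pass the guard in Node's single-threaded event loop.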
audioHandler.js concatenates PCM, builds a minimal WAV (16 kHz mono), and sends it to Whisper via the OpenAI SDK; the transcript then feeds openclawHandler.js for gateway streaming. That separation keeps STT cheap to swap (other providers, local models) without touching socket framing on the client.
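For reference, building a minimal WAV from base64 PCM16 chunks takes only a 44-byte header. The sketch below is a generic version under the stated format (16 kHz, mono, 16-bit); the function name pcmChunksToWav is ours, and the real audioHandler.js may differ in its details.

```javascript
// Builds a minimal 44-byte-header WAV around raw PCM16 audio.
// pcmChunksToWav is a hypothetical name; defaults match the 16 kHz mono format above.
function pcmChunksToWav(base64Chunks, sampleRate = 16000, channels = 1) {
  // Decode each chunk separately: joining base64 strings first would break on padding.
  const pcm = Buffer.concat(base64Chunks.map((c) => Buffer.from(c, 'base64')));
  const bytesPerSample = 2; // 16-bit samples
  const blockAlign = channels * bytesPerSample;

  const header = Buffer.alloc(44);
  header.write('RIFF', 0, 'ascii');
  header.writeUInt32LE(36 + pcm.length, 4);          // file size minus the first 8 bytes
  header.write('WAVE', 8, 'ascii');
  header.write('fmt ', 12, 'ascii');
  header.writeUInt32LE(16, 16);                      // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);                       // audio format 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * blockAlign, 28); // byte rate
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bytesPerSample * 8, 34);      // bits per sample
  header.write('data', 36, 'ascii');
  header.writeUInt32LE(pcm.length, 40);              // size of the sample data

  return Buffer.concat([header, pcm]);
}
```

The resulting Buffer can be handed to whichever STT client is in use, which is what keeps the provider swappable without any change to socket framing.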
Reliability fixes that also improve perceived performance
- Wait for the AI gateway WebSocket to be open before falling back—avoids a race where the first utterance always hits the expensive fallback path.
- One shared HTTP client for OpenAI-style APIs instead of constructing a new client every request.
- Reset streaming text buffers on each new user turn (audio.start clears leftover assembled text) so stale sentences never leak into TTS.
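The first fix in that list, waiting for the gateway socket before falling back, could look roughly like this. It assumes the gateway connection behaves like a ws WebSocket (a readyState property plus open, error, and close events); waitForOpen, routeTranscript, and the 1500 ms timeout are hypothetical choices of ours.

```javascript
const OPEN = 1; // same value as WebSocket.OPEN in ws and in browsers

// Resolves true once the socket is open, or false on error, close, or timeout.
// Works with any EventEmitter-style socket that exposes readyState.
function waitForOpen(socket, timeoutMs = 1500) {
  return new Promise((resolve) => {
    if (socket.readyState === OPEN) {
      resolve(true);
      return;
    }
    const finish = (ok) => {
      clearTimeout(timer);
      socket.off('open', onOpen);
      socket.off('error', onFail);
      socket.off('close', onFail);
      resolve(ok);
    };
    const onOpen = () => finish(true);
    const onFail = () => finish(false);
    const timer = setTimeout(() => finish(false), timeoutMs);
    socket.once('open', onOpen);
    socket.once('error', onFail);
    socket.once('close', onFail);
  });
}

// Prefers the gateway; uses the fallback only if the socket never becomes ready.
async function routeTranscript(text, gateway, viaGateway, viaFallback) {
  const ready = await waitForOpen(gateway);
  return ready ? viaGateway(text) : viaFallback(text);
}
```

The timeout bounds the worst case: a gateway that is genuinely down costs one short wait per utterance, while a gateway that is merely still connecting no longer sends the first utterance down the expensive path.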
Sentence-level TTS pipelining
The biggest perceived latency win is often not a faster model; it comes from starting speech before the full answer has been generated. As streamed tokens arrive, detect complete sentences, synthesize them in order, send assistant.audio for each clip, then signal completion when nothing else is queued. On the client, queue clips with a concatenating audio source so playback begins on sentence one while sentence three is still being generated server-side.
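A minimal sketch of the server side of that pipeline, assuming a synthesize(sentence) function that returns a data URL and a send(message) function that writes to the client socket. Both are placeholders, as are the class and function names and the url field; the sentence detection is a deliberately naive regex that will also split on abbreviations such as "e.g. ".

```javascript
// Accumulates streamed text and releases sentences once they are clearly finished.
class SentenceBuffer {
  constructor() {
    this.pending = '';
  }

  // Appends a token delta and returns any sentences it completed.
  push(delta) {
    this.pending += delta;
    const out = [];
    // Terminal punctuation, optional closing quote or bracket, then whitespace.
    // Requiring whitespace avoids splitting "3.14" or a sentence still in progress.
    const boundary = /[.!?]+["')\]]*\s+/g;
    let start = 0;
    let match;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      out.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start);
    return out;
  }

  // Call when the model stream ends to release the trailing sentence.
  flush() {
    const rest = this.pending.trim();
    this.pending = '';
    return rest ? [rest] : [];
  }
}

// Synthesizes one sentence at a time so clips are sent in spoken order.
function createTtsQueue(synthesize, send, onIdle = () => {}) {
  const queue = [];
  let draining = false;
  let active = Promise.resolve();

  async function drain() {
    try {
      while (queue.length > 0) {
        const sentence = queue.shift();
        const url = await synthesize(sentence);
        send({ type: 'assistant.audio', url }); // "url" is an assumed field name
      }
      onIdle(); // nothing left queued: the caller can signal completion
    } finally {
      draining = false;
    }
  }

  return {
    enqueue(sentence) {
      queue.push(sentence);
      if (!draining) {
        draining = true;
        active = drain();
      }
      return active; // settles when the current drain run finishes
    }
  };
}
```

In use, each model delta goes through push, every returned sentence is enqueued, and flush runs once when the stream ends. Synthesizing strictly in sequence is the simplest way to preserve order; overlapping synthesis calls would cut latency further but would need an explicit reordering step before sending.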
The request flow in plain English
- User speaks the wake word; the app opens chat (if needed) and starts streaming PCM to the server.
- User stops speaking; the server builds a WAV, runs STT, emits the final transcript to the client.
- The server forwards text to the AI gateway (or fallback LLM) and relays tool status and token deltas.
- As sentences complete, the server runs TTS per sentence and streams audio messages to the client.
- The client plays the queue; when done, it resumes wake-word listening.
What we optimise in phase one
| Phase one target | What success looks like |
|---|---|
| Device | Wake word works; foreground service + widget on Android; clean mic permission story |
| Transport | Stable WebSocket; no duplicate pipelines on rapid taps |
| Intelligence | Gateway connected; streaming text to UI; graceful fallback if gateway is down |
| Speech | Whisper + TTS reliable; first sentence audible before full reply finishes |
Instrumentation and ops (lightweight)
Before you reach for Kubernetes, add structured logs around each WebSocket phase (connect, first chunk, STT done, first token, first TTS byte) and a simple /health endpoint. That is enough to debug “why was there silence for four seconds?” in the field. When you outgrow a single VM, the same process model maps cleanly to a managed WebSocket tier or a small autoscaled fleet, because session state lives per connection rather than in global singletons.
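As a rough illustration of that level of instrumentation, the sketch below emits one JSON log line per phase with elapsed times and exposes a bare /health route. It uses Node's built-in http module to stay self-contained (the backend described above uses Express), and createPhaseLogger, createHealthServer, and the log field names are hypothetical.

```javascript
const http = require('http');

// Returns a mark(phase) function that logs one JSON line per phase with timings.
// write and now are injectable so the logger can be redirected or tested.
function createPhaseLogger(sessionId, write = (line) => console.log(line), now = () => Date.now()) {
  const startedAt = now();
  let last = startedAt;
  return function mark(phase) {
    const t = now();
    const entry = { sessionId, phase, sinceStartMs: t - startedAt, sincePrevMs: t - last };
    last = t;
    write(JSON.stringify(entry));
    return entry;
  };
}

// Minimal health endpoint; getStats can add counters such as open sessions.
function createHealthServer(getStats = () => ({})) {
  return http.createServer((req, res) => {
    if (req.method === 'GET' && req.url === '/health') {
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ status: 'ok', uptimeSec: Math.round(process.uptime()), ...getStats() }));
      return;
    }
    res.writeHead(404);
    res.end();
  });
}
```

Calling mark('stt_done') and mark('first_token') in the relevant handlers makes a four-second silence show up as a large sincePrevMs on one specific phase, and createHealthServer().listen(port) is enough for a load balancer probe.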
Closing recommendation
If your goal is a product-grade Jarvis, keep the phone responsible for when and how we listen, keep the server responsible for transcription, orchestration, and speech synthesis, and keep the model layer pluggable so you can evolve from cloud to hybrid to local without rewriting the app. That is how you ship something that feels like a real assistant—not a demo that stops the moment the user presses Home.
Related reading
- Local AI Factory: VS Code + Continue + Ollama + Gemma 4 Architecture — DestinPQ Blog
- Flutter · docs.flutter.dev
- Android foreground services · developer.android.com
Pratik Khanapurkar
Co-founder, DestinPQ
Pratik builds AI-powered products for businesses across healthcare, hospitality, and professional services. He writes about practical AI adoption, real model costs, and what actually works in production.