Making Jarvis: How We Build a Hands-Free Voice Assistant End to End

Pratik Khanapurkar · Co-founder, DestinPQ · April 9, 2026 · 11 min read

Flutter wake word, WebSocket backend, Whisper STT, and sentence-level TTS pipelining.

If you want a serious voice assistant that feels like “Hey Siri” for your own product, the right answer is not to bolt a chat box onto a web page and call it done. The right answer is to separate the listening layer from the intelligence layer, and to treat mobile OS rules (foreground services, widgets, background audio) as first-class design constraints.

The short answer

Use Flutter for a single mobile codebase, on-device speech for wake-word detection, an Android foreground service (and iOS audio session policy) so listening survives backgrounding, a home screen widget for at-a-glance status, and a small Node.js WebSocket backend that pipes audio to Whisper, your AI gateway (e.g. OpenClaw), and TTS, with sentence-level TTS pipelining so users hear the first sentence while the rest of the answer is still generating.

Why this architecture wins

You get real mobile behaviour (wake word, notification, widget) instead of a web view that stops when the user switches apps.
You can swap the brain (cloud model, OpenClaw, or future local inference) without rewriting the client.
You keep the hot path clear: mic → WebSocket → STT → LLM → streamed text → TTS → playback queue.
You add side services (Redis, queues, object storage for audio URLs) only when latency or scale demands it.

The clean split: device, transport, intelligence, speech

Jarvis topology, in the same spirit as the Local AI Factory article: proven surfaces first, swappable engines second.

Layer	What	Job
Device	Flutter + platform APIs	Wake word, recording, playback, widget UI, permissions
Transport	WebSocket (JSON + base64 PCM)	Streaming audio up; STT, status, deltas, TTS down
Intelligence	OpenClaw gateway or direct LLM	Tool use, RAG, agent flows, streaming tokens
Speech	Whisper + TTS API	Transcribe utterances; speak replies (optionally per sentence)

Reference codebase: our Jarvis repo

Everything in this article maps to a real DestinPQ workspace we iterate on locally: a Flutter client under jarvis_app/ and a Node backend under backend/ (Express + ws). On a dev machine the project root is typically Jarvis/; that layout is the source of truth for protocol names, state fields, and notification copy we ship.

Wake + Android survival: wake_word_service.dart uses speech_to_text for continuous “Hey Jarvis”-style listening and flutter_foreground_task so a foreground service keeps the process eligible while backgrounded.
Notification UX: task_handler.dart runs in the service isolate and updates titles like “Say ‘Hey Jarvis’ to activate” vs “Recording your voice…” when the main isolate sends events.
Home widget: home_widget with app group group.com.destinpq.jarvis and native classes JarvisWidget on Android/iOS; status strings are saved from jarvis_provider.dart so the launcher tile stays in sync.
Wire protocol: websocket_service.dart documents outbound audio.start, audio.chunk (base64 PCM16), audio.stop and inbound stt.partial / stt.final, assistant.status, assistant.delta, assistant.audio (data URLs), and completion signals.
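
To make the framing concrete, here is a minimal sketch of how an audio.chunk message could be encoded and decoded. The message type names come from the list above; the JSON field layout (`type`, `data`) is an assumption, since only the repo defines the real envelope, and both function names are hypothetical.

```javascript
// Hypothetical envelope: { type, data }. Only the `type` value is from the
// article; the `data` field name is an assumption for illustration.

/** Encode 16-bit PCM samples as an outbound audio.chunk frame (JSON text). */
function encodeAudioChunk(samples) {
  const bytes = Buffer.alloc(samples.length * 2);
  // Write each sample explicitly as little-endian PCM16.
  samples.forEach((sample, i) => bytes.writeInt16LE(sample, i * 2));
  return JSON.stringify({ type: "audio.chunk", data: bytes.toString("base64") });
}

/** Return the PCM bytes of an audio.chunk frame, or null for any other frame. */
function decodeAudioChunk(frame) {
  const msg = JSON.parse(frame);
  if (msg.type !== "audio.chunk" || typeof msg.data !== "string") return null;
  return Buffer.from(msg.data, "base64");
}
```

Base64 inflates the payload by roughly a third compared with binary frames, which is the price of keeping every message in one JSON format.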

Mobile path: Android

Android is strict about microphone work in the background. A foreground service with type microphone keeps the process eligible to keep running while the user is elsewhere, paired with a visible notification users can trust.

Wake word: OS speech APIs watch for your phrase on the main engine; the foreground service keeps the app from being killed aggressively.
Home widget: Shows status and deep-links; tap opens the app into chat (handled via HomeWidget.widgetClicked).
After playback: Resume wake-word listening so the loop feels continuous.

Mobile path: iOS

iOS does not offer the same always-on mic story as a custom Android foreground service. UIBackgroundModes: audio and a careful audio session (play-and-record, spoken audio) help while the app is active or in permitted background audio contexts, but true “app killed, still listening” is not something Apple generally allows for arbitrary third-party assistants. Plan the product narrative accordingly: excellent in-session and background-within-policy behaviour first.

Backend path: small server, big leverage

In our backend/src/index.js, a single Node process owns HTTP health checks and WebSocket sessions. Per connection we track: session id, an array of base64 PCM chunks, streaming text buffers, a TTS queue and drain flag, and isProcessing so two overlapping audio.stop events never spawn duplicate Whisper → LLM → TTS pipelines.
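
As a rough illustration of that per-connection state and the isProcessing guard, consider the sketch below. The field names paraphrase the list above, and `createSession` and `handleAudioStop` are hypothetical helpers, not the functions in backend/src/index.js.

```javascript
// Per-connection state. Field names are illustrative assumptions.
function createSession(id) {
  return {
    id,
    audioChunks: [],     // base64 PCM chunks received since audio.start
    textBuffer: "",      // streamed assistant text not yet spoken
    ttsQueue: [],        // sentences waiting for synthesis
    ttsDraining: false,  // true while the TTS queue is being drained
    isProcessing: false, // true while an STT -> LLM -> TTS pipeline runs
  };
}

// Runs `runPipeline` unless one is already in flight for this session.
// Resolves to true if a pipeline ran, false if the event was a duplicate.
async function handleAudioStop(session, runPipeline) {
  if (session.isProcessing) return false; // ignore overlapping audio.stop
  session.isProcessing = true;
  try {
    await runPipeline(session);
    return true;
  } finally {
    session.isProcessing = false; // always release, even if the pipeline throws
  }
}
```

Because the flag is set before the first `await`, a second audio.stop that arrives while the pipeline is still running returns immediately instead of starting another one.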

audioHandler.js concatenates PCM, builds a minimal WAV (16 kHz mono), and sends it to Whisper via the OpenAI SDK; the transcript then feeds openclawHandler.js for gateway streaming. That separation keeps STT cheap to swap (other providers, local models) without touching socket framing on the client.
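
The WAV step is small enough to show in full. This is a sketch of a standard 44-byte RIFF/WAVE header for 16 kHz mono PCM16, not a copy of audioHandler.js; the function names `pcm16ToWav` and `chunksToWav` are hypothetical.

```javascript
// Wrap raw little-endian PCM16 audio in a minimal 44-byte RIFF/WAVE header.
function pcm16ToWav(pcm, sampleRate = 16000, channels = 1) {
  const bitsPerSample = 16;
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const header = Buffer.alloc(44);
  header.write("RIFF", 0, "ascii");
  header.writeUInt32LE(36 + pcm.length, 4);  // total file size minus 8 bytes
  header.write("WAVE", 8, "ascii");
  header.write("fmt ", 12, "ascii");
  header.writeUInt32LE(16, 16);              // fmt chunk size for plain PCM
  header.writeUInt16LE(1, 20);               // audio format 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36, "ascii");
  header.writeUInt32LE(pcm.length, 40);      // size of the PCM payload
  return Buffer.concat([header, pcm]);
}

// Decode the buffered base64 chunks, join them, and wrap them as one WAV.
function chunksToWav(base64Chunks) {
  const pcm = Buffer.concat(base64Chunks.map((c) => Buffer.from(c, "base64")));
  return pcm16ToWav(pcm);
}
```

The resulting buffer can be handed to whichever STT client you use. Keeping the header builder free of any provider code is what makes the STT engine easy to swap.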

Reliability fixes that also improve perceived performance

Wait for the AI gateway WebSocket to be open before falling back; this avoids a race where the first utterance always hits the expensive fallback path.
One shared HTTP client for OpenAI-style APIs instead of constructing a new client every request.
Reset streaming text buffers on each new user turn (audio.start clears leftover assembled text) so stale sentences never leak into TTS.
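
The first and third fixes can be sketched as follows. This assumes a `ws`-style socket object (an event emitter whose `readyState` is 1 when open); `waitForOpen`, `resetTurn`, the 2000 ms default, and the session field names are all hypothetical choices for illustration.

```javascript
// Resolve true once `socket` is open, or false if `timeoutMs` passes first.
// The caller uses the fallback LLM only when this resolves false.
function waitForOpen(socket, timeoutMs = 2000) {
  if (socket.readyState === 1) return Promise.resolve(true); // already OPEN
  return new Promise((resolve) => {
    const timer = setTimeout(() => {
      socket.off("open", onOpen); // stop listening once we give up
      resolve(false);
    }, timeoutMs);
    function onOpen() {
      clearTimeout(timer);
      resolve(true);
    }
    socket.once("open", onOpen);
  });
}

// Called on audio.start: drop anything left over from the previous turn.
// Clearing audioChunks here as well is an assumption beyond the article.
function resetTurn(session) {
  session.textBuffer = "";
  session.audioChunks = [];
}
```

The timeout matters: without it, a gateway that never connects would hold the user's first utterance indefinitely instead of degrading to the fallback.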

Sentence-level TTS pipelining

The biggest perceived latency win is often not a faster model; it is not waiting for the full answer before speech starts. As streamed tokens arrive, detect complete sentences, synthesize them in order, send assistant.audio for each clip, then signal completion when nothing else is queued. On the client, queue clips with a concatenating audio source so playback begins on sentence one while sentence three is still being generated server-side.
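
Sentence detection over a token stream is the piece most worth seeing in code. The sketch below is one simple approach, assuming a sentence ends at `.`, `!` or `?` followed by whitespace; the `SentenceBuffer` class is hypothetical and is not taken from the repo.

```javascript
// Accumulates streamed text deltas and emits each sentence once it is complete.
class SentenceBuffer {
  constructor() {
    this.pending = ""; // text received so far that has not formed a sentence
  }

  // Append a delta; return the sentences that are now complete, in order.
  push(delta) {
    this.pending += delta;
    const sentences = [];
    // Requiring whitespace after the punctuation avoids splitting "3.14"
    // and avoids emitting a sentence before the next token confirms it ended.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start);
    return sentences;
  }

  // Call when the token stream ends to release whatever text is left.
  flush() {
    const rest = this.pending.trim();
    this.pending = "";
    return rest ? [rest] : [];
  }
}
```

Feeding it the deltas `"Hello the"`, `"re. How are"`, and `" you? Pi is 3.14"` yields `"Hello there."` after the second delta and `"How are you?"` after the third; `flush()` then returns `"Pi is 3.14"`. A rule this simple will split after abbreviations such as "Dr.", which costs an awkward pause rather than a wrong answer.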

The request flow in plain English

User speaks the wake word; the app opens chat (if needed) and starts streaming PCM to the server.
User stops speaking; the server builds a WAV, runs STT, emits the final transcript to the client.
The server forwards text to the AI gateway (or fallback LLM) and relays tool status and token deltas.
As sentences complete, the server runs TTS per sentence and streams audio messages to the client.
The client plays the queue; when done, it resumes wake-word listening.
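
The fourth step, synthesizing per sentence while keeping clips in order, could look roughly like this. `synthesize` and `send` are injected stand-ins for the real TTS call and socket write, and the `url` field name is an assumption; the article only says that assistant.audio carries data URLs.

```javascript
// Drain `session.ttsQueue` one sentence at a time so clips go out in order.
// Safe to call after every new sentence: a second call while a drain is
// already running returns immediately and lets the running loop pick it up.
async function drainTtsQueue(session, synthesize, send) {
  if (session.ttsDraining) return;
  session.ttsDraining = true;
  try {
    while (session.ttsQueue.length > 0) {
      const sentence = session.ttsQueue.shift();
      const audioUrl = await synthesize(sentence); // one TTS request at a time
      send({ type: "assistant.audio", url: audioUrl });
    }
  } finally {
    session.ttsDraining = false; // release even if synthesis throws
  }
}
```

Synthesizing strictly one sentence at a time is the simplest way to guarantee ordering. Running requests in parallel and reordering the results would reduce gaps between clips, at the cost of more bookkeeping.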

What we optimise in phase one

Phase one target	What success looks like
Device	Wake word works; foreground service + widget on Android; clean mic permission story
Transport	Stable WebSocket; no duplicate pipelines on rapid taps
Intelligence	Gateway connected; streaming text to UI; graceful fallback if gateway is down
Speech	Whisper + TTS reliable; first sentence audible before full reply finishes

Instrumentation and ops (lightweight)

Before you reach for Kubernetes, add structured logs around each WebSocket phase (connect, first chunk, STT done, first token, first TTS byte) and a simple /health endpoint. That is enough to debug “why was there silence for four seconds?” in the field. When you outgrow a single VM, the same process model maps cleanly to a managed WebSocket tier or a small autoscaled fleet, because session state lives on each connection rather than in global singletons.
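
A phase timer for those logs can be a few lines. In this sketch, `createTurnTimer` is a hypothetical helper, the phase labels are assumptions, and the clock and log sink are injectable so the helper can be tested without real time passing.

```javascript
// Returns a `mark(phase)` function that logs one JSON line per phase with
// the time since the turn started and since the previous phase.
function createTurnTimer(sessionId, now = Date.now, log = console.log) {
  const startedAt = now();
  let previous = startedAt;
  return function mark(phase) {
    const t = now();
    const entry = {
      sessionId,
      phase,
      sinceStartMs: t - startedAt,
      sincePrevMs: t - previous,
    };
    previous = t;
    log(JSON.stringify(entry));
    return entry;
  };
}
```

Typical use is one timer per turn: create it on audio.start, then call `mark("stt_done")`, `mark("first_token")`, and `mark("first_tts_byte")` as each phase completes. A four-second silence then shows up as a large `sincePrevMs` on exactly one line.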

Closing recommendation

If your goal is a product-grade Jarvis, keep the phone responsible for when and how we listen, keep the server responsible for transcription, orchestration, and speech synthesis, and keep the model layer pluggable so you can evolve from cloud to hybrid to local without rewriting the app. That is how you ship something that feels like a real assistant, not a demo that stops the moment the user presses Home.