Voice that never leaves the device
Transcription, diarisation, and speech running entirely on Apple silicon, and why keeping voice local is a product decision before it is a technical one.
22 Jun 2026 · 3 min
A voice recording is one of the most personal files a person owns. It is a meeting they were not supposed to repeat, a doctor's appointment, a kid's first sentence, a half-formed idea they said out loud in the car. The default for years was to hand all of that to a server you do not control so it could be turned into text. I think that default is wrong, and Apple silicon is now fast enough that I do not have to accept it.
So the whole pipeline runs on the device. Audio in, text out, and nothing in between touches a network.
The pipeline, on-device
Three stages, all local. Transcription turns the waveform into words. Diarisation figures out who spoke when, so the transcript reads like a conversation instead of a wall. Speech synthesis goes the other way, turning text back into audio for the parts of the app that talk back.
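As a rough illustration of how the first two stages combine, the sketch below merges word timings from a transcriber with speaker turns from a diariser into speaker-labelled lines. The `Word` and `SpeakerTurn` types and the `labelTranscript` function are hypothetical names invented for this example, and assigning each word by its midpoint is one simple strategy among several, not necessarily what the app does.

```swift
import Foundation

// Hypothetical types for illustration only.
struct Word {
    let text: String
    let start: TimeInterval  // seconds from the start of the recording
    let end: TimeInterval
}

struct SpeakerTurn {
    let speaker: String
    let start: TimeInterval
    let end: TimeInterval
}

/// Assigns each word to the speaker turn that contains its midpoint, then
/// groups consecutive words from the same speaker into one line.
func labelTranscript(words: [Word], turns: [SpeakerTurn]) -> [String] {
    var lines: [(speaker: String, words: [String])] = []
    for word in words {
        let midpoint = (word.start + word.end) / 2
        // Words that fall outside every turn are labelled "Unknown".
        let speaker = turns.first(where: { $0.start <= midpoint && midpoint < $0.end })?.speaker ?? "Unknown"
        if let last = lines.last, last.speaker == speaker {
            lines[lines.count - 1].words.append(word.text)
        } else {
            lines.append((speaker: speaker, words: [word.text]))
        }
    }
    return lines.map { "\($0.speaker): \($0.words.joined(separator: " "))" }
}

let words = [
    Word(text: "Hello", start: 0.0, end: 0.4),
    Word(text: "there", start: 0.4, end: 0.8),
    Word(text: "Hi", start: 1.2, end: 1.5),
]
let turns = [
    SpeakerTurn(speaker: "Speaker 1", start: 0.0, end: 1.0),
    SpeakerTurn(speaker: "Speaker 2", start: 1.0, end: 2.0),
]
let transcript = labelTranscript(words: words, turns: turns)
transcript.forEach { print($0) }
// prints "Speaker 1: Hello there" and then "Speaker 2: Hi"
```

Everything here is plain array work on data already in memory, which is the point: once both models have run, turning their output into a conversation needs no network at all.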
Each of these used to be a cloud API with a per-minute price and a privacy footnote. On current hardware they are model files that load into unified memory and run on the GPU and the Neural Engine without ever opening a socket. The unified memory part matters more than the raw speed: the audio buffer and the model live in the same place, so you are not paying to copy a large array back and forth before any work happens.
Local-first is a product stance
The technical case is easy. The product case is the one I actually care about. When the pipeline is local, I get to make a promise I can keep: your recording stays on your phone. Not "encrypted in transit." Not "deleted after thirty days." It never leaves.
That promise changes what people are willing to record. The ceiling on a transcription app is not its accuracy; it is how much a person is willing to trust it with. Move the trust problem off the table and the app becomes useful for the recordings that matter, not just the ones safe to leak.
What you trade, and why it is worth it
Local is not free. The app ships hundreds of megabytes of model weights, cold start costs something, and an older device runs the same work slower. I bundle the models, warm them on launch, and degrade gracefully when the chip is a few years old.
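One way to picture "warm on launch, degrade gracefully" is the sketch below. It is a minimal example under stated assumptions: the `ModelTier` cases, the 6 GiB threshold, and the `ModelStore` class with its injected `load` closure are all hypothetical, chosen only to show the shape of the idea. The one real API it relies on is `ProcessInfo.processInfo.physicalMemory`.

```swift
import Foundation

// Hypothetical tiers; the 6 GiB threshold is an illustrative assumption.
enum ModelTier {
    case large  // best accuracy, largest memory footprint
    case small  // fallback for older or memory-constrained devices
}

/// Picks a model tier from the amount of physical memory on the device.
func selectModelTier(physicalMemoryBytes: UInt64) -> ModelTier {
    let sixGiB: UInt64 = 6 * 1024 * 1024 * 1024
    return physicalMemoryBytes >= sixGiB ? .large : .small
}

/// Loads a model once at launch so the first request does not pay the cold-start cost.
final class ModelStore {
    private(set) var loadedTier: ModelTier?
    private let load: (ModelTier) throws -> Void  // hypothetical loader for bundled weights

    init(load: @escaping (ModelTier) throws -> Void) {
        self.load = load
    }

    /// Tries the preferred tier first and falls back to the small model if loading fails.
    func warmUp(physicalMemoryBytes: UInt64 = ProcessInfo.processInfo.physicalMemory) {
        let preferred = selectModelTier(physicalMemoryBytes: physicalMemoryBytes)
        let candidates: [ModelTier] = preferred == .large ? [.large, .small] : [.small]
        for tier in candidates {
            do {
                try load(tier)
                loadedTier = tier
                return
            } catch {
                continue  // try the next, smaller tier
            }
        }
    }
}
```

Calling `warmUp()` early in app launch would leave `loadedTier` set to whichever tier actually loaded, or `nil` if none did, so later code can decide what to show the user.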
let request = SpeechTranscriptionRequest(locale: .current)
request.requiresOnDeviceRecognition = true // never falls back to a server
request.addsPunctuation = true
let result = try await analyzer.analyze(audioBuffer, with: request)
print(result.bestTranscription.formattedString)
That one flag is the entire philosophy in a line. If the on-device path cannot run, the request fails loudly instead of quietly shipping audio off the phone. I would rather show an error than break the promise.
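For comparison, the same fail-loudly guard can be sketched against the Speech framework's `SFSpeechRecognizer` API, where `requiresOnDeviceRecognition`, `supportsOnDeviceRecognition`, and `addsPunctuation` are documented properties. This is an assumption about one possible shape, not the app's actual code: `TranscriptionError`, `requireOnDevice`, and `makeOnDeviceRequest` are hypothetical names, and the Speech-specific part only compiles on Apple platforms.

```swift
import Foundation
#if canImport(Speech)
import Speech
#endif

// Hypothetical error type for illustration.
enum TranscriptionError: Error, Equatable {
    case recognizerUnavailable
    case onDeviceUnsupported
}

/// Throws when the on-device path is not available, instead of letting
/// a request fall back to a server.
func requireOnDevice(supported: Bool) throws {
    guard supported else { throw TranscriptionError.onDeviceUnsupported }
}

#if canImport(Speech)
/// Builds a recognition request that is only allowed to run on the device.
@available(iOS 13, macOS 10.15, *)
func makeOnDeviceRequest(locale: Locale = .current) throws -> SFSpeechAudioBufferRecognitionRequest {
    guard let recognizer = SFSpeechRecognizer(locale: locale), recognizer.isAvailable else {
        throw TranscriptionError.recognizerUnavailable
    }
    // Check support up front so the failure is explicit.
    try requireOnDevice(supported: recognizer.supportsOnDeviceRecognition)

    let request = SFSpeechAudioBufferRecognitionRequest()
    request.requiresOnDeviceRecognition = true  // audio must not be sent to a server
    if #available(iOS 16, macOS 13, *) {
        request.addsPunctuation = true
    }
    return request
}
#endif
```

Checking `supportsOnDeviceRecognition` before building the request means the error surfaces where the UI can explain it, rather than partway through a recognition task.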
The part that surprised me
I expected local-first to feel like a compromise, the worthy-but-worse option. It does not. With no round trip, transcription starts the instant you stop talking, and it works on a plane, in a basement, in a tunnel, anywhere the audio happens to be. The privacy was the reason I built it this way. The speed and the offline behaviour turned out to be the features people actually notice.
The cloud was never the point. It was a workaround for hardware that could not do this yet. The hardware caught up, so I stopped working around it.