Who, what, when: why I stopped trusting one diarization pipeline
Diarization is three questions wearing one coat. Split them into specialists, never merge across a turn, and the failures stop hiding inside each other.
14 Jun 2026 · 3 min
Speaker diarization is the task of labelling who spoke when in a recording. Most pipelines try to answer that as one question, and most pipelines are mediocre at it, because it is not one question. It is three, and they fail in different ways. When you fuse them into a single model you cannot tell which of the three is lying to you.
So I stopped fusing them. The pipeline I run now is three specialists, each answering exactly one question, and a hard rule that they never reach across a speaker turn into each other's work.
Three questions, three answers
WHEN is "where are the turn boundaries." tinydiarize answers that. It emits a speaker_turn flag at the points where the speaker changes, and that is all I ask of it: cut points, nothing else.
WHAT is "what words, at what times." whisper large-v2 answers that. It gives me the transcript with millisecond-level timings on each word. It is not asked who said anything. It only writes down what was said and exactly when.
WHO is "which speaker is this." WeSpeaker embeddings answer that, with spectral clustering on top. Each segment becomes an embedding, the embeddings get clustered, and the clusters are the speakers. The transcriber never votes on identity and the identity model never transcribes.
Keeping them separate is the whole trick. When the output is wrong I can see whether the boundary was misplaced, the words were misheard, or the speaker was misattributed, because those are three distinct stages with three distinct outputs. A fused model just hands you a wrong answer with no seam to pull on.
Never merge across a turn
The boundaries from tinydiarize are not used raw. A cut that lands in the middle of a word is worse than useless; it splits a phrase across two speaker labels and corrupts both. So every cut point gets snapped to the largest inter-word silence inside a window of plus or minus 1.5 seconds around it. The transcriber already told me where the silences are. I move the boundary to the quietest nearby gap, which means a label change never lands mid-phrase. The three specialists stay in their lanes, and the lane markers are real silences, not guesses.
The clustering detail that actually matters
Spectral clustering needs an affinity matrix, and how you build it from cosine similarities is where most implementations quietly break. Cosine similarity runs from -1 to 1. The lazy move is to rescale that whole range into [0, 1]. Do not. Rescaling maps a strongly-negative pair, two clearly different speakers, to a small positive affinity instead of zero, which tells the clustering they are a little bit connected when they are not. That inflates the off-diagonal mass, washes out the eigengap, and the speaker count estimate falls apart.
Instead, clamp the negatives to 0 and leave the positives alone. Different-speaker pairs become true zeros, the block structure stays sharp, and the eigengap in the normalized Laplacian is clean enough to read the speaker count straight off it. I do not ask the user how many speakers there are. The eigengap tells me.
What it cannot do
I trust this chain because I know exactly where it breaks. tinydiarize is English-only, so a bilingual recording is out. It does not handle overlapping speech, so two people talking at once collapse into a mess no boundary can fix. And a recording spliced from several monologues, with no real conversational turns, tends to collapse to a single speaker because there is nothing for the boundary detector to catch. None of these are bugs I can patch. They are the edges of the approach, and naming them is part of trusting it.
The reason it is built this way
The whole chain runs offline on a laptop. tinydiarize, whisper, WeSpeaker, the clustering: none of it phones home. A private meeting goes in one end and labelled transcript comes out the other, and the audio never leaves the machine. That is not a nice-to-have. It is the reason for the architecture. Once you commit to running locally, you cannot lean on a giant hosted model to paper over the seams, so you build the seams well instead, three honest specialists you can inspect, rather than one black box you have to send your private audio to.
- Estimate speaker count with the Laplacian eigengapDiarization needs to guess how many speakers are in a clip. The eigengap of the graph Laplacian gives you k without asking for it up front.Snippet
- whisper_scheduleA recording goes in, a speaker-labelled transcript comes out.Lab
- Chunk audio with VAD before you transcribeFeeding long silent audio to a transcriber wastes time and money. Split on speech first using webrtcvad, then send only the chunks that contain voice.Snippet
- Voice that never leaves the deviceTranscription, diarisation, and speech running entirely on Apple silicon, and why keeping voice local is a product decision before it is a technical one.Musing
- One voice note, five diariesIn NutriM8 you can mumble your whole day into your phone once. A background worker untangles it into sleep, weight, exercise, hydration and food, and resolves "a snack after lunch" to a real timestamp.Musing