Who, what, when: why I stopped trusting one diarization pipeline

Diarization is three questions wearing one coat. Split them into specialists, never merge across a turn, and the failures stop hiding inside each other.

14 Jun 2026 · 3 min

Speaker diarization is the task of labelling who spoke when in a recording. Most pipelines try to answer that as one question, and most pipelines are mediocre at it, because it is not one question. It is three, and they fail in different ways. When you fuse them into a single model you cannot tell which of the three is lying to you.

So I stopped fusing them. The pipeline I run now is three specialists, each answering exactly one question, and a hard rule that they never reach across a speaker turn into each other's work.

Three questions, three answers

WHEN is "where are the turn boundaries." tinydiarize answers that. It emits a speaker_turn flag at the points where the speaker changes, and that is all I ask of it: cut points, nothing else.

WHAT is "what words, at what times." whisper large-v2 answers that. It gives me the transcript with millisecond-level timings on each word. It is not asked who said anything. It only writes down what was said and exactly when.

WHO is "which speaker is this." WeSpeaker embeddings answer that, with spectral clustering on top. Each segment becomes an embedding, the embeddings get clustered, and the clusters are the speakers. The transcriber never votes on identity and the identity model never transcribes.

Keeping them separate is the whole trick. When the output is wrong I can see whether the boundary was misplaced, the words were misheard, or the speaker was misattributed, because those are three distinct stages with three distinct outputs. A fused model just hands you a wrong answer with no seam to pull on.

Never merge across a turn

The boundaries from tinydiarize are not used raw. A cut that lands in the middle of a word is worse than useless; it splits a phrase across two speaker labels and corrupts both. So every cut point gets snapped to the largest inter-word silence inside a window of plus or minus 1.5 seconds around it. The transcriber already told me where the silences are. I move the boundary to the quietest nearby gap, which means a label change never lands mid-phrase. The three specialists stay in their lanes, and the lane markers are real silences, not guesses.

The clustering detail that actually matters

Spectral clustering needs an affinity matrix, and how you build it from cosine similarities is where most implementations quietly break. Cosine similarity runs from -1 to 1. The lazy move is to rescale that whole range into [0, 1]. Do not. Rescaling maps a strongly-negative pair, two clearly different speakers, to a small positive affinity instead of zero, which tells the clustering they are a little bit connected when they are not. That inflates the off-diagonal mass, washes out the eigengap, and the speaker count estimate falls apart.

Instead, clamp the negatives to 0 and leave the positives alone. Different-speaker pairs become true zeros, the block structure stays sharp, and the eigengap in the normalized Laplacian is clean enough to read the speaker count straight off it. I do not ask the user how many speakers there are. The eigengap tells me.

What it cannot do

I trust this chain because I know exactly where it breaks. tinydiarize is English-only, so a bilingual recording is out. It does not handle overlapping speech, so two people talking at once collapse into a mess no boundary can fix. And a recording spliced from several monologues, with no real conversational turns, tends to collapse to a single speaker because there is nothing for the boundary detector to catch. None of these are bugs I can patch. They are the edges of the approach, and naming them is part of trusting it.

The reason it is built this way

The whole chain runs offline on a laptop. tinydiarize, whisper, WeSpeaker, the clustering: none of it phones home. A private meeting goes in one end and labelled transcript comes out the other, and the audio never leaves the machine. That is not a nice-to-have. It is the reason for the architecture. Once you commit to running locally, you cannot lean on a giant hosted model to paper over the seams, so you build the seams well instead, three honest specialists you can inspect, rather than one black box you have to send your private audio to.

voice ai local-models architecture