Python
Estimate speaker count with the Laplacian eigengap
Diarization has to guess how many speakers are in a clip. The eigengap of the graph Laplacian lets you estimate k from the data instead of supplying it up front.
27 Jun 2026
Build a clean affinity matrix, read k off the largest gap in the sorted eigenvalues, then cluster with that k.
import numpy as np
from sklearn.cluster import SpectralClustering


def estimate_speakers(embeddings: np.ndarray, max_k: int = 8):
    n = embeddings.shape[0]
    # The eigengap needs at least two eigenvalues to compare.
    if n < 2:
        return n, np.zeros(n, dtype=int)

    # L2-normalize so the dot product is cosine similarity.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T

    # CLAMP negatives to 0. Do NOT rescale [-1, 1] to [0, 1]:
    # rescaling maps near-orthogonal (different-speaker) pairs at ~0
    # up to 0.5, inflating cross-speaker edges and washing out the
    # eigengap. Clamping leaves different-speaker pairs near 0.
    affinity = np.clip(sim, 0.0, 1.0)
    np.fill_diagonal(affinity, 1.0)

    # Normalized (symmetric) graph Laplacian: L = I - D^-1/2 A D^-1/2.
    deg = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    lap = np.eye(n) - (
        affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    )

    # Eigenvalues in ascending order; the largest gap among the first
    # max_k + 1 of them marks the cluster count.
    eigvals = np.sort(np.linalg.eigvalsh(lap))
    upper = min(max_k, len(eigvals) - 1)
    gaps = np.diff(eigvals[: upper + 1])
    k = int(np.argmax(gaps)) + 1

    # A single speaker needs no clustering.
    if k < 2:
        return 1, np.zeros(n, dtype=int)

    labels = SpectralClustering(
        n_clusters=k, affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    return k, labels

Gotchas
The clamp is the whole trick. It is tempting to rescale cosine similarity from [-1, 1] to [0, 1] so everything is non-negative, but that pushes orthogonal embeddings (the signature of two different speakers) to 0.5 and makes every pair look half-connected. The eigengap flattens and you over-merge speakers. Clamp, do not rescale.
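To see the difference in numbers, here is a minimal sketch on synthetic data, not on real speaker embeddings. The helpers `eigengap_k` and `toy_embeddings` are hypothetical names introduced for this illustration. The toy setup (three orthogonal centroids in 64 dimensions with small Gaussian noise) is an assumption chosen to mimic well-separated speakers.

```python
import numpy as np


def eigengap_k(affinity: np.ndarray, max_k: int = 8) -> int:
    """Read k off the largest gap in the smallest Laplacian eigenvalues."""
    deg = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    lap = np.eye(len(affinity)) - (
        affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    )
    eigvals = np.sort(np.linalg.eigvalsh(lap))
    upper = min(max_k, len(eigvals) - 1)
    return int(np.argmax(np.diff(eigvals[: upper + 1]))) + 1


def toy_embeddings(n_speakers=3, per_speaker=20, dim=64, noise=0.05, seed=0):
    """Synthetic stand-in for speaker embeddings (an assumption, not real data)."""
    rng = np.random.default_rng(seed)
    centroids = np.eye(dim)[:n_speakers]            # mutually orthogonal "speakers"
    x = np.repeat(centroids, per_speaker, axis=0)   # per_speaker copies of each
    x = x + noise * rng.standard_normal(x.shape)    # jitter around each centroid
    return x / np.linalg.norm(x, axis=1, keepdims=True)


x = toy_embeddings()
sim = x @ x.T

k_clamp = eigengap_k(np.clip(sim, 0.0, 1.0))   # negatives clamped to 0
k_rescale = eigengap_k((sim + 1.0) / 2.0)      # [-1, 1] mapped to [0, 1]

print("clamped affinity  -> k =", k_clamp)
print("rescaled affinity -> k =", k_rescale)
```

On this toy data the clamped affinity should recover the three speakers, while the rescaled one lifts every cross-speaker edge to about 0.5, so the largest gap moves to the front of the spectrum and the estimate collapses toward a single cluster. Real embeddings are noisier, so treat the exact numbers as illustrative only.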
- whisper_schedule: A recording goes in, a speaker-labelled transcript comes out. (Lab)
- Chunk audio with VAD before you transcribe: Feeding long silent audio to a transcriber wastes time and money. Split on speech first using webrtcvad, then send only the chunks that contain voice. (Snippet)
- Voice that never leaves the device: Transcription, diarisation, and speech running entirely on Apple silicon, and why keeping voice local is a product decision before it is a technical one. (Musing)
- One voice note, five diaries: In NutriM8 you can mumble your whole day into your phone once. A background worker untangles it into sleep, weight, exercise, hydration and food, and resolves "a snack after lunch" to a real timestamp. (Musing)
- speech-swift: An on-device speech stack for Apple Silicon, in Swift. (Tool)