whisper_schedule

A recording goes in, a speaker-labelled transcript comes out.

system · whisper_schedule (hosted transcription + diarization service)

Input · pick a sample

· fetch media
· transcode to 16 kHz WAV (ffmpeg)
· transcribe (whisper.cpp large-v2)
· detect turns (tinydiarize)
· embed + cluster (WeSpeaker)

Real vs simulated

This demo replays two captured sample results in your browser. It calls no server and uploads no audio. The real service runs whisper.cpp and a Python diarization sidecar behind a job queue, not in this page.

How it works

Pick a sample, run the pipeline, and watch it move through the real stages: fetch the media, transcode to 16 kHz WAV with ffmpeg, transcribe with whisper.cpp, find the speaker turns, then embed and cluster each turn to label who said what.

whisper_schedule is a job service, not a single function. You POST a job, which lands in a strict FIFO queue (BullMQ on Redis). A worker then runs the pipeline: download the media, transcode it to WAV with ffmpeg, run whisper.cpp for the transcript with millisecond word timings, then diarize.
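As a rough illustration of the transcode step, the sketch below builds an ffmpeg command line for 16 kHz, 16-bit PCM WAV output, the input format whisper.cpp expects. The function name `ffmpeg_to_wav_args` is hypothetical, and the mono downmix and overwrite flag are assumptions; the service's actual invocation is not shown on this page.

```python
def ffmpeg_to_wav_args(src: str, dst: str) -> list[str]:
    """Build an ffmpeg argument list that converts `src` to 16 kHz 16-bit PCM WAV.

    Hypothetical helper for illustration; not taken from whisper_schedule.
    """
    return [
        "ffmpeg",
        "-y",                 # assumption: overwrite dst if it already exists
        "-i", src,            # input media (any container ffmpeg can read)
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # assumption: downmix to a single channel
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        dst,
    ]
```

A worker could pass the returned list to `subprocess.run(args, check=True)`. Building the command as a list rather than a shell string avoids quoting problems with unusual file names.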

The diarization is the interesting part, and it does not trust one model. It splits the problem into three questions answered by three specialists:

When does the speaker change: a tinydiarize pass marks turn-cut boundaries.
What was said: whisper large-v2 gives the words and accurate timings.
Who is speaking: a WeSpeaker embedding per turn, clustered to assign consistent speaker labels.
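To make the "who is speaking" step concrete, here is a minimal sketch of clustering per-turn embeddings without fixing the number of speakers in advance. It uses a simple greedy cosine-threshold scheme as a stand-in; the clustering method whisper_schedule actually uses is not stated here, and `label_turns`, the `SPEAKER_n` label format, and the 0.75 threshold are all assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def label_turns(embeddings: list[list[float]], threshold: float = 0.75) -> list[str]:
    """Assign a speaker label to each turn embedding, in order.

    A turn joins the most similar existing speaker if the similarity reaches
    `threshold`; otherwise it starts a new speaker. The speaker count is
    whatever number of clusters emerges.
    """
    centroids: list[list[float]] = []  # per-speaker sum of member embeddings
    labels: list[str] = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(emb))
            best = len(centroids) - 1
        else:
            # Cosine ignores scale, so a running sum behaves like a running mean.
            centroids[best] = [c + e for c, e in zip(centroids[best], emb)]
        labels.append(f"SPEAKER_{best}")
    return labels
```

With toy two-dimensional vectors, turns pointing in similar directions share a label and the number of distinct labels is the estimated speaker count. Real speaker embeddings have hundreds of dimensions, and a production system would likely use a more robust clustering method than this greedy pass.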

The speaker count is estimated from the data, not hardcoded, and a cut between speakers is snapped to the nearest silence so a label change never splits a word. The honest limit: the turn detector is English-only and assumes one voice at a time, so heavily overlapping speech degrades it. The write-up goes deeper into why fusing three narrow tools beats one end-to-end pipeline.
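The snapping idea can be sketched as follows, under the assumption that "silence" is approximated by the gaps between consecutive word timings from the transcript. The function `snap_cut` and its input shape are hypothetical; the real service may detect silence from the audio itself.

```python
def snap_cut(cut_ms: int, words: list[tuple[int, int]]) -> int:
    """Move a speaker-change cut into the nearest gap between words.

    `words` is a time-sorted list of (start_ms, end_ms) word timings.
    Assumption: inter-word gaps stand in for silence. If the cut already
    falls in a gap it is left alone; otherwise it moves to the closest
    edge of the nearest gap, so it never lands inside a word.
    """
    gaps = [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(words, words[1:])
        if next_start >= prev_end
    ]
    if not gaps:
        return cut_ms  # nothing to snap to

    def distance(gap: tuple[int, int]) -> int:
        lo, hi = gap
        if lo <= cut_ms <= hi:
            return 0
        return min(abs(cut_ms - lo), abs(cut_ms - hi))

    lo, hi = min(gaps, key=distance)
    return min(max(cut_ms, lo), hi)  # clamp the cut into the chosen gap
```

For example, with words at 0-400 ms, 450-900 ms, and 1200-1600 ms, a cut proposed at 700 ms sits inside the second word and is moved to 900 ms, the start of the nearest gap.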

voice ai local-models architecture