whisper_schedule
A recording goes in, a speaker-labelled transcript comes out.
system · whisper_schedule (hosted transcription + diarization service)
- fetch media
- ffmpeg to 16 kHz WAV
- transcribe (whisper.cpp large-v2)
- detect turns (tinydiarize)
- embed + cluster (WeSpeaker)
Pick a sample, run the pipeline, and watch it move through the real stages: fetch the media, transcode it to 16 kHz WAV with ffmpeg, transcribe it with whisper.cpp, find the speaker turns, then embed and cluster each turn to label who said what.
How it works
whisper_schedule is a job service, not a single function. You POST a job, it lands in a strict FIFO queue (BullMQ on Redis), and a worker runs the pipeline: download the media, transcode it to 16 kHz WAV with ffmpeg, run whisper.cpp to get a transcript with millisecond word timings, then diarize.
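To make the transcode stage concrete, here is a minimal sketch of the ffmpeg call a worker might issue. The function names are hypothetical, and the mono, 16-bit PCM settings are an assumption taken from the input format whisper.cpp documents, not something this page states about the service.

```python
import subprocess


def ffmpeg_wav_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that writes 16 kHz mono 16-bit PCM WAV.

    Assumption: mono and pcm_s16le follow the whisper.cpp input format;
    the source only says "16k WAV".
    """
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", src,            # input media (any container ffmpeg can read)
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to a single channel
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        dst,
    ]


def transcode(src: str, dst: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_wav_command(src, dst), check=True)
```

Keeping the command builder separate from the `subprocess.run` call makes the arguments easy to check without ffmpeg installed.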
The diarization is the interesting part, and it does not trust one model. It splits the problem into three questions answered by three specialists:
- When does the speaker change? A tinydiarize pass marks the turn boundaries.
- What was said? whisper.cpp large-v2 gives the words and accurate timings.
- Who is speaking? A WeSpeaker embedding is computed per turn, and the embeddings are clustered to assign consistent speaker labels.
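The third question can be sketched as follows. This is an illustration only: the page does not say which clustering algorithm the service uses, so the greedy threshold clustering below is a stand-in, and `label_turns`, the `SPEAKER_n` label format, and the 0.7 threshold are all hypothetical. Plain lists of floats stand in for WeSpeaker embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_turns(embeddings: list[list[float]], threshold: float = 0.7) -> list[str]:
    """Assign a speaker label to each turn embedding, in order.

    A turn joins the most similar existing speaker if the similarity
    reaches `threshold`; otherwise it opens a new speaker. The number
    of speakers therefore falls out of the data.
    """
    centroids: list[list[float]] = []  # per-speaker running sums
    labels: list[str] = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(emb))
            best = len(centroids) - 1
        else:
            # Cosine ignores scale, so a running sum serves as the centroid.
            centroids[best] = [c + e for c, e in zip(centroids[best], emb)]
        labels.append(f"SPEAKER_{best}")
    return labels
```

A threshold-based scheme is one way to avoid fixing the cluster count up front. The trade-off is that the threshold itself has to be tuned.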
The speaker count is estimated from the data, not hardcoded, and a cut between speakers is snapped to the nearest silence so a label change never splits a word. The honest limit: the turn detector is English-only and assumes one voice at a time, so heavily overlapping speech degrades it. The write-up goes deeper into why fusing three narrow tools beats one end-to-end pipeline.
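The snapping step can be illustrated with a short sketch. It assumes, hypothetically, that silence is taken to be the gaps between consecutive word timings from the transcript; the service may detect silence differently. The `Word` type and both function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int


def silence_gaps(words: list[Word], min_gap_ms: int = 0) -> list[tuple[int, int]]:
    """Return (start, end) gaps between consecutive words, in milliseconds."""
    gaps = []
    for prev, nxt in zip(words, words[1:]):
        if nxt.start_ms - prev.end_ms >= min_gap_ms:
            gaps.append((prev.end_ms, nxt.start_ms))
    return gaps


def snap_cut(cut_ms: int, gaps: list[tuple[int, int]]) -> int:
    """Move a speaker cut into the nearest gap so it never lands inside a word."""
    if not gaps:
        return cut_ms  # nothing to snap to; leave the cut where it is

    def distance(gap: tuple[int, int]) -> int:
        start, end = gap
        if start <= cut_ms <= end:
            return 0
        return min(abs(cut_ms - start), abs(cut_ms - end))

    start, end = min(gaps, key=distance)
    # Clamp into the chosen gap: unchanged if already inside it,
    # otherwise moved to the gap's nearer edge.
    return min(max(cut_ms, start), end)
```

For example, with words spanning 0-400, 450-900 and 1200-1500 ms, a cut proposed at 700 ms (inside the second word) moves to 900 ms, the start of the following gap.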