Tech corner - 29. December 2025

Whisper: From transcription to speaker diarization (Part 2)


In the previous article, I described how I built a local tool to generate subtitles using faster-whisper. That solved the basic problem of turning speech into text. But for a lot of internal videos (scrums, demos, knowledge-sharing sessions), text alone doesn’t answer the more interesting question:

Not just what was said, but who said it?

In professional subtitle workflows, this is known as speaker diarization: segmenting the audio by speaker identity and assigning each spoken fragment to the right person (SPEAKER_0, SPEAKER_1, etc.).

So I created an updated version of the tool that combines faster-whisper (with word-level timestamps) and Pyannote for speaker diarization. It merges these two data streams into SRT and VTT subtitles with individual speaker labels.

Just like before, everything runs locally. The whole workflow looks like this:

  1. faster-whisper transcribes the audio and provides precise timestamps for every single word.
  2. Pyannote (a diarization model) decides who is speaking and when.
  3. The tool maps those word-level timestamps to the speaker segments.
  4. The tool merges both pieces of information and outputs:

Plaintext

1
00:00:05,200 --> 00:00:09,100
SPEAKER_1: Good morning, welcome.

2
00:00:10,300 --> 00:00:13,500
SPEAKER_2: Let us begin.

The Script

python

#!/usr/bin/env python3
import argparse
from pathlib import Path
import torch
from faster_whisper import WhisperModel
import tempfile
import subprocess
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()


def srt_time(t: float) -> str:
    h = int(t // 3600)
    m = int((t % 3600) // 60)
    s = int(t % 60)
    ms = int((t - int(t)) * 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def write_srt(segments, out_path: Path, with_speakers=True):
    with out_path.open("w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            start = srt_time(seg["start"])
            end = srt_time(seg["end"])
            text = seg["text"].strip()
            speaker = seg.get("speaker")
            line = f"{speaker}: {text}" if with_speakers and speaker else text
            f.write(f"{i}\n{start} --> {end}\n{line}\n\n")


def write_vtt(segments, out_path: Path, with_speakers=True):
    with out_path.open("w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in segments:
            start = srt_time(seg["start"]).replace(",", ".")
            end = srt_time(seg["end"]).replace(",", ".")
            text = seg["text"].strip()
            speaker = seg.get("speaker")
            line = f"{speaker}: {text}" if with_speakers and speaker else text
            f.write(f"{start} --> {end}\n{line}\n\n")


def main():
    print(f"CUDA available: {torch.cuda.is_available()}")

    ap = argparse.ArgumentParser(description="faster-whisper transcription + pyannote speaker diarization (Slovak by default)")
    ap.add_argument("input", help="Audio or video file")
    ap.add_argument("--output", default="out.srt")
    ap.add_argument("--lang", default="sk")
    ap.add_argument("--model", default="large-v3")
    ap.add_argument("--no-speakers", action="store_true")
    ap.add_argument("--hf-token", help="Hugging Face token for speaker diarization")
    args = ap.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    input_file = args.input

    compute_type = "int8" if device == "cpu" else "float16"

    # Use faster-whisper for transcription with word timestamps
    model = WhisperModel(args.model, device=device, compute_type=compute_type)
    segments, info = model.transcribe(
        input_file,
        language=args.lang,
        word_timestamps=True,
        vad_filter=True
    )

    # Convert faster-whisper segments to a list of dicts with word-level data
    transcribed_segments = []
    for seg in segments:
        seg_dict = {
            "start": seg.start,
            "end": seg.end,
            "text": seg.text,
            "words": []
        }
        if seg.words:
            for word in seg.words:
                seg_dict["words"].append({
                    "start": word.start,
                    "end": word.end,
                    "word": word.word
                })
        transcribed_segments.append(seg_dict)

    if not args.no_speakers:
        hf_token = args.hf_token or os.getenv("HF_TOKEN")
        if not hf_token:
            print("Error: Hugging Face token not found. Please provide it via --hf-token or set HF_TOKEN in .env file")
            return

        # Load the diarization pipeline from pyannote
        from pyannote.audio import Pipeline
        diarize_model = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=hf_token
        )

        # Extract audio to WAV for diarization (pyannote can't read MP4)
        print("Extracting audio for diarization...")
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_audio:
            tmp_audio_path = tmp_audio.name
        subprocess.run(
            ["ffmpeg", "-i", input_file, "-ar", "16000", "-ac", "1", "-y", tmp_audio_path],
            check=True, capture_output=True
        )

        try:
            # Run diarization on the extracted audio
            print("Running speaker diarization...")
            diarize_segments = diarize_model(tmp_audio_path)
            print(f"Diarization complete. Type: {type(diarize_segments)}")

            # Flatten the pyannote Annotation into a DataFrame of speaker turns
            import pandas as pd
            diarize_df = pd.DataFrame([
                {'start': segment.start, 'end': segment.end, 'speaker': label}
                for segment, _, label in diarize_segments.itertracks(yield_label=True)
            ])
            print(f"Converted to DataFrame with {len(diarize_df)} segments")

            # Assign speakers to segments via their word-level timestamps
            print("Assigning speakers to words...")
            segs = assign_speakers_to_segments(transcribed_segments, diarize_df)
            print("Speaker assignment complete.")
        finally:
            # Clean up the temporary file
            Path(tmp_audio_path).unlink(missing_ok=True)

        out_srt = Path(args.output)
        out_vtt = out_srt.with_suffix(".vtt")
        write_srt(segs, out_srt, with_speakers=True)
        write_vtt(segs, out_vtt, with_speakers=True)
        print(f"Saved: {out_srt} | {out_vtt}")
    else:
        segs = transcribed_segments
        out_srt = Path(args.output)
        out_vtt = out_srt.with_suffix(".vtt")
        write_srt(segs, out_srt, with_speakers=False)
        write_vtt(segs, out_vtt, with_speakers=False)
        print(f"Saved: {out_srt} | {out_vtt}")


def assign_speakers_to_segments(segments, diarize_df):
    """Assign a speaker to each segment based on word-level timestamps and the diarization turns."""
    result_segments = []

    for seg in segments:
        seg_copy = seg.copy()

        if not seg.get("words"):
            # No word-level timestamps: assign a speaker from the segment's own time range
            seg_start, seg_end = seg["start"], seg["end"]
            speaker = find_speaker_for_time_range(seg_start, seg_end, diarize_df)
            seg_copy["speaker"] = speaker
        else:
            # Assign a speaker to each word, then determine the segment speaker
            word_speakers = []
            for word in seg["words"]:
                word_start = word["start"]
                word_end = word["end"]
                speaker = find_speaker_for_time_range(word_start, word_end, diarize_df)
                word_speakers.append(speaker)

            # The segment speaker is the most common speaker among its words
            if word_speakers:
                from collections import Counter
                speaker_counts = Counter(word_speakers)
                seg_copy["speaker"] = speaker_counts.most_common(1)[0][0]
            else:
                seg_copy["speaker"] = "UNKNOWN"

        result_segments.append(seg_copy)

    return result_segments


def find_speaker_for_time_range(start, end, diarize_df):
    """Find the speaker for a given time range based on its midpoint."""
    if diarize_df.empty:
        return "UNKNOWN"

    mid_point = (start + end) / 2

    # Find diarization turns that contain the midpoint of this time range
    overlapping = diarize_df[
        (diarize_df['start'] <= mid_point) & (diarize_df['end'] >= mid_point)
    ]

    if not overlapping.empty:
        # Take the first turn that contains the midpoint
        return overlapping.iloc[0]['speaker']

    # If no turn contains the midpoint, fall back to the closest turn
    diarize_df['distance'] = diarize_df.apply(
        lambda row: min(abs(row['start'] - mid_point), abs(row['end'] - mid_point)),
        axis=1
    )
    closest = diarize_df.loc[diarize_df['distance'].idxmin()]
    return closest['speaker']


if __name__ == "__main__":
    main()

Why faster-whisper?

While the standard Whisper model is great, faster-whisper is a reimplementation using CTranslate2. It is significantly faster and uses less memory while maintaining the same accuracy.

Key benefits include:

  1. Word-level timestamps: By setting word_timestamps=True, we get the exact start and end time of every word, which is crucial for matching text to a speaker.
  2. VAD Filter: Built-in Voice Activity Detection helps ignore background noise and silence.
  3. Quantization: It supports int8 for fast CPU performance and float16 for high-speed GPU processing, as the short sketch below shows.
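
To make these options concrete, here is a minimal, standalone sketch (the model size, language, and file name are placeholders, not taken from my setup) that prints every recognized word with its timestamps:

python

from faster_whisper import WhisperModel

# Placeholders: pick any model size and media file you like.
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "meeting.mp4",
    language="sk",
    word_timestamps=True,  # exact start/end time for every word
    vad_filter=True,       # skip silence and background noise
)

for seg in segments:
    for word in seg.words or []:
        print(f"{word.start:7.2f} - {word.end:7.2f}  {word.word}")

On a GPU you would use device="cuda" and compute_type="float16" instead, exactly as the full script does.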

Hugging Face and Pyannote

To handle the "who is speaking" part, I used Pyannote. These models are hosted on Hugging Face, the central hub for modern AI.

Unlike the transcription model, Pyannote's state-of-the-art diarization models (like version 3.1) are "gated." This means you need to:

  1. Accept the user conditions on the Pyannote Hugging Face page.
  2. Provide an HF Token to the script.

The script uses this token to download the weights once and then runs everything locally on your machine.
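
For reference, the diarization step on its own boils down to a few lines. This is only a sketch: it assumes HF_TOKEN is set in a local .env file, and meeting.wav stands in for the 16 kHz mono WAV the script extracts with ffmpeg.

python

import os
from dotenv import load_dotenv
from pyannote.audio import Pipeline

load_dotenv()  # reads HF_TOKEN from a local .env file

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.getenv("HF_TOKEN"),
)

# "meeting.wav" is a placeholder for the extracted 16 kHz mono audio.
diarization = pipeline("meeting.wav")

# The result is a pyannote Annotation; each track is (segment, track_id, speaker_label).
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:7.2f} - {segment.end:7.2f}  {speaker}")

The full script stores exactly these (start, end, speaker) turns in a pandas DataFrame before matching them against the transcript.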

The challenge: Merging two worlds

The most interesting part of the engineering here is the assign_speakers_to_segments function.

  1. Whisper knows what was said at 00:05.20.
  2. Pyannote knows someone was speaking at 00:05.20.
  3. The script takes every word from Whisper, checks its midpoint against Pyannote’s timeline, and assigns the most likely speaker label.

This "voting" system ensures that even if timestamps aren't perfectly aligned to the millisecond, the subtitle segment as a whole gets assigned to the correct person.

Conclusion

By combining the speed of faster-whisper with the speaker-awareness of Pyannote, we’ve moved from simple transcription to a "professional" subtitle workflow. Whether it’s a recorded scrum or a technical demo, you now have a tool that gives you a searchable, speaker-labeled record—all while keeping your data private and local.


Author
Peter Lastovecký

I am a curious and creative professional with a keen eye for detail. I’m passionate about technology that simplifies life and conversations that bring meaningful insights. As the AI Stream Co-lead at Hotovo, I have the daily opportunity to channel my values and skills into building functional, impactful projects.
