Whisper: A local pipeline for offline subtitles

Running Whisper locally is one of those things that sounds more complicated than it actually is. With the right setup, it can be faster, cheaper, and more predictable than cloud-based alternatives.
For one of our internal videos, I needed Slovak subtitles.
My first instinct was the usual one: there must be a SaaS for this.
I tried a couple of paid, cloud-based subtitle tools. They worked… mostly. But every time I hit “Upload”, it bothered me that:
- I was sending quite large private videos to a third party,
- I’d pay per minute forever, instead of just once with hardware I already own,
- and I had zero control over the pipeline (no easy way to script or automate things).
So I ended up doing what developers do: I built my own small CLI tool on top of faster-whisper — a super-fast reimplementation of OpenAI’s Whisper model.
The result is a single Python script that:
- takes a local video or audio file,
- runs speech-to-text using faster-whisper,
- and outputs both .srt and .vtt subtitles.
Here’s the full script for reference. Then I’ll walk through the interesting parts:
```python
#!/usr/bin/env python3
import argparse, sys
from pathlib import Path
from faster_whisper import WhisperModel

def srt_time(t: float) -> str:
    h = int(t // 3600)
    m = int((t % 3600) // 60)
    s = int(t % 60)
    ms = int((t - int(t)) * 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def write_srt(segments, out_path: Path):
    with out_path.open("w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(
                f"{i}\n"
                f"{srt_time(seg.start)} --> {srt_time(seg.end)}\n"
                f"{seg.text.strip()}\n\n"
            )

def write_vtt(segments, out_path: Path):
    with out_path.open("w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in segments:
            start = srt_time(seg.start).replace(",", ".")
            end = srt_time(seg.end).replace(",", ".")
            f.write(f"{start} --> {end}\n{seg.text.strip()}\n\n")

def main():
    p = argparse.ArgumentParser(description="Local Slovak subtitles via faster-whisper")
    p.add_argument("input", help="Video or audio file path")
    p.add_argument("--model", default="large-v3", help="tiny/base/small/medium/large-v3")
    p.add_argument("--lang", default="sk", help="Language code (default: sk)")
    p.add_argument("--word-ts", action="store_true", help="Per-word timestamps")
    p.add_argument("--out", default=None, help="Output basename (no extension)")
    p.add_argument("--device", default="auto", help="auto/cpu/cuda")
    p.add_argument("--compute", default="auto", help="auto/int8/int8_float16/float16/float32")
    p.add_argument("--no-vad", action="store_true", help="Disable VAD filter")
    args = p.parse_args()

    src = Path(args.input)
    if not src.exists():
        print("Input not found.", file=sys.stderr)
        sys.exit(1)

    basename = args.out or src.with_suffix("").name
    srt_path = Path(f"{basename}.srt")
    vtt_path = Path(f"{basename}.vtt")

    model = WhisperModel(args.model, device=args.device, compute_type=args.compute)
    # transcribe() returns a tuple: a lazy generator of segments plus transcription info
    segments, info = model.transcribe(
        str(src),
        language=args.lang,
        vad_filter=not args.no_vad,
        word_timestamps=args.word_ts,
    )

    segs = list(segments)
    write_srt(segs, srt_path)
    write_vtt(segs, vtt_path)
    print(f"Saved: {srt_path} | {vtt_path}")

if __name__ == "__main__":
    main()
```
Why faster-whisper?
OpenAI’s original Whisper model is open source (MIT-licensed) and very good at multilingual transcription. The downside is that the reference implementation isn’t exactly lightweight.
faster-whisper reimplements Whisper on top of CTranslate2, an optimized inference engine for Transformer models. It’s built by SYSTRAN and is also MIT-licensed, just like Whisper itself.
According to the project’s benchmarks, faster-whisper offers:
- up to 4× faster inference than the original Whisper implementation at similar accuracy,
- lower memory usage, especially with 8-bit quantization enabled,
- CPU and GPU support across multiple platforms.
For a small internal tool that I want to run locally on different machines (a laptop, a workstation, maybe a small server later), that combination of speed, memory efficiency, and permissive licensing is a great fit.
Command-line UX: one script, many knobs
At the top, I use argparse to turn the script into a small CLI:
```python
p = argparse.ArgumentParser(description="Local Slovak subtitles via faster-whisper")
p.add_argument("input", help="Video or audio file path")
p.add_argument("--model", default="large-v3", help="tiny/base/small/medium/large-v3")
p.add_argument("--lang", default="sk", help="Language code (default: sk)")
p.add_argument("--word-ts", action="store_true", help="Per-word timestamps")
p.add_argument("--out", default=None, help="Output basename (no extension)")
p.add_argument("--device", default="auto", help="auto/cpu/cuda")
p.add_argument("--compute", default="auto", help="auto/int8/int8_float16/float16/float32")
p.add_argument("--no-vad", action="store_true", help="Disable VAD filter")
```
This lets me do things like:
```bash
# Simple Slovak subtitles, auto-detect CPU/GPU
./subtitles.py talk.mp4

# Force GPU with int8 compute, English, custom output basename
./subtitles.py talk.mp4 \
  --lang en \
  --model large-v3 \
  --device cuda \
  --compute int8_float16 \
  --out talk_en
```
Some of these flags map almost 1:1 to faster-whisper concepts:
- --device: "cpu", "cuda", or "auto"
- --compute: numeric precision / quantization strategy
- --word-ts: enable word-level timestamps
- --no-vad: disable the Voice Activity Detection (VAD) filter
SRT / VTT: formatting the timestamps
The first building block is srt_time, which converts a floating-point timestamp (in seconds) into the HH:MM:SS,mmm format:
```python
def srt_time(t: float) -> str:
    h = int(t // 3600)
    m = int((t % 3600) // 60)
    s = int(t % 60)
    ms = int((t - int(t)) * 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"
```
This is deliberately simple and explicit. SRT and VTT are straightforward text formats, but they’re picky about formatting, and small timestamp errors can accumulate over long videos.
I then have two helpers for writing the actual files.
SRT
SRT expects:
- a numeric index per subtitle block,
- a start --> end timestamp line,
- the text,
- and a blank line.
VTT
VTT is very similar, but:
- it starts with a WEBVTT header,
- and timestamps use . instead of , for milliseconds.
Both writers consume the same list of segments and just format them differently.
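To make the difference concrete, here is the same made-up segment rendered both ways; the timestamps and text are purely illustrative:

```python
# One hypothetical segment: speech from 12.45 s to 17.12 s.

srt_block = (
    "1\n"                              # numeric index per block
    "00:00:12,450 --> 00:00:17,120\n"  # comma before the milliseconds
    "Good morning everybody\n"
    "\n"                               # blank line closes the block
)

vtt_block = (
    "WEBVTT\n\n"                       # mandatory header, once per file
    "00:00:12.450 --> 00:00:17.120\n"  # dot before the milliseconds, no index needed
    "Good morning everybody\n"
    "\n"
)
```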
Segments and word timestamps
faster-whisper transcribes the audio incrementally and yields the decoded speech chunks one by one. Each chunk is exposed as a Segment object.
A segment represents a continuous portion of audio where the model believes someone is speaking the same coherent sentence or phrase — think of segments as “speech paragraphs”.
A simplified segment looks like this:
```python
Segment(
    id=0,
    start=12.45,
    end=17.12,
    text="Good morning everybody",
    avg_logprob=-0.15,
    no_speech_prob=0.02,
    words=[
        Word(start=12.45, end=12.92, word="Good"),
        Word(start=12.92, end=13.30, word="morning"),
        Word(start=13.30, end=14.00, word="everybody"),
    ],
)
```
The core: wiring faster-whisper
The key part of the script is:
```python
model = WhisperModel(args.model, device=args.device, compute_type=args.compute)
# transcribe() returns a tuple: a generator of segments plus transcription info
segments, info = model.transcribe(
    str(src),
    language=args.lang,
    vad_filter=not args.no_vad,
    word_timestamps=args.word_ts,
)

segs = list(segments)
write_srt(segs, srt_path)
write_vtt(segs, vtt_path)
```
1. Model selection
WhisperModel(args.model, ...) accepts either a model size ("tiny", "small", "large-v3", etc.) or a Hugging Face ID for a converted CTranslate2 model.
When you pass "large-v3", faster-whisper automatically downloads the optimized weights on first use.
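For example, both of these work (the Hugging Face repo name below is the official SYSTRAN conversion as far as I know; double-check the exact ID before relying on it):

```python
from faster_whisper import WhisperModel

# Built-in size name: the converted weights are downloaded and cached on first use.
model = WhisperModel("large-v3", device="auto", compute_type="auto")

# Or point directly at a CTranslate2-converted model on Hugging Face.
model = WhisperModel("Systran/faster-whisper-large-v3")
```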
2. Device & compute types (CPU vs GPU)
The device and compute_type arguments let you trade off speed, accuracy, and memory usage.
Devices
- "cpu" — run on CPU
- "cuda" — run on NVIDIA GPU
- "auto" — let faster-whisper decide
Compute types
- "float32" — full precision, highest memory usage
- "float16" — faster on GPU, lower VRAM usage
- "int8" / "int8_float16" — quantized inference, minimal memory usage with a small accuracy trade-off
In practice, good starting points are:
- On a laptop without a GPU: --device cpu --compute int8 with a smaller model (small or medium).
- On a GPU-equipped desktop: --device cuda --compute float16 or int8_float16 for large-v3.
You can tune these until nvidia-smi or htop stops screaming at you.
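As a rough sketch, these are the two configurations I would reach for first (treat the model sizes as starting points, not gospel):

```python
from faster_whisper import WhisperModel

# CPU-only laptop: 8-bit quantization keeps memory usage modest.
cpu_model = WhisperModel("small", device="cpu", compute_type="int8")

# NVIDIA GPU: float16 (or int8_float16) is usually the sweet spot for large-v3.
gpu_model = WhisperModel("large-v3", device="cuda", compute_type="float16")
```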
3. VAD filter
```python
vad_filter = not args.no_vad
```
faster-whisper includes a Voice Activity Detection (VAD) filter based on Silero VAD. It skips sections without speech (long silences, pauses), which:
- reduces unnecessary computation,
- produces cleaner subtitles that don’t span silent gaps.
It’s enabled by default but can be disabled with --no-vad.
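The filter can also be tuned via the vad_parameters argument (check the faster-whisper docs for the full option list). A minimal sketch, reusing model and src from the script, where the 500 ms threshold is just an example value:

```python
# Only treat pauses longer than 500 ms as silence worth skipping.
segments, info = model.transcribe(
    str(src),
    language="sk",
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},
)
```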
4. Word-level timestamps
With --word-ts, the library populates segment.words with per-word timestamps.
I don’t use this yet in the SRT/VTT output, but it enables things like:
- karaoke-style highlighting,
- syncing text with on-screen elements,
- more precise editing and cutting tools.
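Reading the word-level data back is straightforward; a small sketch, again reusing model and src from the script:

```python
segments, info = model.transcribe(str(src), language="sk", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        # Each word carries its own start/end time in seconds.
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word}")
```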
5. Why list(segments)?
segments is a generator. Iterating over it drives the transcription.
Because I want to generate both SRT and VTT, I materialize it once into a list. Otherwise, the second writer would see an already-consumed generator and produce an empty file.
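In other words, a trimmed-down illustration of the same point:

```python
segments, info = model.transcribe(str(src), language="sk")

# Don't do this: write_srt() would exhaust the generator,
# and write_vtt() would then produce a file with no cues.
#
# write_srt(segments, srt_path)
# write_vtt(segments, vtt_path)

# Materialize once, then reuse the list as often as needed.
segs = list(segments)
write_srt(segs, srt_path)
write_vtt(segs, vtt_path)
```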
Licensing & “can I run this at work?”
From a legal and engineering perspective, this setup is refreshingly boring:
- OpenAI Whisper — MIT license
- faster-whisper — MIT license
MIT is very permissive:
- commercial use is allowed,
- modification and redistribution are allowed,
- attribution and license notices must be preserved.
Individual models on Hugging Face may have their own licenses, but the official faster-whisper CTranslate2 ports of OpenAI’s models are also MIT-licensed.
So this kind of script is generally fine to use in a company setting, subject to internal policies around running AI models locally.
Where this can go next
This is intentionally a minimal tool: one input, two subtitle files out.
Obvious next steps include:
- Parallel / batched processing — faster-whisper ships a BatchedInferencePipeline that batches audio chunks for higher throughput (see the sketch after this list).
- Speaker diarization & alignment — tools like WhisperX build on faster-whisper to identify who is speaking.
- A simple web UI — wrap the script in a small FastAPI app for non-technical colleagues.
- Video pipeline integration — e.g., a CI job that generates subtitles when a video lands in a repo or object storage.
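For the batched pipeline specifically, usage looks roughly like this (sketched from the faster-whisper README; check the current docs for the exact API, and treat batch_size=16 as an example value):

```python
from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

# batch_size trades VRAM for throughput.
segments, info = batched.transcribe("talk.mp4", batch_size=16)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```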
For now, though, this little script already gives me:
- local, offline Slovak subtitles,
- predictable runtime and memory usage,
- zero ongoing SaaS fees.
Sometimes the best tool for the job really is a 100-line Python script.