Whisper: A local pipeline for offline subtitles

Running Whisper locally is one of those things that sounds more complicated than it actually is. With the right setup, it can be faster, cheaper, and more predictable than cloud-based alternatives.
For one of our internal videos, I needed Slovak subtitles.
My first instinct was the usual one: there must be a SaaS for this.
I tried a couple of paid, cloud-based subtitle tools. They worked… mostly. But every time I hit “Upload”, it bothered me that:
- I was sending quite large private videos to a third party,
- I’d pay per minute forever, instead of just once with hardware I already own,
- and I had zero control over the pipeline (no easy way to script or automate things).
So I ended up doing what developers do: I built my own small CLI tool on top of faster-whisper — a super-fast reimplementation of OpenAI’s Whisper model.
The result is a single Python script that:
- takes a local video or audio file,
- runs speech-to-text using faster-whisper,
- and outputs both .srt and .vtt subtitles.
Here’s the full script for reference. Then I’ll walk through the interesting parts:
```python
#!/usr/bin/env python3
import argparse, sys
from pathlib import Path
from faster_whisper import WhisperModel

def srt_time(t: float) -> str:
    h = int(t // 3600)
    m = int((t % 3600) // 60)
    s = int(t % 60)
    ms = int((t - int(t)) * 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def write_srt(segments, out_path: Path):
    with out_path.open("w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(
                f"{i}\n"
                f"{srt_time(seg.start)} --> {srt_time(seg.end)}\n"
                f"{seg.text.strip()}\n\n"
            )

def write_vtt(segments, out_path: Path):
    with out_path.open("w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in segments:
            start = srt_time(seg.start).replace(",", ".")
            end = srt_time(seg.end).replace(",", ".")
            f.write(f"{start} --> {end}\n{seg.text.strip()}\n\n")

def main():
    p = argparse.ArgumentParser(description="Local Slovak subtitles via faster-whisper")
    p.add_argument("input", help="Video or audio file path")
    p.add_argument("--model", default="large-v3", help="tiny/base/small/medium/large-v3")
    p.add_argument("--lang", default="sk", help="Language code (default: sk)")
    p.add_argument("--word-ts", action="store_true", help="Per-word timestamps")
    p.add_argument("--out", default=None, help="Output basename (no extension)")
    p.add_argument("--device", default="auto", help="auto/cpu/cuda")
    p.add_argument("--compute", default="auto", help="auto/int8/int8_float16/float16/float32")
    p.add_argument("--no-vad", action="store_true", help="Disable VAD filter")
    args = p.parse_args()

    src = Path(args.input)
    if not src.exists():
        print("Input not found.", file=sys.stderr)
        sys.exit(1)

    basename = args.out or src.with_suffix("").name
    srt_path = Path(f"{basename}.srt")
    vtt_path = Path(f"{basename}.vtt")

    model = WhisperModel(args.model, device=args.device, compute_type=args.compute)
    # transcribe() returns a tuple: a lazy generator of segments plus transcription info
    segments, info = model.transcribe(
        str(src),
        language=args.lang,
        vad_filter=not args.no_vad,
        word_timestamps=args.word_ts,
    )

    segs = list(segments)
    write_srt(segs, srt_path)
    write_vtt(segs, vtt_path)
    print(f"Saved: {srt_path} | {vtt_path}")

if __name__ == "__main__":
    main()
```
Why faster-whisper?
OpenAI’s original Whisper model is open source (MIT-licensed) and very good at multilingual transcription. The downside is that the reference implementation isn’t exactly lightweight.
faster-whisper reimplements Whisper on top of CTranslate2, an optimized inference engine for Transformer models. It’s built by SYSTRAN and is also MIT-licensed, just like Whisper itself.
According to the project’s benchmarks, faster-whisper offers:
- up to 4× faster inference than the original Whisper implementation at similar accuracy,
- lower memory usage, especially with 8-bit quantization enabled,
- CPU and GPU support across multiple platforms.
For a small internal tool that I want to run locally on different machines (a laptop, a workstation, maybe a small server later), that combination of speed, memory efficiency, and permissive licensing is a great fit.
Command-line UX: one script, many knobs
At the top, I use argparse to turn the script into a small CLI:
```python
p = argparse.ArgumentParser(description="Local Slovak subtitles via faster-whisper")
p.add_argument("input", help="Video or audio file path")
p.add_argument("--model", default="large-v3", help="tiny/base/small/medium/large-v3")
p.add_argument("--lang", default="sk", help="Language code (default: sk)")
p.add_argument("--word-ts", action="store_true", help="Per-word timestamps")
p.add_argument("--out", default=None, help="Output basename (no extension)")
p.add_argument("--device", default="auto", help="auto/cpu/cuda")
p.add_argument("--compute", default="auto", help="auto/int8/int8_float16/float16/float32")
p.add_argument("--no-vad", action="store_true", help="Disable VAD filter")
```
This lets me do things like:
```bash
# Simple Slovak subtitles, auto-detect CPU/GPU
./subtitles.py talk.mp4

# Force GPU with int8 compute, English, custom output basename
./subtitles.py talk.mp4 \
  --lang en \
  --model large-v3 \
  --device cuda \
  --compute int8_float16 \
  --out talk_en
```
Some of these flags map almost 1:1 to faster-whisper concepts:
- --device: "cpu", "cuda", or "auto"
- --compute: numeric precision / quantization strategy
- --word-ts: enable word-level timestamps
- --no-vad: disable the Voice Activity Detection (VAD) filter
SRT / VTT: formatting the timestamps
The first building block is srt_time, which converts a floating-point timestamp (in seconds) into the HH:MM:SS,mmm format:
```python
def srt_time(t: float) -> str:
    h = int(t // 3600)
    m = int((t % 3600) // 60)
    s = int(t % 60)
    ms = int((t - int(t)) * 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"
```
This is deliberately simple and explicit. SRT and VTT are straightforward text formats, but they’re picky about formatting, and small timestamp errors can accumulate over long videos.
I then have two helpers for writing the actual files.
SRT
SRT expects:
- a numeric index per subtitle block,
- a start --> end timestamp line,
- the text,
- and a blank line.
VTT
VTT is very similar, but:
- it starts with a WEBVTT header,
- and timestamps use . instead of , for milliseconds.
Both writers consume the same list of segments and just format them differently.
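To make the difference concrete, here is the same made-up segment rendered both ways; the timestamps and text are purely illustrative:

```python
# One hypothetical segment: speech from 12.45 s to 17.12 s.

srt_block = (
    "1\n"                              # numeric index per block
    "00:00:12,450 --> 00:00:17,120\n"  # comma before the milliseconds
    "Good morning everybody\n"
    "\n"                               # blank line closes the block
)

vtt_block = (
    "WEBVTT\n\n"                       # mandatory header, once per file
    "00:00:12.450 --> 00:00:17.120\n"  # dot before the milliseconds, no index needed
    "Good morning everybody\n"
    "\n"
)
```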
Segments and word timestamps
faster-whisper transcribes the audio incrementally and yields the decoded speech chunks one by one. Each chunk is exposed as a Segment object.
A segment represents a continuous portion of audio where the model believes someone is speaking the same coherent sentence or phrase — think of segments as “speech paragraphs”.
A simplified segment looks like this:
```python
Segment(
    id=0,
    start=12.45,
    end=17.12,
    text="Good morning everybody",
    avg_logprob=-0.15,
    no_speech_prob=0.02,
    words=[
        Word(start=12.45, end=12.92, word="Good"),
        Word(start=12.92, end=13.30, word="morning"),
        Word(start=13.30, end=14.00, word="everybody"),
    ],
)
```
The core: wiring faster-whisper
The key part of the script is:
```python
model = WhisperModel(args.model, device=args.device, compute_type=args.compute)
# transcribe() returns a tuple: a generator of segments plus transcription info
segments, info = model.transcribe(
    str(src),
    language=args.lang,
    vad_filter=not args.no_vad,
    word_timestamps=args.word_ts,
)

segs = list(segments)
write_srt(segs, srt_path)
write_vtt(segs, vtt_path)
```
1. Model selection
WhisperModel(args.model, ...) accepts either a model size ("tiny", "small", "large-v3", etc.) or a Hugging Face ID for a converted CTranslate2 model.
When you pass "large-v3", faster-whisper automatically downloads the optimized weights on first use.
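For example, both of these work (the Hugging Face repo name below is the official SYSTRAN conversion as far as I know; double-check the exact ID before relying on it):

```python
from faster_whisper import WhisperModel

# Built-in size name: the converted weights are downloaded and cached on first use.
model = WhisperModel("large-v3", device="auto", compute_type="auto")

# Or point directly at a CTranslate2-converted model on Hugging Face.
model = WhisperModel("Systran/faster-whisper-large-v3")
```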
2. Device & compute types (CPU vs GPU)
The device and compute_type arguments let you trade off speed, accuracy, and memory usage.
Devices
- "cpu" — run on CPU
- "cuda" — run on NVIDIA GPU
- "auto" — let faster-whisper decide
Compute types
- "float32" — full precision, highest memory usage
- "float16" — faster on GPU, lower VRAM usage
- "int8" / "int8_float16" — quantized inference, minimal memory usage with a small accuracy trade-off
In practice, good starting points are:
- On a laptop without a GPU: --device cpu --compute int8 with a smaller model (small or medium).
- On a GPU-equipped desktop: --device cuda --compute float16 or int8_float16 for large-v3.
You can tune these until nvidia-smi or htop stops screaming at you.
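As a rough sketch, these are the two configurations I would reach for first (treat the model sizes as starting points, not gospel):

```python
from faster_whisper import WhisperModel

# CPU-only laptop: 8-bit quantization keeps memory usage modest.
cpu_model = WhisperModel("small", device="cpu", compute_type="int8")

# NVIDIA GPU: float16 (or int8_float16) is usually the sweet spot for large-v3.
gpu_model = WhisperModel("large-v3", device="cuda", compute_type="float16")
```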
3. VAD filter
```python
vad_filter = not args.no_vad
```
faster-whisper includes a Voice Activity Detection (VAD) filter based on Silero VAD. It skips sections without speech (long silences, pauses), which:
- reduces unnecessary computation,
- produces cleaner subtitles that don’t span silent gaps.
It’s enabled by default but can be disabled with --no-vad.
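The filter can also be tuned via the vad_parameters argument (check the faster-whisper docs for the full option list). A minimal sketch, reusing model and src from the script, where the 500 ms threshold is just an example value:

```python
# Only treat pauses longer than 500 ms as silence worth skipping.
segments, info = model.transcribe(
    str(src),
    language="sk",
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},
)
```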
4. Word-level timestamps
With --word-ts, the library populates segment.words with per-word timestamps.
I don’t use this yet in the SRT/VTT output, but it enables things like:
- karaoke-style highlighting,
- syncing text with on-screen elements,
- more precise editing and cutting tools.
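Reading the word-level data back is straightforward; a small sketch, again reusing model and src from the script:

```python
segments, info = model.transcribe(str(src), language="sk", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        # Each word carries its own start/end time in seconds.
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word}")
```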
5. Why list(segments)?
segments is a generator. Iterating over it drives the transcription.
Because I want to generate both SRT and VTT, I materialize it once into a list. Otherwise, the second writer would see an already-consumed generator and produce an empty file.
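In other words, a trimmed-down illustration of the same point:

```python
segments, info = model.transcribe(str(src), language="sk")

# Don't do this: write_srt() would exhaust the generator,
# and write_vtt() would then produce a file with no cues.
#
# write_srt(segments, srt_path)
# write_vtt(segments, vtt_path)

# Materialize once, then reuse the list as often as needed.
segs = list(segments)
write_srt(segs, srt_path)
write_vtt(segs, vtt_path)
```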
Licensing & “can I run this at work?”
From a legal and engineering perspective, this setup is refreshingly boring:
- OpenAI Whisper — MIT license
- faster-whisper — MIT license
MIT is very permissive:
- commercial use is allowed,
- modification and redistribution are allowed,
- attribution and license notices must be preserved.
Individual models on Hugging Face may have their own licenses, but the official faster-whisper CTranslate2 ports of OpenAI’s models are also MIT-licensed.
So this kind of script is generally fine to use in a company setting, subject to internal policies around running AI models locally.
Where this can go next
This is intentionally a minimal tool: one input, two subtitle files out.
Obvious next steps include:
- Parallel / batched processing — faster-whisper ships a BatchedInferencePipeline that batches audio chunks for higher throughput (see the sketch after this list).
- Speaker diarization & alignment — tools like WhisperX build on faster-whisper to identify who is speaking.
- A simple web UI — wrap the script in a small FastAPI app for non-technical colleagues.
- Video pipeline integration — e.g., a CI job that generates subtitles when a video lands in a repo or object storage.
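For the batched pipeline specifically, usage looks roughly like this (sketched from the faster-whisper README; check the current docs for the exact API, and treat batch_size=16 as an example value):

```python
from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

# batch_size trades VRAM for throughput.
segments, info = batched.transcribe("talk.mp4", batch_size=16)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```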
For now, though, this little script already gives me:
- local, offline Slovak subtitles,
- predictable runtime and memory usage,
- zero ongoing SaaS fees.
Sometimes the best tool for the job really is a 100-line Python script.