AI Tooling · 2026
YouTube Video Transcriber
A modular Python CLI and Streamlit GUI that downloads YouTube videos and turns them into text transcripts and .srt subtitles, using either a local Whisper model or the OpenAI API.
- Python
- Whisper
- OpenAI API
- yt-dlp
- FFmpeg
- Streamlit
- Role: Design · Build
- Timeline: 2026 · Mar to Jun
The problem
Getting the text out of a YouTube video (a talk, an interview, a lecture) usually means relying on unreliable auto-captions or paying a transcription service per minute. There was no simple tool covering the whole flow that also let you choose between cloud quality and fully offline, cost-free transcription.
The approach
I built it as a modular Python package in which yt-dlp handles the download and FFmpeg the audio extraction. The transcription engine is pluggable: either a locally run Whisper model (offline, GPU-accelerated) or the OpenAI Whisper API. Optional dependency groups mean you install only what you use, and both the CLI and the Streamlit GUI sit on the same core.
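A minimal sketch of what a pluggable engine of this kind could look like, not the project's actual code. The names `Engine`, `LocalWhisperEngine`, `OpenAIEngine`, `ENGINES` and `get_engine` are hypothetical. The heavy imports (`whisper` from the openai-whisper package, `openai` for the API client) happen only when a transcription is first requested, so selecting an engine costs nothing until it is used.

```python
from typing import Callable, Dict, Protocol


class Engine(Protocol):
    """Anything with a transcribe() method can act as an engine."""

    def transcribe(self, audio_path: str) -> str: ...


class LocalWhisperEngine:
    """Offline engine; the model is loaded lazily on first use."""

    def __init__(self, model_name: str = "base") -> None:
        self.model_name = model_name
        self._model = None  # nothing is loaded at construction time

    def _load(self):
        if self._model is None:
            import whisper  # assumption: the openai-whisper package is installed

            self._model = whisper.load_model(self.model_name)
        return self._model

    def transcribe(self, audio_path: str) -> str:
        result = self._load().transcribe(audio_path)
        return result["text"]


class OpenAIEngine:
    """Cloud engine; reads the API key from the environment."""

    def __init__(self, model: str = "whisper-1") -> None:
        self.model = model

    def transcribe(self, audio_path: str) -> str:
        from openai import OpenAI  # assumption: openai>=1.0 is installed

        client = OpenAI()
        with open(audio_path, "rb") as audio_file:
            response = client.audio.transcriptions.create(
                model=self.model, file=audio_file
            )
        return response.text


# Hypothetical registry mapping a CLI/GUI choice to an engine factory.
ENGINES: Dict[str, Callable[[], Engine]] = {
    "local": LocalWhisperEngine,
    "api": OpenAIEngine,
}


def get_engine(name: str) -> Engine:
    """Return a fresh engine for the given name, or raise ValueError."""
    try:
        factory = ENGINES[name]
    except KeyError:
        raise ValueError(f"unknown engine: {name!r}") from None
    return factory()
```

With this shape, the CLI and the GUI would both call `get_engine(choice).transcribe(path)` and never import an engine's dependencies directly.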
How it works
- 01 Fetch: yt-dlp downloads the video or audio-only stream, with format selection and browser-cookie support. Local files are accepted as an alternative input.
- 02 Extract: FFmpeg extracts and converts the audio, automatically re-encoding to MP3 when a file exceeds the API size limit.
- 03 Transcribe: The pluggable engine runs a lazily loaded local Whisper model or calls the OpenAI Whisper API. The user picks the trade-off per run.
- 04 Export: Transcripts are saved as plain text and .srt subtitles. Files that already exist are detected and skipped, so nothing is processed twice.
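The size-limit handling in the Extract step can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: the 25 MB cap reflects the upload limit OpenAI has documented for its audio endpoint and should be checked against current docs, and the helper names, the `.api.mp3` suffix and the mono / 16 kHz / 64 kbit/s settings are hypothetical choices. The function only builds the FFmpeg command; running it is left to the caller.

```python
from pathlib import Path
from typing import List, Optional

# Assumption: 25 MB upload cap for the hosted transcription API.
API_SIZE_LIMIT_BYTES = 25 * 1024 * 1024


def needs_reencode(audio: Path, limit: int = API_SIZE_LIMIT_BYTES) -> bool:
    """True when the file is too large to upload as-is."""
    return audio.stat().st_size > limit


def build_reencode_cmd(src: Path, dst: Path, bitrate: str = "64k") -> List[str]:
    """Build an FFmpeg command that writes a smaller mono MP3."""
    return [
        "ffmpeg",
        "-y",            # overwrite the destination if it exists
        "-i", str(src),  # input file
        "-vn",           # drop any video stream
        "-ac", "1",      # downmix to mono
        "-ar", "16000",  # resample to 16 kHz
        "-b:a", bitrate, # target audio bitrate
        str(dst),
    ]


def prepare_for_api(
    audio: Path, limit: int = API_SIZE_LIMIT_BYTES
) -> Optional[List[str]]:
    """Return the command to run, or None if the file already fits."""
    if not needs_reencode(audio, limit):
        return None
    dst = audio.with_name(audio.stem + ".api.mp3")
    return build_reencode_cmd(audio, dst)
```

A caller would pass the returned list to `subprocess.run(cmd, check=True)` and then re-check the size, since a very long recording may still exceed the limit after re-encoding and need to be split.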
Stack
- Language: Python 3.9+
- Download: yt-dlp + FFmpeg
- Engines: Local Whisper · OpenAI Whisper API
- Interface: CLI + Streamlit GUI
- Packaging: Modular extras (local / api / gui)
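One way a package with modular extras can fail gracefully when a group is not installed is to probe for the backing module before using it. The sketch below assumes the three extras map to the import names `whisper`, `openai` and `streamlit`. The distribution name `yt-transcriber` and both helper functions are hypothetical placeholders.

```python
import importlib.util
from typing import Dict, List

# Assumed mapping from extra name to the module that extra provides.
EXTRA_MODULES: Dict[str, str] = {
    "local": "whisper",
    "api": "openai",
    "gui": "streamlit",
}


def available_extras(modules: Dict[str, str] = EXTRA_MODULES) -> List[str]:
    """List the extras whose backing module can be imported."""
    return [
        extra
        for extra, module in modules.items()
        if importlib.util.find_spec(module) is not None
    ]


def require_extra(
    extra: str,
    modules: Dict[str, str] = EXTRA_MODULES,
    package_name: str = "yt-transcriber",  # hypothetical distribution name
) -> None:
    """Raise a helpful error if the requested extra is not installed."""
    module = modules[extra]
    if importlib.util.find_spec(module) is None:
        raise RuntimeError(
            f"The '{extra}' feature needs the '{module}' module. "
            f"Install it with: pip install '{package_name}[{extra}]'"
        )
```

Calling `require_extra("local")` at the top of the local-engine code path turns a bare `ModuleNotFoundError` into an actionable install hint.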
What I learned
- Optional dependency groups keep a tool light: video download, local inference and the API client install independently.
- Idempotent file handling (skipping whatever already exists) turns a slow media pipeline into something you can re-run without thinking.
- Handling real-world media (long videos, size limits, odd formats) is where most of the engineering actually lives.
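The idempotent export mentioned above can be illustrated with a short sketch. It assumes segments shaped like the ones the local Whisper model returns (dicts with `start`, `end` and `text`). The function names and the `.part` temporary suffix are hypothetical. Writing to a temporary file first means an interrupted run never leaves a half-written subtitle file that a later run would wrongly skip.

```python
from pathlib import Path
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def write_if_missing(path: Path, content: str) -> bool:
    """Write content unless path exists; return True if written."""
    if path.exists():
        return False  # already produced by an earlier run: skip
    tmp = path.with_name(path.name + ".part")
    tmp.write_text(content, encoding="utf-8")
    tmp.replace(path)  # move into place only after a complete write
    return True
```

For example, `srt_timestamp(3661.5)` yields `01:01:01,500`, and a second call to `write_if_missing` with the same path returns `False` without touching the file.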
Outcome
- One tool covers download, audio extraction, transcription and subtitle generation end to end.
- Fully offline transcription with a local Whisper model, or faster cloud runs via the OpenAI API, chosen per run.
- Ships as an installable package with a CLI and a Streamlit GUI built on the same core.