AI Tooling · 2026

YouTube Video Transcriber

A modular Python CLI and Streamlit GUI that downloads YouTube videos and turns them into text transcripts and .srt subtitles, using either a local Whisper model or the OpenAI API.

  • Python
  • Whisper
  • OpenAI API
  • yt-dlp
  • FFmpeg
  • Streamlit

Role: Design · Build
Timeline: 2026 · Mar to Jun

The problem

Getting the text out of a YouTube video (a talk, an interview, a lecture) usually means settling for unreliable auto-captions or paying a transcription service per minute. I could not find a simple tool that covered the whole flow and also offered a choice between cloud quality and fully offline, cost-free transcription.

The approach

I built it as a modular Python package: yt-dlp handles the download, FFmpeg the audio extraction, and the transcription engine is pluggable — a locally run Whisper model (offline, GPU-accelerated) or the OpenAI Whisper API. Optional dependency groups mean you install only what you use, and both the CLI and the Streamlit GUI sit on the same core.
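
A minimal sketch of how such a pluggable engine might look, not the project's actual code. The names `Engine`, `LocalWhisperEngine`, `OpenAIWhisperEngine` and `get_engine` are hypothetical. The sketch assumes the open-source `whisper` package and the official `openai` client, and imports each one only inside the engine that needs it, so neither has to be installed for the other path to work.

```python
from typing import Any, Optional, Protocol


class Engine(Protocol):
    """Anything that can turn an audio file into text."""

    def transcribe(self, audio_path: str) -> str: ...


class LocalWhisperEngine:
    """Runs the open-source `whisper` package on this machine."""

    def __init__(self, model_name: str = "base") -> None:
        self.model_name = model_name
        self._model: Optional[Any] = None  # stays empty until first use

    def _load(self) -> Any:
        if self._model is None:
            import whisper  # lazy import: the API path never needs it

            self._model = whisper.load_model(self.model_name)
        return self._model

    def transcribe(self, audio_path: str) -> str:
        result = self._load().transcribe(audio_path)
        return result["text"].strip()


class OpenAIWhisperEngine:
    """Sends the audio file to the hosted transcription endpoint."""

    def __init__(self, model: str = "whisper-1") -> None:
        self.model = model

    def transcribe(self, audio_path: str) -> str:
        from openai import OpenAI  # lazy import; reads OPENAI_API_KEY

        client = OpenAI()
        with open(audio_path, "rb") as audio_file:
            response = client.audio.transcriptions.create(
                model=self.model, file=audio_file
            )
        return response.text


# Hypothetical registry shared by every front end.
ENGINES = {"local": LocalWhisperEngine, "api": OpenAIWhisperEngine}


def get_engine(name: str) -> Engine:
    """Return a fresh engine for the given name ('local' or 'api')."""
    if name not in ENGINES:
        raise ValueError(f"unknown engine {name!r}; expected one of {sorted(ENGINES)}")
    return ENGINES[name]()
```

Creating an engine is cheap in this sketch: the local model is loaded on the first `transcribe` call, not at construction time, so a front end can build the engine up front without paying the load cost.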

How it works

  1. Fetch

    yt-dlp downloads the video or audio-only stream, with format selection, browser-cookie support, and local files as an alternative input.

  2. Extract

    FFmpeg extracts and converts the audio, automatically re-encoding to MP3 when a file exceeds the API size limit.

  3. Transcribe

    The pluggable engine runs a lazily loaded local Whisper model or calls the OpenAI Whisper API; the user picks the trade-off per run.

  4. Export

    Transcripts are saved as plain text and .srt subtitles; files that already exist are detected and skipped, so nothing is processed twice.
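
The export step can be illustrated with a short sketch. It assumes the engine returns segments as dictionaries with `start`, `end` (in seconds) and `text` keys, which is the shape the open-source `whisper` package produces; the helper names `srt_timestamp`, `segments_to_srt` and `write_if_missing` are made up for this example.

```python
from pathlib import Path
from typing import Dict, List, Union

Segment = Dict[str, Union[float, str]]


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: List[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(float(seg["start"]))
        end = srt_timestamp(float(seg["end"]))
        text = str(seg["text"]).strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


def write_if_missing(path: Path, content: str) -> bool:
    """Write content unless the file already exists. Return True if written."""
    if path.exists():
        return False  # already produced by an earlier run, so skip it
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True
```

Checking for the output file before doing any work is what makes a re-run cheap: in a fuller pipeline the same check would sit in front of the download and transcription steps as well.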

Stack

Language: Python 3.9+
Download: yt-dlp + FFmpeg
Engines: Local Whisper · OpenAI Whisper API
Interface: CLI + Streamlit GUI
Packaging: Modular extras (local / api / gui)
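
One way the modular extras could surface at runtime is a small guard that checks whether an extra's modules are importable and, if not, tells the user which extra to install. This is a sketch under assumptions: the mapping from extra to module names is a guess, and `your-package` is a placeholder for the real distribution name.

```python
import importlib.util
from typing import Dict, List

# Hypothetical mapping: extra name -> modules that extra is assumed to install.
EXTRA_MODULES: Dict[str, List[str]] = {
    "local": ["whisper"],
    "api": ["openai"],
    "gui": ["streamlit"],
}


def missing_modules(extra: str) -> List[str]:
    """Return the modules of an extra that cannot be imported right now."""
    return [
        name
        for name in EXTRA_MODULES[extra]
        if importlib.util.find_spec(name) is None
    ]


def require_extra(extra: str, package: str = "your-package") -> None:
    """Raise a readable error if the extra's modules are not installed."""
    missing = missing_modules(extra)
    if missing:
        raise RuntimeError(
            f"Missing {', '.join(missing)}. "
            f"Install with: pip install '{package}[{extra}]'"
        )
```

Calling `require_extra("local")` just before the local engine is created would turn a bare `ImportError` into an actionable message.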

What I learned

  • Optional dependency groups keep a tool light: video download, local inference and the API client install independently.
  • Idempotent file handling — skip whatever already exists — turns a slow media pipeline into something you can re-run without thinking.
  • Handling real-world media (long videos, size limits, odd formats) is where most of the engineering actually lives.
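
As an illustration of the size-limit case, here is a rough sketch of deciding whether a file needs re-encoding and building the FFmpeg command for it. The 25 MB threshold is an assumption based on the upload cap OpenAI documents for its transcription endpoint; the function names, the 64k bitrate and the `.small.mp3` suffix are invented for the example.

```python
from pathlib import Path
from typing import List, Optional

# Assumption: 25 MB upload cap for the hosted transcription endpoint.
API_SIZE_LIMIT_BYTES = 25 * 1024 * 1024


def needs_reencode(size_bytes: int, limit: int = API_SIZE_LIMIT_BYTES) -> bool:
    """True when the file is too large to upload as-is."""
    return size_bytes > limit


def mp3_reencode_command(source: Path, target: Path, bitrate: str = "64k") -> List[str]:
    """Build an FFmpeg command that drops video and writes mono MP3."""
    return [
        "ffmpeg",
        "-y",              # overwrite the target if it exists
        "-i", str(source),
        "-vn",             # ignore any video stream
        "-ac", "1",        # downmix to mono
        "-b:a", bitrate,   # target audio bitrate
        str(target),
    ]


def plan_upload(source: Path) -> Optional[List[str]]:
    """Return the command to run first, or None if the file fits as-is."""
    if not needs_reencode(source.stat().st_size):
        return None
    target = source.with_name(source.stem + ".small.mp3")
    return mp3_reencode_command(source, target)
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. A very long recording may still exceed the limit after re-encoding, in which case it would have to be split into chunks, which this sketch does not attempt.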

Outcome

  • One tool covers download, audio extraction, transcription and subtitle generation end to end.
  • Fully offline transcription with a local Whisper model, or faster cloud runs via the OpenAI API — chosen per run.
  • Ships as an installable package with a CLI and a Streamlit GUI built on the same core.
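
To show what "built on the same core" can mean in practice, here is a minimal sketch of a CLI wrapper around a single core function. `transcribe_source` is a stand-in for the real pipeline, and the argument names are hypothetical rather than the tool's actual flags.

```python
import argparse
from typing import List, Optional


def transcribe_source(source: str, engine: str = "local") -> dict:
    """Stand-in for the shared core; a real version would run the pipeline."""
    return {"source": source, "engine": engine}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Transcribe a video or audio source.")
    parser.add_argument("source", help="YouTube URL or local file path")
    parser.add_argument("--engine", choices=["local", "api"], default="local")
    return parser


def main(argv: Optional[List[str]] = None) -> dict:
    """Parse arguments and hand them to the shared core function."""
    args = build_parser().parse_args(argv)
    return transcribe_source(args.source, engine=args.engine)
```

A Streamlit front end would call the same `transcribe_source` from a button handler, so both interfaces share one code path and differ only in how they collect input.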