Case · Video Generation Pipeline

01 Context

Hao builds AI-autonomous orchestrators. One of them needed a pipeline that turns a topic into a finished vertical video — scripted dialogue, voiced lines, generated images, composed with sprites and karaoke subtitles, uploaded to storage. Every generated unit costs real money across LLM calls, TTS, images, and storage, so latency and cost were the whole point of the rebuild.

When I joined the project in June 2025, the pipeline was linear: a single-path orchestrator with no batch mode and no cost instrumentation, with an operator driving each hand-off between stages. Each generation took about 60 minutes and cost roughly $2. That's the starting line.

Three generated clips — same pipeline, different characters and topics.

02 Architecture

I redesigned the pipeline as a two-stage CLI API that talks to itself via JSON on stdin/stdout. Stage one generates the dialogue through a provider-agnostic LLM call, reading per-character profiles with voices, sprites and catchphrases. Stage two consumes that JSON and runs TTS and image generation in parallel, composes the final video with FFmpeg, and uploads to S3. The two commands join with a pipe, with no manual step in between.

03 What changed

Parallelized generation · throughput

Broke the sequential flow into independent units and ran LLM + TTS + image generation concurrently. Framed the bottleneck as I/O-bound network waits, not CPU, and moved everything to async tasks with tenacity-backed retries.
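A minimal sketch of that pattern, assuming nothing about the real code: `fake_tts` and `fake_image` are hypothetical stand-ins for the network calls, and the hand-rolled `with_retries` helper takes the place of tenacity only so the example runs on the standard library alone.

```python
import asyncio


async def with_retries(make_call, attempts=3, base_delay=0.01):
    """Await make_call(), retrying with exponential backoff on any exception.

    Stand-in for a tenacity retry policy; not the project's actual settings.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def fake_tts(line):
    # Hypothetical stand-in for a network-bound TTS request.
    await asyncio.sleep(0.05)
    return f"audio:{line}"


async def fake_image(prompt):
    # Hypothetical stand-in for a network-bound image request.
    await asyncio.sleep(0.05)
    return f"image:{prompt}"


async def render_assets(lines, prompts):
    """Start every TTS and image request at once and wait for all of them."""
    tts_calls = [with_retries(lambda l=l: fake_tts(l)) for l in lines]
    image_calls = [with_retries(lambda p=p: fake_image(p)) for p in prompts]
    # gather preserves argument order, so results can be split by position.
    results = await asyncio.gather(*tts_calls, *image_calls)
    return results[: len(lines)], results[len(lines):]
```

Because the calls spend their time waiting on the network, the wall time of `render_assets` tends toward the slowest single request rather than the sum of all of them.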

LLM request deduplication · cost

Scripts repeatedly asked for the same completions. Added a caller-level dedup layer keyed on prompt + model + params, so identical requests share a single response. Image prompts with stable seeds are deduplicated the same way.
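One way such a layer could look, as a sketch rather than the project's implementation: `DedupClient`, `request_key`, and the `call_provider` callback are hypothetical names. The key is a hash of prompt, model, and params, and concurrent callers with the same key await the same in-flight task.

```python
import asyncio
import hashlib
import json


def request_key(prompt, model, params):
    """Stable key: equal prompt + model + params always hash to the same value."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class DedupClient:
    """Caller-level dedup: identical requests share one provider call."""

    def __init__(self, call_provider):
        # call_provider: async (prompt, model, params) -> response (hypothetical)
        self._call_provider = call_provider
        self._tasks = {}  # key -> asyncio.Task, in flight or finished

    async def complete(self, prompt, model, **params):
        key = request_key(prompt, model, params)
        task = self._tasks.get(key)
        if task is None:
            task = asyncio.create_task(self._call_provider(prompt, model, params))
            self._tasks[key] = task
        try:
            return await task
        except Exception:
            # Do not keep failures around, so a later caller can try again.
            if self._tasks.get(key) is task:
                del self._tasks[key]
            raise
```

Sorting the keys before hashing means the order in which params are passed does not change the key. For image prompts, the seed would simply be one more entry in `params`.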

Two-stage CLI with JSON contract · composability

Split dialogue_api and video_from_dialogue_api into independent CLI commands that communicate via JSON on stdin/stdout. Either stage can be rerun, cached, or orchestrated externally without touching the other.
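To illustrate the contract, here is a sketch of the consuming side under assumed field names: `topic`, `lines`, `speaker`, and `text` are placeholders, not the real schema, and `run_stage_two` only reports what it would render instead of doing the actual work.

```python
import json
from dataclasses import dataclass


@dataclass
class Line:
    speaker: str
    text: str


def parse_dialogue(raw):
    """Validate the stage-one payload and fail loudly on a malformed contract."""
    data = json.loads(raw)
    lines = [
        Line(speaker=str(item["speaker"]), text=str(item["text"]))
        for item in data["lines"]
    ]
    if not lines:
        raise ValueError("dialogue must contain at least one line")
    return data["topic"], lines


def run_stage_two(stdin, stdout):
    """Read dialogue JSON from one text stream, write a JSON result to another."""
    topic, lines = parse_dialogue(stdin.read())
    # The real stage would run TTS, images, FFmpeg, and upload here.
    result = {
        "topic": topic,
        "line_count": len(lines),
        "speakers": sorted({line.speaker for line in lines}),
    }
    json.dump(result, stdout)
    stdout.write("\n")

# As a CLI entry point this would be called as run_stage_two(sys.stdin, sys.stdout).
```

Taking the streams as arguments keeps the stage testable with in-memory buffers, and a saved stage-one JSON file can be replayed into stage two without regenerating the dialogue.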

Character profile system · reuse

Each character is a YAML profile — ElevenLabs voice id, PNG sprite variants (telling, hearing), catchphrases. Adding a new character is a file drop, not a code change. FFmpeg overlays sprites per-line; ASS subtitles render karaoke sync per character.
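A sketch of how a loaded profile might be modeled once the YAML has been parsed into a dict. The field names (`name`, `voice_id`, `sprites`, `catchphrases`) and the rule that a character shows its "telling" sprite on its own lines and "hearing" otherwise are assumptions made for illustration.

```python
from dataclasses import dataclass

REQUIRED_SPRITES = ("telling", "hearing")


@dataclass(frozen=True)
class CharacterProfile:
    name: str
    voice_id: str            # ElevenLabs voice id
    sprites: dict            # state -> PNG path
    catchphrases: tuple = ()

    def sprite_for(self, speaker):
        """Pick the sprite for a line spoken by `speaker` (assumed rule)."""
        state = "telling" if speaker == self.name else "hearing"
        return self.sprites[state]


def load_profile(raw):
    """Build a profile from an already-parsed mapping, validating required keys."""
    missing = [key for key in ("name", "voice_id", "sprites") if key not in raw]
    if missing:
        raise ValueError(f"profile is missing fields: {missing}")
    absent = [state for state in REQUIRED_SPRITES if state not in raw["sprites"]]
    if absent:
        raise ValueError(f"profile {raw['name']!r} is missing sprites: {absent}")
    return CharacterProfile(
        name=raw["name"],
        voice_id=raw["voice_id"],
        sprites=dict(raw["sprites"]),
        catchphrases=tuple(raw.get("catchphrases", ())),
    )
```

Validating at load time means a bad file drop fails before any paid API call is made.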

Automated script workflow · manual work

Replaced operator-driven steps with automated prompt workflows, so what used to be hand-tuned by a person runs end-to-end. Placeholder fallback images cover failed generations so the pipeline never blocks on one flaky API.
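The fallback idea can be sketched as a thin wrapper around whichever image call is in use. The `generate` callback, the `PLACEHOLDER` path, and the 30-second timeout are all hypothetical choices for this example.

```python
import asyncio
import logging

log = logging.getLogger("pipeline.images")
PLACEHOLDER = "assets/placeholder.png"  # hypothetical path to a bundled image


async def image_or_placeholder(generate, prompt, timeout=30.0):
    """Return the generated image path, or the placeholder on failure or timeout."""
    try:
        return await asyncio.wait_for(generate(prompt), timeout)
    except Exception as exc:
        log.warning("image generation failed for %r: %s; using placeholder", prompt, exc)
        return PLACEHOLDER


async def generate_all(generate, prompts, timeout=30.0):
    """Generate every image concurrently; one failure never blocks the rest."""
    return await asyncio.gather(
        *(image_or_placeholder(generate, prompt, timeout) for prompt in prompts)
    )
```

Logging each substitution keeps the failures visible, so a run that finished on placeholders can be spotted and regenerated later.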

04 Results

6× faster per-unit processing: 60 min → 10 min per generation

−60% generation cost after dedup: ≈$2 → ≈$0.80 per generation

½× manual script work: automated prompt workflows

Takeaway

The cost and throughput wins came from treating every API call — LLM, TTS, image — as one I/O-bound problem to parallelize and dedup. The two-stage JSON contract was the unlock: either stage can be re-run, cached, or replayed without touching the other.

05 Stack

Orchestration

SceneOrchestrator, two-stage CLI · JSON stdin/stdout, pydantic-settings, structlog, tenacity retries

Dialogue

LLM dialogue, structured outputs, per-character profiles

Voice

ElevenLabs TTS, pydub loudness norm

Images

pluggable image providers, provider-agnostic adapter, placeholder fallback

Video

FFmpeg, ASS karaoke subtitles, sprite overlays, background loop trimming

Storage

Backblaze S3, boto3

Infra

Python 3.13, uv, SOCKS5 proxy

Video generation pipeline, rebuilt for throughput & cost.
