Case Study at Hao

Video generation pipeline, rebuilt for throughput & cost.

Full video generation at Hao, from topic to published MP4: dialogue, TTS, images, FFmpeg compose, and S3 upload, structured as a two-stage JSON API, parallelized across every I/O-bound call, and deduplicated end to end.

Role: Lead on the pipeline
Scope: Dialogue · TTS · images · compose · cost tracking
Timeline: Jun 2025 – Nov 2025
Status: Live in production

01 Context

Hao builds AI-autonomous orchestrators. One of them needed a pipeline that turns a topic into a finished vertical video — scripted dialogue, voiced lines, generated images, composed with sprites and karaoke subtitles, uploaded to storage. Every generated unit costs real money — LLM calls, TTS, images, storage — so latency and cost are not polish, they're the business.

When I joined the project in June 2025, the pipeline was linear: a single-path orchestrator with no batch mode, no cost instrumentation, and operator-driven handoffs between stages. Each generation took about 60 minutes and cost roughly $2. That's the starting line.

02 Architecture

I redesigned the pipeline as a two-stage CLI API that talks to itself via JSON on stdin/stdout. Stage one generates the dialogue through a provider-agnostic LLM call, reading per-character profiles with voices, sprites, and catchphrases. Stage two consumes that JSON, runs TTS and image generation in parallel, composes the final video with FFmpeg, and uploads the result to S3. Two commands, one pipe, full automation.

03 What changed

Parallelized generation · throughput

Broke the sequential flow into independent units and ran LLM + TTS + image generation concurrently. Framed the bottleneck as I/O-bound network waits, not CPU, and moved everything to async tasks with tenacity-backed retries.
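
A minimal sketch of that pattern, not the production code. The function names (`synthesize_line`, `generate_image`, `render_assets`) are hypothetical, the sleeps stand in for network waits, and the hand-rolled `with_retries` is a stdlib stand-in for the tenacity-backed retries so the example has no third-party dependency.

```python
import asyncio
from functools import partial


async def with_retries(make_call, attempts=3, base_delay=0.01):
    """Await make_call(), retrying with exponential backoff on failure.

    Stand-in for a tenacity retry policy; attempts and delays are assumptions.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def synthesize_line(text):
    # Hypothetical TTS call; the sleep simulates waiting on the network.
    await asyncio.sleep(0.05)
    return f"audio:{text}"


async def generate_image(prompt):
    # Hypothetical image-provider call.
    await asyncio.sleep(0.05)
    return f"image:{prompt}"


async def render_assets(lines, image_prompts):
    """Run every TTS and image request concurrently and keep input order."""
    calls = [with_retries(partial(synthesize_line, text)) for text in lines]
    calls += [with_retries(partial(generate_image, p)) for p in image_prompts]
    results = await asyncio.gather(*calls)
    return results[: len(lines)], results[len(lines):]
```

Because the work is network-bound, total wall time approaches the slowest single call rather than the sum of all calls, which is where the throughput gain comes from.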

LLM request deduplication · cost

Scripts repeatedly asked for near-identical completions. Added a caller-level dedup layer keyed on prompt + model + params, so identical requests share a single response. The same applies to image prompts with stable seeds.
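
One way such a layer could look, as a sketch under assumptions: the class name `RequestDeduper`, the SHA-256 key, and the in-memory task map are illustrative choices, not the actual implementation. The key covers prompt, model, and params, so a seeded image request would dedupe the same way if the seed is part of `params`.

```python
import asyncio
import hashlib
import json


class RequestDeduper:
    """Share one in-flight or finished response among identical requests."""

    def __init__(self):
        self._tasks = {}  # request key -> asyncio.Task

    @staticmethod
    def key(prompt, model, params):
        # sort_keys makes the key independent of dict ordering in params.
        payload = json.dumps(
            {"prompt": prompt, "model": model, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    async def fetch(self, prompt, model, params, call):
        """Return the response for this request, calling the provider once."""
        k = self.key(prompt, model, params)
        task = self._tasks.get(k)
        if task is None:
            # First caller starts the request; later callers await the same task.
            task = asyncio.create_task(call(prompt, model, params))
            self._tasks[k] = task
        try:
            return await task
        except Exception:
            # Do not cache failures: let the next caller try again.
            self._tasks.pop(k, None)
            raise
```

This only merges exact matches. Near-identical prompts would need to be normalized before keying, which is outside this sketch.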

Two-stage CLI with JSON contract · composability

Split dialogue_api and video_from_dialogue_api into independent CLI commands that communicate via JSON on stdin/stdout. Either stage can be rerun, cached, or orchestrated externally without touching the other.
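
The contract between the two stages can be pictured roughly as below. This is a sketch: the field names (`topic`, `lines`, `speaker`, `text`) and the helper names are assumptions for illustration, since the real schema is not shown here.

```python
import json
import sys
from dataclasses import asdict, dataclass


@dataclass
class DialogueLine:
    speaker: str  # hypothetical field: which character says the line
    text: str     # hypothetical field: what they say


def write_dialogue(topic, lines, out=None):
    """Stage one: emit the dialogue as a single JSON document."""
    out = out or sys.stdout
    json.dump({"topic": topic, "lines": [asdict(line) for line in lines]}, out)


def read_dialogue(inp=None):
    """Stage two: parse the JSON document produced by stage one."""
    inp = inp or sys.stdin
    doc = json.load(inp)
    return doc["topic"], [DialogueLine(**line) for line in doc["lines"]]
```

Because the only coupling is a JSON document, stage one's output can be piped straight into stage two, or saved to a file and replayed later without regenerating the dialogue.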

Character profile system · reuse

Each character is a YAML profile — ElevenLabs voice id, PNG sprite variants (telling, hearing), catchphrases. Adding a new character is a file drop, not a code change. FFmpeg overlays sprites per-line; ASS subtitles render karaoke sync per character.
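
As an illustration of the profile shape once it has been loaded from YAML (loading itself is omitted here), the sketch below uses hypothetical names throughout. It also assumes the "telling" sprite is shown for the character speaking the current line and "hearing" for everyone else, which is an interpretation of the variant names rather than something confirmed above.

```python
from dataclasses import dataclass


@dataclass
class CharacterProfile:
    name: str
    voice_id: str            # ElevenLabs voice id
    sprites: dict            # variant name ("telling" / "hearing") -> PNG path
    catchphrases: tuple = ()


def sprite_for(profile, speaker_name):
    """Pick the sprite variant to overlay for one dialogue line.

    Assumption: the speaker gets "telling", listeners get "hearing".
    """
    variant = "telling" if profile.name == speaker_name else "hearing"
    return profile.sprites[variant]


# Hypothetical profile contents, mirroring what one YAML file might hold.
ana = CharacterProfile(
    name="ana",
    voice_id="voice-123",
    sprites={"telling": "ana_telling.png", "hearing": "ana_hearing.png"},
    catchphrases=("Let's dig in!",),
)
```

With profiles as plain data, the compose step only needs the profile and the current speaker to decide which PNG to overlay, so a new character needs no new code path.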

Automated script workflow · manual work

Replaced operator-driven steps with automated prompt workflows, so what used to be hand-tuned by a person runs end-to-end. Placeholder fallback images cover failed generations so the pipeline never blocks on one flaky API.
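
The placeholder fallback can be sketched as a thin wrapper around the image call. The placeholder path, the function name, and the timeout are assumptions added for illustration.

```python
import asyncio

PLACEHOLDER_IMAGE = "assets/placeholder.png"  # hypothetical path


async def image_or_placeholder(generate, prompt, timeout=30.0):
    """Return the generated image, or the placeholder if generation fails.

    The timeout is an added assumption so a hung provider also falls back.
    """
    try:
        return await asyncio.wait_for(generate(prompt), timeout=timeout)
    except Exception:
        # Any provider error or timeout degrades to the placeholder
        # instead of stopping the whole pipeline.
        return PLACEHOLDER_IMAGE
```

Task cancellation is not swallowed here, because `asyncio.CancelledError` does not derive from `Exception`, so an external orchestrator can still stop a run cleanly.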

04 Results

Faster per-unit processing: 60 min → 10 min per generation
−60% generation cost after dedup: ≈$2 → ≈$0.80 per generation
½× manual script work: automated prompt workflows

Takeaway

The cost and throughput wins came from treating every API call — LLM, TTS, image — as one I/O-bound problem to parallelize and dedup. The two-stage JSON contract was the unlock: either stage can be re-run, cached, or replayed without touching the other.

05 Stack

Orchestration: SceneOrchestrator, two-stage CLI · JSON stdin/stdout, pydantic-settings, structlog, tenacity retries
Dialogue: LLM dialogue, structured outputs, per-character profiles
Voice: ElevenLabs TTS, pydub, loudness normalization
Images: pluggable image providers, provider-agnostic adapter, placeholder fallback
Video: FFmpeg, ASS karaoke subtitles, sprite overlays, background loop trimming
Storage: Backblaze S3, boto3
Infra: Python 3.13, uv, SOCKS5 proxy