Full video generation at Hao, from topic to published MP4: dialogue, TTS, images, FFmpeg composition, and S3 upload, structured as a two-stage JSON API, parallelized across every I/O-bound call, and deduplicated end-to-end.
Hao builds AI-autonomous orchestrators. One of them needed a pipeline that turns a topic into a finished vertical video — scripted dialogue, voiced lines, generated images, composed with sprites and karaoke subtitles, uploaded to storage. Every generated unit costs real money — LLM calls, TTS, images, storage — so latency and cost are not polish, they're the business.
When I joined the project in June 2025, the pipeline was linear — a single-path orchestrator, no batch mode, no cost instrumentation, operator-driven between stages. Each generation took minutes and cost roughly $2. That's the starting line.
I redesigned the pipeline as a two-stage CLI API that talks to itself via JSON on stdin/stdout. Stage one generates the dialogue through a provider-agnostic LLM call, reading per-character profiles with voices, sprites and catchphrases. Stage two consumes that JSON and runs TTS and image generation in parallel, composes the final video with FFmpeg, uploads to S3. Two commands, one pipe, full automation.
I broke the sequential flow into independent units and ran LLM, TTS, and image-generation calls concurrently. The bottleneck was I/O-bound network waits, not CPU, so everything moved to async tasks with tenacity-backed retries.
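A minimal sketch of that concurrency pattern, not the production code. The names `fake_tts`, `fake_image`, and `render_assets` are hypothetical stand-ins for the real network calls, and `with_retries` is a small stdlib substitute for the tenacity decorators, used here only to keep the example dependency-free.

```python
import asyncio
from functools import partial


async def with_retries(make_call, attempts=3, base_delay=0.01):
    """Retry an async call with exponential backoff (stdlib stand-in for tenacity)."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def fake_tts(line):
    # Hypothetical stand-in for a network-bound TTS request.
    await asyncio.sleep(0.05)
    return f"audio:{line}"


async def fake_image(prompt):
    # Hypothetical stand-in for a network-bound image request.
    await asyncio.sleep(0.05)
    return f"image:{prompt}"


async def render_assets(lines, image_prompts):
    """Run every TTS and image call concurrently; results keep input order."""
    tts_calls = [with_retries(partial(fake_tts, line)) for line in lines]
    image_calls = [with_retries(partial(fake_image, p)) for p in image_prompts]
    audio, images = await asyncio.gather(
        asyncio.gather(*tts_calls),
        asyncio.gather(*image_calls),
    )
    return audio, images


if __name__ == "__main__":
    print(asyncio.run(render_assets(["hi", "yo"], ["cat"])))
```

Because every call spends its time waiting on the network, the total wall time approaches that of the slowest single call rather than the sum of all calls.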
Scripts repeatedly asked for near-identical completions. I added a caller-level dedup layer keyed on prompt + model + params, so identical requests share a single response. The same applies to image prompts with stable seeds.
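One way such a dedup layer could look, as a sketch under assumptions: the key is a SHA-256 hash of the canonical JSON of prompt, model, and params, and concurrent callers with the same key await one shared task. `DedupCache` and `request_key` are hypothetical names, not the project's actual classes.

```python
import asyncio
import hashlib
import json


def request_key(prompt, model, params):
    """Stable key: canonical JSON (sorted keys) hashed with SHA-256."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class DedupCache:
    """Share one in-flight or completed response among identical requests."""

    def __init__(self, call):
        self._call = call  # async callable: (prompt, model, **params) -> response
        self._tasks = {}

    async def get(self, prompt, model, **params):
        key = request_key(prompt, model, params)
        task = self._tasks.get(key)
        if task is None:
            # First caller for this key starts the request; later callers reuse it.
            task = asyncio.create_task(self._call(prompt, model, **params))
            self._tasks[key] = task
        try:
            return await task
        except Exception:
            # Do not keep failed requests cached; a later call may succeed.
            self._tasks.pop(key, None)
            raise
```

Storing the task rather than the finished result means requests that arrive while the first one is still in flight are deduplicated too, which matters when many calls are launched at once.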
Split dialogue_api and video_from_dialogue_api into independent CLI commands that communicate via JSON on stdin/stdout. Either stage can be rerun, cached, or orchestrated externally without touching the other.
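The contract between the two stages can be sketched as follows. The JSON field names (`topic`, `lines`, `character`, `text`) and the function names are assumptions made for illustration; the real schema is not shown here. An in-memory buffer stands in for the shell pipe.

```python
import io
import json
import sys


def dialogue_stage(topic, out=sys.stdout):
    """Stage one: write a dialogue document as a single JSON object."""
    doc = {
        "topic": topic,
        "lines": [
            {"character": "alice", "text": f"Let's talk about {topic}."},
            {"character": "bob", "text": "Go on."},
        ],
    }
    json.dump(doc, out)
    out.write("\n")


def video_stage(inp=sys.stdin):
    """Stage two: read the JSON document, validate it, and plan the render."""
    doc = json.load(inp)
    for line in doc["lines"]:
        if not {"character", "text"} <= line.keys():
            raise ValueError(f"malformed dialogue line: {line!r}")
    # A real stage would run TTS, images, FFmpeg, and upload here.
    return {"topic": doc["topic"], "line_count": len(doc["lines"])}


def run_pipeline(topic):
    """Simulate the pipe between the two stages with an in-memory buffer."""
    pipe = io.StringIO()
    dialogue_stage(topic, out=pipe)
    pipe.seek(0)
    return video_stage(pipe)
```

Since the only coupling is the JSON document, the output of stage one can be saved to a file and fed to stage two later, which is what makes rerunning or caching a single stage cheap.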
Each character is a YAML profile — ElevenLabs voice id, PNG sprite variants (telling, hearing), catchphrases. Adding a new character is a file drop, not a code change. FFmpeg overlays sprites per-line; ASS subtitles render karaoke sync per character.
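To illustrate the per-character karaoke subtitles, here is a sketch that builds one ASS `Dialogue` event using the standard `\k` karaoke tag, whose duration is given in centiseconds. The `CharacterProfile` fields mirror the profile described above, but the exact YAML keys are assumptions, as is the convention that the ASS file defines one style named after each character.

```python
from dataclasses import dataclass, field


@dataclass
class CharacterProfile:
    # Hypothetical in-memory form of one YAML profile.
    name: str
    voice_id: str
    sprites: dict = field(default_factory=dict)   # e.g. {"telling": ..., "hearing": ...}
    catchphrases: list = field(default_factory=list)


def ass_timestamp(seconds):
    """Format seconds as H:MM:SS.cc, the timestamp form used in ASS events."""
    total_cs = round(seconds * 100)
    hours, rem = divmod(total_cs, 360000)
    minutes, rem = divmod(rem, 6000)
    secs, cs = divmod(rem, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def karaoke_event(profile, start, words):
    """Build one Dialogue event; words is a list of (text, duration_seconds)."""
    # Each word is prefixed with {\kN}, where N is its duration in centiseconds.
    text = " ".join(f"{{\\k{round(dur * 100)}}}{word}" for word, dur in words)
    end = start + sum(dur for _, dur in words)
    # Assumption: the style name equals the character name.
    return (
        f"Dialogue: 0,{ass_timestamp(start)},{ass_timestamp(end)},"
        f"{profile.name},,0,0,0,,{text}"
    )
```

The word durations would come from the TTS timing data; here they are passed in directly so the formatting logic can be checked on its own.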
Replaced operator-driven steps with automated prompt workflows, so what used to be hand-tuned by a person runs end-to-end. Placeholder fallback images cover failed generations so the pipeline never blocks on one flaky API.
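The placeholder fallback can be sketched like this. The placeholder path, the timeout value, and the set of exceptions treated as a failed generation are assumptions for illustration.

```python
import asyncio

PLACEHOLDER = "assets/placeholder.png"  # hypothetical path to a bundled image


async def image_or_placeholder(generate, prompt, timeout=5.0):
    """Return the generated image path, or the placeholder if the call fails."""
    try:
        return await asyncio.wait_for(generate(prompt), timeout)
    except (asyncio.TimeoutError, ConnectionError):
        # One flaky or slow image must not block the whole video.
        return PLACEHOLDER


async def images_for(generate, prompts, timeout=5.0):
    """Resolve all prompts concurrently; failures degrade to placeholders."""
    return await asyncio.gather(
        *(image_or_placeholder(generate, p, timeout) for p in prompts)
    )
```

Handling the failure inside each call, rather than around the whole batch, keeps the successful images and swaps in the placeholder only where it is needed.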
The cost and throughput wins came from treating every API call (LLM, TTS, image) as one I/O-bound problem to parallelize and deduplicate. The two-stage JSON contract was the unlock: either stage can be rerun, cached, or replayed without touching the other.