Last month I wrote about building an automated short-form content pipeline. The goal was simple: make YouTube Shorts automatically until quality becomes inevitable.
Here's the honest update on where that stands.
The Primitive Pipeline Works
Let me be clear about something: I have a working system.
It's not pretty. It's not fully autonomous. But it produces real videos that go on real platforms:
- Scrape trending GitHub repos
- Generate scripts automatically (human review for approval)
- Capture mobile screenshots with Playwright
- Generate voiceover with Kokoro TTS
- Apply Ken Burns motion effects
- Burn word-synced subtitles
- Output a YouTube-ready Short
Cost: $0/month. Everything runs on free tools and local compute.
Time: 6-10 minutes per video. Not bad, but not "fire and forget" either.
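For the curious, here's a minimal sketch of two of those stages: the Playwright mobile-screenshot capture and the Ken Burns + subtitle render. The function names, device profile, and zoom numbers are illustrative rather than the pipeline's actual code, and it assumes ffmpeg is on the PATH.

```python
import subprocess

from playwright.sync_api import sync_playwright


def capture_mobile_screenshot(url: str, out_path: str) -> None:
    """Grab a phone-sized screenshot of a repo page."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        phone = p.devices["iPhone 13"]  # built-in Playwright device profile
        page = browser.new_context(**phone).new_page()
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_path)
        browser.close()


def render_ken_burns(image: str, subs: str, out: str, seconds: int = 35) -> None:
    """Turn a still screenshot into a 1080x1920 clip with a slow zoom and burned-in subtitles."""
    subprocess.run([
        "ffmpeg", "-y", "-i", image,
        "-filter_complex",
        # zoompan: small zoom increment per frame, capped at 1.3x, 25 fps vertical output
        f"zoompan=z='min(zoom+0.001,1.3)':d={seconds * 25}:s=1080x1920:fps=25,"
        f"subtitles={subs}",
        "-c:v", "libx264", "-pix_fmt", "yuv420p", out,
    ], check=True)
```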
The channel has 60+ videos produced, 49 published on YouTube. All made with this pipeline. It works. But "works" isn't the same as "scales."
What's Actually Performing (The Data)
Before talking about the future, here's what the analytics show:
Top Videos by Retention:
- Stirling-PDF (local PDF tools): 59.84% retention, 1,302 views
- LocalSend (cross-platform file transfer): 58.77% retention, 794 views
- MIT CAD: 54.9% retention, 1,297 views
- Voice Assistant (privacy angle): 48.14% retention, 302 views
Top Videos by Total Views:
- Home Assistant: 1,755 views, 47.7% retention
- Jellyfin: 1,322 views, 46.1% retention
- Stirling-PDF: 1,302 views, 59.8% retention
- MIT CAD: 1,297 views, 54.9% retention
What Works Right Now:
- No background music (critical; more on this below)
- Privacy/security positioning ("No Cloud Spying", "No SaaS Surveillance")
- Cross-platform utilities (Windows/Linux/Mac compatibility)
- Simple screenshots + Ken Burns motion
- Duration: 26-44 seconds (sweet spot)
The Takeaway: The primitive pipeline with screenshots actually performs well. But that doesn't mean it's the end goal.
The Background Music Discovery
Early on, I added background music to some videos. Used YouTube's own royalty-free audio library. Seemed professional. Seemed safe.
The data told a different story.
Here's what I initially thought: "Background music kills retention. People don't like videos with music."
I was wrong.
Let me show you the actual numbers:
Video with background music:
- 80.51% average retention (people who watched it loved it)
- 72.73% stayed to watch (low bounce rate)
- 29 total views
- 248 impressions
Typical video without music:
- 40-60% average retention (actually LOWER than the music video)
- 1,000-1,700 total views
- Thousands of impressions
The real problem isn't that people dislike background music. It's that YouTube's algorithm won't show them the videos.
Videos with background music get 10-40x fewer impressions than identical videos without music. When people DO find them, retention is actually excellent. But the algorithm throttles distribution.
And here's the kicker: The background music was from YouTube's own royalty-free audio library.
So YouTube provides free music for creators, creators use it in good faith, and YouTube's algorithm punishes those videos by suppressing impressions. Different teams, different incentives, creators caught in the middle.
The corrected hypothesis: Background music doesn't hurt viewer engagement. It triggers some algorithmic filter (probably audio fingerprinting or Content ID systems) that kills distribution. The videos never reach the audience that would engage with them.
The fix: No background music. Crystal-clear voiceover at 100% volume, nothing else.
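In the assembly step, that fix amounts to mapping the narration as the only audio track. A sketch, assuming ffmpeg and a plain WAV narration file (filenames and the loudness target are illustrative):

```python
import subprocess


def mux_voiceover_only(visuals: str, voiceover: str, out: str) -> None:
    """Mux the narration as the ONLY audio track; no music bed, normalized loudness."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", visuals,                           # rendered video (Ken Burns + subtitles)
        "-i", voiceover,                         # TTS narration
        "-map", "0:v", "-map", "1:a",            # video from input 0, audio from input 1, nothing else
        "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",  # keep the voice at a consistent level
        "-c:v", "copy", "-c:a", "aac", "-shortest",
        out,
    ], check=True)
```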
The Vision: Fully Autonomous
What I actually want is a system that:
- Finds its own content - monitors trending repos, news, tech feeds
- Writes its own scripts - understands what hooks work, adapts tone to topic
- Generates its own visuals - not just screenshots, but actual B-roll clips
- Assembles everything autonomously - no human in the loop except QC
- Learns from performance - what got views? What flopped? Adjust.
I've been building toward this. The architecture exists. The components are modular. But one piece is killing me.
The B-Roll Problem
Here's where I'm stuck: AI-generated cutaway clips.
The current pipeline uses screenshots. That's fine for YouTube Shorts (26-44 seconds), and the analytics prove it works. But I can't scale to long-form content (5-15 minute videos) with just README screenshots. You can't watch someone scroll through a GitHub repo for 10 minutes.
Shorts work now. Long-form needs B-roll.
What I need to unlock long-form production:
- Hands typing on a keyboard (tutorial sequences)
- An over-the-shoulder view of someone at a terminal (workflow demonstrations)
- Abstract code visualizations (concept explanations)
- Tech-aesthetic establishing shots (production value, pacing breaks)
- Screen recording sequences (actual software demos)
The plan: use an image/video model to generate a library of tagged clips. The AI writer references these tags when building scripts. "This section needs a 'typing-hands' B-roll for the tutorial segment." The assembler pulls the clip. Done.
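In code, that handoff is just a tag field in the script schema. A hypothetical sketch, with illustrative names and a simple dict-backed library standing in for the real thing:

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str               # narration for this beat of the script
    broll_tag: str | None   # e.g. "typing-hands"; None means "fall back to a screenshot"


def resolve_clip(segment: Segment, library: dict[str, list[str]]) -> str | None:
    """Return a clip path for the segment's tag, or None to use the screenshot fallback."""
    if segment.broll_tag is None:
        return None
    candidates = library.get(segment.broll_tag, [])
    return candidates[0] if candidates else None


# e.g. resolve_clip(Segment("Install it with one command.", "typing-hands"), library)
```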
The reality? Image models are not reliable enough for this yet.
What's Actually Failing
I've been experimenting with local image generation for these cutaways. Here's what I've learned:
Consistency is the killer. I can get one good "hands on keyboard" image. But I need dozens that feel like they belong in the same video. Different angles, same aesthetic. Image models don't do "same aesthetic, different shot" well without heavy prompt engineering, and even then it's a coin flip.
Hands are still cursed. We're in late 2025 and AI still struggles with hands. For a "typing on keyboard" clip, that's... a problem.
Style drift between generations. Even with the same prompt and seed, regenerating a week later gives different results. That breaks the "tagged library" concept: you can't tag something if it might look completely different tomorrow.
Video models aren't there yet. I've tested several. They're impressive for art projects. They're not reliable for "I need 3 seconds of someone typing, consistent style, no artifacts."
What I'm Trying Now
Pre-generated static library. Instead of generating on-demand, I'm building a curated set of B-roll clips. Generate hundreds, manually filter to the best 50, tag those. The AI writer pulls from a fixed library rather than requesting generation.
Downside: Limited variety. Upside: Consistent quality.
Hybrid approach. Screenshots for the main content (they work), AI-generated clips only for transitions and establishing shots where consistency matters less.
Stock footage integration. Yeah, I said it. Sometimes the answer isn't "generate it"; it's "use the thing that already exists." Pexels and similar libraries offer free-to-use tech footage. Not as cool as "fully AI-generated" but actually reliable.
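As a sketch of what that integration could look like, here's a search against the Pexels video API; the query terms, the portrait filter, and the first-file pick are my assumptions, and you need your own API key:

```python
import os

import requests


def search_tech_broll(query: str, per_page: int = 10) -> list[str]:
    """Return direct download links for stock clips matching the query."""
    resp = requests.get(
        "https://api.pexels.com/videos/search",
        headers={"Authorization": os.environ["PEXELS_API_KEY"]},
        params={"query": query, "per_page": per_page, "orientation": "portrait"},
        timeout=30,
    )
    resp.raise_for_status()
    links = []
    for video in resp.json().get("videos", []):
        files = video.get("video_files", [])
        if files:
            links.append(files[0]["link"])  # naive pick; a real pass would filter by resolution
    return links


# e.g. candidate clips for the "typing-hands" tag:
# clips = search_tech_broll("hands typing on keyboard")
```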
This is smart pragmatism, not compromise. Ship what works while building toward the ideal.
The Tagging System
This part is actually working. I'm using a graph database to store clip metadata:
- Visual description (what's in the clip)
- Mood/tone tags
- Technical tags (resolution, duration, motion level)
- Usage history (which videos used this clip)
The AI writer queries this when building scripts. "Find me a clip tagged 'terminal' and 'focused' that hasn't been used in the last 5 videos."
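A simplified sketch of that lookup, written here against a Neo4j-style store; the labels, relationship types, and property names are illustrative, not the actual schema:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (c:Clip)-[:TAGGED]->(t:Tag)
WHERE t.name IN $tags
WITH c, count(DISTINCT t) AS matched
WHERE matched = size($tags)
  AND NOT EXISTS { MATCH (v:Video)-[:USED]->(c) WHERE v.id IN $recent }
RETURN c.path AS path
LIMIT 1
"""


def find_clip(tags: list[str], recent_video_ids: list[str]) -> str | None:
    """Find one clip carrying every requested tag that none of the recent videos used."""
    with driver.session() as session:
        record = session.execute_read(
            lambda tx: tx.run(QUERY, tags=tags, recent=recent_video_ids).single()
        )
    return record["path"] if record else None


# e.g. find_clip(["terminal", "focused"], recent_video_ids=last_five_ids)
```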
The system is ready. The clips to fill it aren't. Yet.
Honest Assessment
What works:
- Script generation (automated with tone system)
- Voice synthesis (natural pacing, consistent quality)
- Motion graphics and subtitle burning (word-level sync)
- The assembly pipeline (6-10 min per video)
- The tagging/retrieval system
- Analytics tracking
- Screenshot-based content (proven 40-60% retention)
What's struggling:
- Autonomous visual generation (B-roll consistency for long-form)
- True "fire and forget" operation (still requires human review)
- Format expansion (can't scale to 10-min videos without B-roll library)
What I've learned:
- "Fully autonomous" is a spectrum, not a switch
- Sometimes the boring solution (stock footage) beats the cool solution (AI generation), for now
- The bottleneck isn't always where you expect: I thought scripting would be hard; it's the visuals
- Background music doesn't kill retention; it kills algorithmic distribution
- Privacy/security angles drive performance consistently
- Cross-platform tools outperform niche-specific content
What's Next
I'm not abandoning the vision. But I'm being realistic about the path:
Short term: Publish waiting videos, remove background music from templates, build initial curated B-roll library.
Medium term: Test B-roll integration on long-form videos, expand content in analytics-proven niches, keep testing image models as they improve.
Long term: 5-15 minute YouTube videos (requires B-roll), full autonomous generation when the tech catches up, cross-platform expansion.
The primitive pipeline keeps producing while I build the ambitious one. That's the actual strategy: ship what works, iterate toward what's possible.
The Meta Point
Every "AI automation" guru on YouTube makes it look easy. "Just prompt it right." "10x your content with this one trick."
The reality is messier. Models have limitations. Consistency is hard. The gap between "impressive demo" and "reliable production system" is massive.
I'm documenting the mess because nobody else does. The failures are where the learning happens.
And sometimes the data surprises you:
- I thought background music hurt retention. Wrong. It kills algorithmic distribution. YouTube won't show videos with background audio, even their own royalty-free music. The people who do find them actually engage well (80% retention vs 40-60% typical).
- I thought visual variety drove performance. Data proved clear audio + simple screenshots win for Shorts.
- I thought 60-second videos were fine. Turns out 26-44 seconds is the sweet spot.
The lesson: Build the vision. Measure the reality. Adjust. And don't trust your assumptions; trust the data.
More updates as things break, and as things work.
~ OnlyParams Dev
Previous post: Building an Automated Short-Form Content Pipeline