Karaoke Audio by Voice, 200 Miles Out
Today I shipped a karaoke-style audio reader for my blog posts. Click play on Part 6 of LLM Fundamentals and the words light up emerald in sync with the text-to-speech audio. The page auto-scrolls to keep the active word in view. You can change playback speed and the highlighting stays locked on.
Built the whole thing 200 miles from my laptop. The laptop runs Claude Code’s remote-control subcommand inside a detached screen session. The session shows up at claude.ai/code as a chat. From a phone in a passenger seat I type into that chat, the laptop runs the work, output streams back. No keyboard touched, no screen except the six-inch one in my hand. Three Playwright agents caught a first-click race condition that would have shipped without anyone noticing, more on that later.
The Setup
The constraint that forces good behavior: I cannot see localhost. The laptop’s npm run dev is on a network I am not on. The only feedback loop is the deployed site, so every change has to ship before it gets reviewed.
That sounds like a tax. It turns out to be a discipline.
The Loop
This runs in five-to-fifteen-minute cycles. Each cycle ships something. The phone never sees code, only specs going out and URLs coming back.
What makes it addictive: every cycle ends with a real artifact I can touch. There is no “let me show you the diff” moment, only the deployed feature.
How the Karaoke Got Built
The session opened with a different goal: a five-agent review pass on an unpublished blog post. One agent for API surface accuracy, one for billing, one for the vendor-comparison thesis, one for the tool-use protocol, one for visual gaps. Each had specific docs URLs and a tight scope. The vendor-comparison agent killed that post’s central thesis in 90 seconds, which is a different post for a different day.
What matters here is what came next. Could we build a karaoke audio player that highlights each word as the TTS plays, like Audible? The answer turned into the rest of this post.
The Pieces
ElevenLabs has a TTS endpoint variant called /with-timestamps that returns character-level alignment alongside the MP3. Two-line change in generate-tts.mjs:
```js
const url = `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}/with-timestamps`;
// POST same body, get back { audio_base64, alignment: { characters,
//   character_start_times_seconds, character_end_times_seconds } }
```

I aggregated character timings to word level by splitting on whitespace, took the first character's start time and the last character's end time per word, wrote a sidecar {slug}-timings.json, and kept the existing chunked-MP3 concat path. The Extended Thinking post produced 1295 words covering 642 seconds of audio in a 48 KB JSON.
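For concreteness, here is a minimal sketch of that aggregation step, assuming the alignment object has already been parsed out of the API response; the helper name is mine, not the script's:

```js
// Collapse character-level alignment into word-level timings:
// split on whitespace, keep the first character's start time and
// the last character's end time for each word.
function charactersToWords(alignment) {
  const {
    characters,
    character_start_times_seconds: starts,
    character_end_times_seconds: ends,
  } = alignment;
  const words = [];
  let current = null;
  characters.forEach((ch, i) => {
    if (/\s/.test(ch)) {
      if (current) { words.push(current); current = null; }
    } else if (!current) {
      current = { word: ch, start: starts[i], end: ends[i] };
    } else {
      current.word += ch;
      current.end = ends[i];
    }
  });
  if (current) words.push(current);
  return words; // serialize this array as the {slug}-timings.json sidecar
}
```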
For the client side, native browser APIs throughout. TreeWalker to find text nodes inside .prose, skipping <code>, <pre>, and <svg>. A regex split (/\S+|\s+/) and document.createDocumentFragment() to wrap each word in a <span class="audio-word">. A greedy alignment pass attaches timings, with a five-word lookahead so a rendered “$5” matches the timing word “5” even though the TTS reads “5 dollars.”
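A sketch of that wrapping pass under the same assumptions; the selector, skip list, regex, and span class come from the description above, while the function name and return shape are illustrative:

```js
// Wrap every whitespace-delimited word inside the article body in a span,
// skipping code, pre, and svg subtrees.
function wrapWords(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, {
    acceptNode(node) {
      return node.parentElement.closest('code, pre, svg')
        ? NodeFilter.FILTER_REJECT
        : NodeFilter.FILTER_ACCEPT;
    },
  });
  const textNodes = [];
  while (walker.nextNode()) textNodes.push(walker.currentNode);

  const spans = [];
  for (const node of textNodes) {
    const parts = node.textContent.match(/\S+|\s+/g);
    if (!parts) continue;
    const frag = document.createDocumentFragment();
    for (const part of parts) {
      if (/\S/.test(part)) {
        const span = document.createElement('span');
        span.className = 'audio-word';
        span.textContent = part;
        frag.appendChild(span);
        spans.push(span);
      } else {
        frag.appendChild(document.createTextNode(part));
      }
    }
    node.replaceWith(frag);
  }
  return spans; // document order, ready for the greedy timing alignment
}
```

The returned spans, in document order, are what the greedy alignment pass walks against the timings array, looking up to five timing words ahead when rendered and spoken text diverge.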
The sync trick: requestAnimationFrame polling audio.currentTime at 60 Hz instead of waiting for the default 4 Hz timeupdate event. Playback rate sync is automatic because currentTime reflects the source-timeline position regardless of playbackRate. At 1.5x the highlight advances faster in wall-clock time, but currentTime still maps directly to the timings array.
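The tick itself can stay tiny. A sketch, assuming the spans and timings are already paired in a words array; the variable names are mine:

```js
// `audio` is the player's <audio> element; `words` pairs each timing
// entry with its wrapped span: { start, end, span }.
let rafId = null;
let activeIndex = -1;

function tick() {
  const t = audio.currentTime; // source-timeline seconds, unaffected by playbackRate
  const i = words.findIndex((w) => t >= w.start && t < w.end);
  if (i !== -1 && i !== activeIndex) {
    words[activeIndex]?.span.classList.remove('active');
    words[i].span.classList.add('active');
    activeIndex = i;
  }
  rafId = requestAnimationFrame(tick); // ~60 Hz, versus the ~4 Hz timeupdate event
}

audio.addEventListener('play', () => { rafId = requestAnimationFrame(tick); });
audio.addEventListener('pause', () => cancelAnimationFrame(rafId));
```

A linear findIndex per frame is fine at roughly 1300 words; a forward-only index would be cheaper if it ever mattered.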
Auto-scroll keeps the active word in the 25-to-70-percent viewport band. Outside that, smooth-scroll to put it at 40 percent. Listen to wheel, touchmove, and arrow keys to detect manual scroll, then suppress auto-scroll for four seconds. Listen to those specific events, not the synthetic scroll event, because programmatic scrollTo fires scroll but not wheel, so the auto-scroll will not self-pause.
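A sketch of that band check and the manual-scroll suppression, using the thresholds from the paragraph above and helper names of my own:

```js
// Suppress auto-scroll for four seconds after any manual scroll input.
let lastManualScroll = 0;
const markManual = () => { lastManualScroll = Date.now(); };
window.addEventListener('wheel', markManual, { passive: true });
window.addEventListener('touchmove', markManual, { passive: true });
window.addEventListener('keydown', (e) => {
  if (e.key === 'ArrowUp' || e.key === 'ArrowDown') markManual();
});

function keepInView(span) {
  if (Date.now() - lastManualScroll < 4000) return; // the reader is driving

  const rect = span.getBoundingClientRect();
  const vh = window.innerHeight;
  if (rect.top >= vh * 0.25 && rect.top <= vh * 0.70) return; // already in the band

  // Smooth-scroll so the active word lands at 40% of viewport height.
  window.scrollTo({ top: window.scrollY + rect.top - vh * 0.40, behavior: 'smooth' });
}
```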
One Astro component, zero dependencies added. Lazy: nothing runs until the user clicks Play.
What the Test Team Caught
Three agents ran the verification pass through Playwright. The first found a race condition I would have shipped without thinking about. setupKaraoke() is async because it fetches the timings JSON and walks the DOM, but audio.play() is sync. On first click, both started in parallel. The play event fired before karaoke setup finished, and my guard if (karaokeReady) was false at that moment, so the requestAnimationFrame tick never started. Audio played, nothing lit up. Pausing and replaying would fix it because karaoke was ready by then, but no first-time user does that.
The fix was three lines at the end of setupKaraoke():
```js
if (!audio.paused) {
  cancelAnimationFrame(rafId);
  rafId = requestAnimationFrame(tick);
}
```

If audio is already running by the time karaoke setup finishes, kick off the rAF tick now.
The second agent verified the fix, then verified playback-speed sync at 1x, 1.25x, 1.5x, and 2x. Delta within 0.02 seconds of expected at every speed. It also chased a red herring: my Playwright network panel showed hundreds of GETs for the in-body SVGs during playback. The first agent had hypothesized class mutations triggering image re-resolves. The second traced the actual stack and found Astro’s dev-toolbar perf-audit rule fetching every <img src> to measure file size, re-running on each DOM mutation. Production unaffected. False alarm caught in ninety seconds because someone actually opened the network panel and read the call stack. The human equivalent of that diagnosis is an hour of guessing or a Stack Overflow rabbit hole, neither of which I had time for from a phone.
The third agent ran mobile at 390 by 844, seek (single, paginated, and paused), the listen-event beacon firing exactly once per session via sessionStorage, and the missing-timings fallback (audio plays normally on posts without sidecars). All passed.
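As a sketch of how such a once-per-session beacon can be gated (the storage key, trigger, and endpoint here are illustrative, not the post's actual implementation):

```js
// Record "this post was listened to" at most once per browser session.
audio.addEventListener('play', () => {
  const key = `listened:${slug}`; // slug identifies the current post
  if (sessionStorage.getItem(key)) return;
  sessionStorage.setItem(key, '1');
  navigator.sendBeacon('/api/listen', JSON.stringify({ slug }));
});
```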
What Makes the Loop Stick
The thing I keep underestimating: I am not waiting on agents the way I used to wait on coworkers. Agents run in parallel and report back on a timeline I can keep up with from a phone. Five reviewers spend a couple minutes each, all at once. By the time the coffee is half done, I have five expert opinions.
Geographic separation also forces a discipline I do not have at my desk. There is no half-built local state. Every change ships, every artifact gets verified by agents and by deploys. If I cannot describe what I want clearly enough to ship, I cannot get it.
Pair programming with three people who type fast and never get tired comes close. One tests on mobile while another checks billing claims while a third verifies that the blog post’s central thesis matches the vendor docs it cites. Sub-agents handled what I would have done sequentially at my desk, in a quarter of the time.
What I’d Do Differently
That five-agent review pass was the single best decision of the session, and I almost did not run it. Had any one reviewer been scoped less precisely, the vendor-comparison thesis would have shipped wrong. Lesson: scope each reviewer to one specific claim against one specific source. Generalist proofreading misses what targeted verification catches.
Async edge cases are where agents earn their keep. I have a habit of waving them away because “it’ll usually work,” and a first-click race is exactly the kind of bug that survives manual testing because the second click hides it. Always test the cold-start path, not just the warm path.
Total Cost
One session, about three hours of phone time, of which maybe an hour was focused dictation. The rest was watching agents work, reviewing what shipped, picking the next thing. ElevenLabs charges around 50 cents to regenerate a 10-minute post with timestamps. Cloudflare bandwidth is free at this volume. My Claude Max plan was already running.
A surface I had wanted for months, shipped from a passenger seat. Building from a desk would have meant a weekend. Building by voice meant the weekend was still mine.