
My Mastra Agent Found a Production Bug in Five Minutes

I stood up a Mastra workflow from my phone via Telegram to monitor Cloudflare Workers. It caught bots crashing my media proxy on the very first run.

Tags: dev · ai · cloudflare · workflow

I saw a tweet about Sentry shipping an AI agent that watches your error logs and auto-suggests fixes. I thought: I can build that. So I did, from my phone, on a Friday afternoon, while walking around the house.

I stood up a Mastra workflow via Telegram, pointed it at my three Cloudflare Workers sites, and ran it. Within five minutes it flagged scriptThrewException errors on all three sites. Bots were hitting my media proxy endpoint, the Worker was crashing on every request, and my uptime monitor had been saying everything was fine for days.

Stack: Mastra for the workflow engine, Cloudflare’s GraphQL Analytics API for the data, a Telegram bot for alerts. Runs locally on my Mac, costs nothing, and took about two hours to build. I pushed a blog post from a mountain via Telegram that same afternoon. Here’s how it works and what it found.

Why “Is It Up?” Isn’t Enough

I already had an uptime monitor running as a GitHub Action every 30 minutes. It hits each site, checks for a 200 response, and moves on. All three of my sites (bdigitalmedia.io, tech.bdigitalmedia.io, and truck.bdigitalmedia.io) were passing every check. Green across the board.

But the sites were throwing scriptThrewException errors on every Worker invocation. Bots and crawlers hitting API routes, the Worker script crashing, Cloudflare returning 500s. The uptime monitor never noticed because it only checks static pages, which are served directly from Cloudflare’s edge cache without invoking the Worker at all.

I needed something that could look at the actual error rates across all HTTP traffic, not just ping the homepage and call it a day.

What Is Mastra?

Mastra is a TypeScript framework for building AI agents and workflows. It comes from the team behind Gatsby, it’s open source (Apache 2.0), and it has over 22,000 GitHub stars. I use it at my day job and wanted to practice building workflows with it outside of work.

It’s built around a simple concept: you define steps with Zod-validated input/output schemas and chain them together with .then(). Each step is a typed function that receives data from the previous step and returns data for the next one.

const workflow = createWorkflow({
  id: "site-monitor",
  inputSchema: z.object({ sites: z.array(siteSchema) }),
  outputSchema: z.object({ sent: z.boolean() }),
})
  .then(fetchMetrics)
  .then(analyzeFindings)
  .then(sendTelegram)
  .commit();

Three steps. Fetch the data, analyze it, send an alert if something is wrong. That’s the entire workflow.

Pulling Metrics from Cloudflare

My sites run on Cloudflare Workers. Cloudflare exposes a GraphQL Analytics API that gives you zone-level HTTP request data: total requests, status codes, response times, broken down by hostname.

Here’s what I learned the hard way: I needed zone-level HTTP analytics, not Worker invocation metrics. My sites use Cloudflare’s [assets] directive, which means static pages are served directly from the edge without invoking the Worker script. If you only look at Worker invocations, you’re seeing a tiny slice of traffic, mostly API routes and bot requests, and the error rates look wildly wrong.
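For context, the [assets] directive lives in the Worker's wrangler.toml and looks roughly like this (the directory path is illustrative):

```toml
# Static files in this directory are served straight from Cloudflare's edge.
# The Worker script only runs for requests that don't match a static asset.
[assets]
directory = "./dist"
```

That edge-serving behavior is exactly why uptime checks against static pages never exercise the Worker code.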

My first version used workersInvocationsAdaptive and reported 100% error rates on all three sites. Every single request was an error. That looked terrifying until I realized the only Worker invocations were bots crashing on API routes. Real visitor traffic (people reading blog posts, browsing the portfolio, buying video clips) was invisible to that metric.

Switching to httpRequestsAdaptiveGroups at the zone level gave me the complete picture:

bdigitalmedia.io:       348 requests | 46 5xx | 13.2% error rate
tech.bdigitalmedia.io:  299 requests | 7 5xx  | 2.3% error rate
truck.bdigitalmedia.io: 231 requests | 2 5xx  | 0.9% error rate

Real data. And the 46 errors on the main site were real errors worth investigating.
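For reference, the zone-level query can be sketched like this. The dataset and dimension names come from Cloudflare's public GraphQL Analytics schema; the exact filter shape and the helper function are my assumptions, not the workflow's literal code:

```typescript
// Sketch of a zone-level HTTP analytics query against Cloudflare's
// GraphQL Analytics API (httpRequestsAdaptiveGroups dataset).
const ZONE_HTTP_QUERY = `
  query ($zoneTag: String!, $since: Time!, $until: Time!) {
    viewer {
      zones(filter: { zoneTag: $zoneTag }) {
        httpRequestsAdaptiveGroups(
          filter: { datetime_geq: $since, datetime_leq: $until }
          limit: 100
        ) {
          count
          dimensions {
            clientRequestHTTPHost
            edgeResponseStatus
          }
        }
      }
    }
  }
`;

// Hypothetical helper: POST the query with an API token.
async function fetchZoneMetrics(
  apiToken: string,
  zoneTag: string,
  since: string,
  until: string
): Promise<unknown> {
  const res = await fetch("https://api.cloudflare.com/client/v4/graphql", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query: ZONE_HTTP_QUERY,
      variables: { zoneTag, since, until },
    }),
  });
  return res.json();
}
```

Grouping by edgeResponseStatus is what makes the 5xx breakdown above possible: sum the counts per status code and divide by total requests.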

From there, the analysis step classifies each site based on 5xx error rate:

  • Healthy: under 5% error rate
  • Degraded: 5-20%, something is wrong but the site is mostly functional
  • Down: over 20%, a large portion of requests failing
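The core of that analysis step is a few lines of threshold logic. A minimal sketch (thresholds from the list above; the function name is mine):

```typescript
type SiteStatus = "healthy" | "degraded" | "down";

// Classify a site by its 5xx error rate over the lookback window.
function classifySite(requests: number, errors5xx: number): SiteStatus {
  const rate = requests > 0 ? errors5xx / requests : 0;
  if (rate > 0.2) return "down";       // over 20% of requests failing
  if (rate >= 0.05) return "degraded"; // 5-20%: wrong but mostly functional
  return "healthy";                    // under 5%
}
```

With the numbers above, bdigitalmedia.io (13.2%) classifies as degraded and the other two sites as healthy.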

It formats a clean message with status icons, request counts, error breakdowns, and HTTP status code distributions. Designed to be scannable in a Telegram notification. Glance at it and know if you need to act.

Telegram Alerts (Only When It Matters)

This is the part I spent the most time thinking about. A monitor that sends a message every time it runs is a monitor you learn to ignore. The alert has to mean something.

Two modes:

Check mode (every 4 hours): Looks at the last 4 hours of traffic. If all sites are healthy, it stays completely silent. No message, no notification. You only hear from it when something is actually wrong.

Summary mode (daily at 7am): Always sends a full traffic report regardless of status. Total requests, error counts, status code breakdowns across all three sites. A morning briefing, a quick glance at how the sites performed over the last 24 hours.

Telegram integration is simple. A bot token, a chat ID, and a fetch call to the Bot API. What makes it useful is the suppression logic:

if (mode === "check" && !inputData.hasIssues) {
  console.log("All sites healthy, no alert sent.");
  return { sent: false };
}

Silence is the signal that everything is fine.
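The send itself is one HTTP call to the Bot API's sendMessage method. A sketch (the endpoint is from Telegram's Bot API; the helper shape is my assumption):

```typescript
// URL for the Bot API sendMessage method.
function telegramUrl(botToken: string): string {
  return `https://api.telegram.org/bot${botToken}/sendMessage`;
}

// Hypothetical helper: send an alert to a chat via the Bot API.
async function sendTelegramMessage(
  botToken: string,
  chatId: string,
  text: string
): Promise<boolean> {
  const res = await fetch(telegramUrl(botToken), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ chat_id: chatId, text }),
  });
  return res.ok;
}
```

The suppression check runs before this helper is ever called, so in check mode a healthy run never touches the network.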

It Found a Real Bug

First run with real data: scriptThrewException errors on all three sites. Main site had 66 exceptions in the last 24 hours.

I dug into the Cloudflare GraphQL data with status dimensions and found the root cause: the media proxy endpoint at /api/media/[...path] had an unguarded r2.sign() call in the presigned URL fallback path. When a bot hit the endpoint and the R2 signing failed for any reason (bad path, network hiccup, missing object), the Worker crashed instead of returning a 404.

Fix: a try/catch.

try {
  const signedRequest = await r2.sign(
    new Request(`${r2Endpoint}/bdigital-clips/${path}?X-Amz-Expires=300`),
    { aws: { signQuery: true } }
  );
  return Response.redirect(signedRequest.url, 302);
} catch {
  return new Response('Not found', { status: 404 });
}

Five lines. Zero new errors since the deploy.

A monitor that finds real bugs on day one has already paid for itself. Two hours to build, and it caught an issue that had been silently crashing in production. The uptime check said everything was fine. The error rate data told a different story.

Running It Locally

Everything runs locally with tsx:

npm run check      # 4h lookback, alerts only on errors
npm run summary    # 24h report, always sends to Telegram
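Behind those commands, the package.json scripts might look like this (the entry path and flag names are illustrative, not the actual repo layout):

```json
{
  "scripts": {
    "check": "tsx src/monitor.ts --mode=check",
    "summary": "tsx src/monitor.ts --mode=summary"
  }
}
```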

No cloud compute, no hosting costs. Mastra runs on Node.js, the Cloudflare API is free, and the Telegram Bot API is free. The only cost is the electricity to keep my Mac awake.

For scheduling, I use macOS launchd agents that fire the check every 4 hours and the summary every morning. If the Mac is asleep, the check runs when it wakes up. If I need persistent monitoring, the same workflow can deploy to Cloudflare Workers using Mastra’s @mastra/deployer-cloudflare package. For three personal sites, local is fine.
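A launchd agent for the 4-hour check might look like this (label, paths, and npm location are illustrative). It goes in ~/Library/LaunchAgents and loads with launchctl load:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.example.site-monitor.check</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/npm</string>
    <string>run</string>
    <string>check</string>
  </array>
  <key>WorkingDirectory</key>
  <string>/Users/me/site-monitor</string>
  <!-- Fire every 4 hours (14400 seconds) -->
  <key>StartInterval</key>
  <integer>14400</integer>
</dict>
</plist>
```

launchd's StartInterval behavior matches the "runs when it wakes up" note: missed intervals while asleep coalesce into one run on wake.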

What Comes Next

Right now the monitor tells me something is wrong. Next step: having it fix the problem.

Mastra supports agent steps: workflow steps that can reason about data, read code, and take action. When the monitor detects elevated errors, it could read the Cloudflare logs, identify the failing endpoint, grep the codebase for the relevant file, analyze the crash pattern, and either fix it directly or open a PR with the proposed fix.

I already have Claude Code running remotely and pushing to production from my phone. Connecting the monitor to an agent that can diagnose and patch production issues closes the loop. Detect, diagnose, fix, deploy, verify. All without human intervention.

That’s not built yet. But the foundation is here, and the workflow architecture supports it. Steps chain together. Each step has typed inputs and outputs. Adding a “diagnose” step and a “fix” step is just more .then() calls on the same workflow.

For now, the monitor runs every 4 hours, stays silent when things are healthy, and pings me on Telegram when they’re not. It found a real bug on day one. Good start.


The full monitor workflow is about 350 lines of TypeScript. If you’re building something similar with Mastra or want to talk about Cloudflare Workers monitoring, get in touch or find me on Instagram.