RAG as a Tool, Not a Pipeline: An On-Call Investigator on Mastra and Bedrock
It’s 2AM and a pager goes off. Your service is throwing connection errors and customers are starting to notice. Somewhere on a SharePoint, in a folder called Postmortems, three previous engineers documented exactly what happened the last three times this pattern hit. By the time you find one of them, scroll past the executive summary, and locate the part where the on-call team isolated root cause, ten minutes have passed. By the time you read all three, twenty.
What if a bot read those postmortems for you?
That bot is the project this post describes. It builds on top of an existing Mastra agent stack running on AWS, hosting all model inference through Amazon Bedrock. Postmortem documents live in S3 as .docx files. Retrieval lives in Aurora PostgreSQL with the pgvector extension. When a PagerDuty alert fires, a Temporal workflow kicks off an investigator agent that searches the right slices of the right past incidents at the right moments, then posts a hypothesis into Slack inside a minute.
What RAG Actually Solves
Retrieval-Augmented Generation is the technique of pulling relevant passages from a document store at query time and feeding them into a model’s context so the model can reason over them. Two gaps in conventional search drive its existence, and incident response surfaces both.
First, the semantic gap. Imagine your historical postmortems include three real incidents:
- RCA #142, titled “checkout-svc 504s during peak,” where the API exhausted a database connection pool.
- RCA #207, titled “Postgres ran out of available connections,” where pool saturation followed a deploy that doubled traffic.
- RCA #311, titled “transactions queueing in payments-db,” where max_connections was hit and writers timed out.
A new alert fires saying “checkout-service: too many open connections to RDS, p99 climbing.” If you reach for SQL, you might write something like this:
```sql
SELECT * FROM rcas WHERE description LIKE '%connection pool%';
```

That query matches RCA #142. It misses #207 and #311 entirely, even though all three describe the same incident family in different words. A senior on-call engineer would instantly see the connection. SQL has no concept of meaning and no way to know that “max_connections exhausted” is the same problem as “connection pool exhausted” written by a different author on a different team.
Embeddings close that gap. An embedding model converts text into a high-dimensional vector where semantic similarity becomes geometric proximity. When you embed both the alert text and each historical postmortem, similar incidents cluster together in vector space whether or not they share words. Retrieval becomes a nearest-neighbor search instead of a substring match.
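To make the contrast concrete, here is what the vector version of the earlier query looks like. This is a sketch against a hypothetical rcas table that already carries an embedding column (the real schema arrives later in the post), with $1 holding the embedded alert text:

```sql
-- Nearest-neighbor instead of substring match. All three RCAs rank highly
-- because their embeddings sit near the alert's embedding, whether or not
-- they share any wording with it.
SELECT id, title, 1 - (embedding <=> $1::vector) AS similarity
FROM rcas
ORDER BY embedding <=> $1::vector
LIMIT 3;
```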
Second, the structural gap. Postmortems contain different kinds of content for different investigative questions. A summary answers “is this familiar?”. A timeline answers “what was the progression?”. Action items answer “what did we change to prevent this?”. If you mush all that together into one searchable blob, retrieval gives you whichever section type dominates the corpus regardless of intent.
Why Not Just Long Context?
Frontier reasoning models now ship with context windows in the millions of tokens, and every major provider exposes file-search primitives that wrap the chunk-and-embed pattern behind a single API call. A reasonable reader is asking why any of the architecture below is necessary when you could load the whole postmortem corpus into the model and let it figure things out.
For corpora under roughly 500K tokens the objection wins. Build the simplest thing: stuff everything into a cached system prompt and let the model retrieve internally. Six months from now, if cost or latency forces a redesign, do that then.
Three conditions push you across the threshold and into a retrieval-shaped architecture.
Scale forces the issue first. A corpus of eight hundred postmortems across five years runs into the tens of millions of tokens, well past what fits in any model’s context regardless of cost. Even when it does fit, sending megabytes of input on every alert at SEV2 cadence is an unforgivable bill.
Latency tightens the squeeze. This investigator promises a Slack post within sixty seconds of an alert firing, and a long-context call against a multi-megabyte payload is measured in tens of seconds before the first token streams. K-NN search against pgvector is measured in milliseconds.
And structured filtering is something long context cannot do at all. Half the value in the trace shown later in this post comes from a query that asks specifically for action items whose status is still open. “Show me only the parts where status equals open” is not a question you can put to a long-context model; your only option is to have it re-read everything and hope it notices.
If your corpus fits in context, your latency budget is generous, and your queries are unstructured, build the simpler thing first and revisit when the symptoms appear.
Retrieval-as-tool survives even when retrieval-as-preprocessing dies. An agent that calls search during reasoning, with parameters tuned to the current investigative phase, beats one handed a bag of context before it starts thinking. That holds whether the search function reads from pgvector, a managed file-search API, or a long-context model acting as a retrieval oracle.
Two Paths, Not One
Most failed RAG implementations are failures of architecture, not of model quality, and they conflate two paths that should stay separate.
The two paths are the hot path (query-time retrieval while an investigation is running) and the cold path (ingestion of new or edited documents into the index). They share exactly one thing: the chunks table in Aurora. Everything else is independent. The hot path runs hundreds of times a day at sub-second latency. The cold path runs whenever a .docx is added or edited, takes seconds per document (longer when the LLM fallback fires), and tolerates retries.
Retrieval is a tool the agent calls, not a preprocessing step that happens before the agent runs. That distinction is the difference between a static RAG system and an investigative one.
From .docx to Chunks: Hybrid Extraction
Real-world postmortems do not arrive as clean JSON. They are Word documents authored by dozens of engineers across years of template revisions, with inconsistent headings, embedded tables, free-form prose, and occasional one-off formats.
Two pure approaches plus one hybrid, each with its own trade-offs.
Rule-based extraction as a cost optimization
A deterministic parser reads the .docx with mammoth, walks the heading structure, and matches each heading against a synonym table:
```typescript
const SECTION_ALIASES: Record<string, RegExp> = {
  summary: /^(summary|tl;?dr|overview|executive summary|incident summary)$/i,
  timeline: /^(timeline|sequence of events|event log|chronology)$/i,
  root_cause: /^(root cause|what went wrong|cause analysis|cause)$/i,
  resolution: /^(resolution|mitigation|how we fixed it|fix|what worked)$/i,
  action_items: /^(action items|follow.?ups|next steps|remediation|todos)$/i,
};
```

Roughly 50ms per document, deterministic, runs offline. Against a corpus written on one strict template it works beautifully. Against eight hundred postmortems written across five years by thirty engineers using four template revisions, it silently fails on roughly one document in three. Headings get formatted with bold and a font size instead of a Heading 1 style; timelines hide inside tables whose cells are not paragraph-styled (so heading-walk extractors miss them entirely); senior engineers occasionally write a five-paragraph essay with no headings at all. When extraction misses a section nothing fires; that document contributes nothing to retrieval, and you find out months later when an investigation comes up empty.
Today’s small Bedrock models charge fractions of a cent per document at 2026 rates, so do not invest heavily in extending the parser. Treat it as a cheap pre-filter for the easy 70% and route the rest to the LLM path below.
Pure LLM extraction
LLM extraction sends the structure-preserving HTML (so headings remain visible to the model) to a fast extraction-tier model on Bedrock and asks for a structured response. With Bedrock’s tool-use API, this is shockingly clean:
```typescript
import { BedrockRuntimeClient, ConverseCommand } from "@aws-sdk/client-bedrock-runtime";

const bedrock = new BedrockRuntimeClient({ region: "us-east-1" });

// Pick a fast structured-extraction model from your AWS region's Bedrock
// catalog. Model IDs rotate fast; pin to env so you can roll forward
// without touching code, and stamp embed_model_version on every chunk.
const EXTRACTION_MODEL = process.env.BEDROCK_EXTRACTION_MODEL!;

const SYSTEM_PROMPT = `You extract structured data from incident postmortems.
The input is HTML produced by mammoth from a Word document; headings are
preserved. Use the record_rca_sections tool. Return null for any section
not present. Do not invent content. If unclear, return null.`;

const EXTRACTION_TOOL = {
  name: "record_rca_sections",
  description: "Records the structured sections extracted from an RCA document.",
  inputSchema: {
    json: {
      type: "object",
      properties: {
        summary: { type: ["string", "null"] },
        timeline: {
          type: ["array", "null"],
          items: {
            type: "object",
            properties: {
              timestamp: { type: "string" },
              event: { type: "string" },
            },
            required: ["timestamp", "event"],
          },
        },
        root_cause: { type: ["string", "null"] },
        resolution: { type: ["string", "null"] },
        action_items: {
          type: ["array", "null"],
          items: {
            type: "object",
            properties: {
              text: { type: "string" },
              status: { enum: ["open", "completed", "unknown"] },
              owner: { type: ["string", "null"] },
            },
            required: ["text", "status"],
          },
        },
      },
    },
  },
};

async function extractWithLlm(docHtml: string) {
  const response = await bedrock.send(new ConverseCommand({
    modelId: EXTRACTION_MODEL,
    system: [{ text: SYSTEM_PROMPT, cachePoint: { type: "default" } as any }],
    messages: [{ role: "user", content: [{ text: docHtml }] }],
    toolConfig: {
      tools: [{ toolSpec: EXTRACTION_TOOL }],
      toolChoice: { tool: { name: "record_rca_sections" } },
    },
  }));
  const toolUse = response.output?.message?.content?.find((c) => "toolUse" in c);
  return toolUse?.toolUse?.input;
}
```

Three production knobs hide in there. First, toolChoice forces the model to produce JSON matching the schema exactly, so prose-instead-of-JSON failures vanish. Second, a fast extraction-tier model is the right call here, not a frontier reasoning model. Extraction is a structured task, not a reasoning one; the cheap tier is fast and accurate enough, and the reasoning tier runs several multiples per token for marginal quality gain on this workload. Third, prompt caching on the system prompt cuts cost dramatically when you reindex many documents in a single run, since the schema and instructions are identical across every call.
A back-of-envelope: an 800-document corpus at ~6K tokens each is ~4.8M input tokens. With prompt caching on the schema, an extraction reindex on a fast Bedrock model lands in the single dollars at 2026 rates. A Bedrock embedding pass adds another rounding error. A whole reindex costs less than a coffee, on-demand, and there is no shape of corpus where extraction cost should drive your architecture.
LLM extraction is robust against template drift, synonyms, and free-form prose. Its dangerous failure mode is hallucination: a model fills in an action item it inferred should be there, which surfaces months later in an investigation as if it were real. Mitigate with spot-check evals against a ground-truth set and aggressive instructions to return null rather than guess.
Hybrid extraction
Run the rule-based parser first. If it returns at least three of the five expected sections, accept it; otherwise fall back to LLM extraction. Log per-document which path was used and how many sections were detected, so you can tune both halves on real data.
```typescript
// extractRcaSections: walks mammoth's HTML output and groups content under
// each <h1>/<h2> whose text matches SECTION_ALIASES. logExtraction and
// countSections are trivial helpers (one INSERT and one Object.values
// length, respectively).
async function extractSections(docxPath: string) {
  const ruleBased = await extractRcaSections(docxPath);
  const detectedCount = Object.keys(ruleBased).length;

  if (detectedCount >= 3) {
    await logExtraction({ method: "rules", count: detectedCount, path: docxPath });
    return ruleBased;
  }

  const { value: docHtml } = await mammoth.convertToHtml({ path: docxPath });
  const llmResult = await extractWithLlm(docHtml);
  await logExtraction({ method: "llm", count: countSections(llmResult), path: docxPath });
  return llmResult;
}
```

Routing depends on the corpus. Strictly templated stores see most documents on the deterministic path; older stores with mixed templates can drop the deterministic share well below half. That ratio is one of the most useful per-document log fields you can capture early, since it tells you whether the rule-based parser deserves more synonyms or whether you should retire it in favor of the LLM path entirely.
Chunking by Section, Not by Document
Once a postmortem has been parsed into sections, you face a critical choice that 90% of RAG tutorials get wrong. Do you embed the whole document and return it in full at query time, or do you embed each section separately and return only what is relevant?
Imagine your investigator agent is two steps into a real incident. Initial retrieval already established that the current alert matches the connection-pool-exhaustion family. Now the agent wants to ask a different question: “What did we actually change last time to prevent recurrence?”
That question embeds into a region of vector space populated by solution-shaped language: install pgbouncer, add connection_count alert, scale RDS instance. If your index contains all sections mixed together, here is what comes back when you query against it:
| Chunk type | Similarity to query | Why |
|---|---|---|
| Summary chunks (like checkout-svc DB pool exhausted) | High | Shared vocabulary with the alert text and the agent’s earlier searches |
| Root cause chunks (like max_connections exceeded after traffic 2x) | Medium | Related concepts |
| Action item chunks (like install pgbouncer, add connection_count alerts) | Low | Solution-shaped vocabulary, totally different conceptual register |
Your agent gets back more of what it already has (symptom descriptions) and almost no action items. It reports “yes, we have seen this” and stops, completely missing the part where last time the team committed to installing pgbouncer.
Per-section chunking solves this by giving each section type its own region of the index. Implementation does not require literal separate vector stores; a single chunks table with metadata columns works fine:
```sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE TABLE chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  rca_id TEXT NOT NULL,
  section TEXT NOT NULL,              -- summary | timeline | root_cause | ...
  text TEXT NOT NULL,
  embedding VECTOR(1024) NOT NULL,
  embed_model_version TEXT NOT NULL,  -- whichever embedding model produced this row
  status TEXT,                        -- for action_items: open | completed
  owner TEXT,
  occurred_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX chunks_embedding_idx ON chunks USING hnsw (embedding vector_cosine_ops);

CREATE INDEX chunks_section_idx ON chunks (section, embed_model_version);
```

Two columns earn special mention: embed_model_version and status. embed_model_version is the day-zero footgun in any RAG system: the day a new embedding model lands in your stack (or you swap providers entirely), every chunk in the table is in a different vector space than incoming queries and recall silently collapses. Stamp the model version on each chunk, filter every query by the current version, and dual-write during model upgrades until shadow evals show parity. A chunks table without this column is one release note away from an outage. status is what turns “action items that are still open” into a structured filter rather than a hope, and it powers the most valuable line in the Slack message later in this post.
Chunk size matters too. Aim for 200 to 500 tokens per chunk; sections over 800 tokens get a sliding-window split with 50-token overlap; reject any chunk over your embedding model’s input token limit (commonly around 8000) before embedding. A “timeline” section that grew to 3000 tokens because someone narrated a six-hour incident in prose has to be split, or it dilutes its own embedding into uselessness.
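A minimal sketch of that sliding-window split, treating whitespace-delimited words as a rough stand-in for tokens (swap in a real tokenizer if you need exact counts):

```typescript
// Split an oversized section into ~maxTokens-word windows that overlap by
// `overlap` words, so no sentence is stranded at a chunk boundary.
function slidingWindowSplit(text: string, maxTokens = 500, overlap = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  if (words.length <= maxTokens) return [text];
  const chunks: string[] = [];
  const step = maxTokens - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxTokens).join(" "));
    if (start + maxTokens >= words.length) break;
  }
  return chunks;
}
```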
A query that asks specifically for action items now looks like this:
```sql
SET LOCAL hnsw.ef_search = 40;

SELECT rca_id, text, status, owner,
       1 - (embedding <=> $1::vector) AS similarity
FROM chunks
WHERE section = ANY($2::text[])
  AND embed_model_version = $3
ORDER BY embedding <=> $1::vector
LIMIT $4;
```

Two parameters drive the search shape: $2 is the array of section names (['action_items'] for action-items-only, ['summary', 'root_cause'] for triage), and $4 is top-K. That WHERE filter is post-filtering over the HNSW graph traversal in pgvector’s default mode, so highly selective filters can return fewer than LIMIT rows; turn on hnsw.iterative_scan = relaxed_order (pgvector 0.8+) if you see that. With ef_search=40 the lookup runs in single-digit milliseconds even at tens of thousands of chunks, and that knob trades latency for recall directly.
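Concretely, the escape hatch is two more per-transaction settings alongside ef_search; a sketch, with max_scan_tuples capping how far the relaxed scan is allowed to walk:

```sql
-- pgvector 0.8+: keep walking the HNSW graph until enough rows survive the
-- section filter, instead of stopping after the first ef_search candidates.
SET LOCAL hnsw.iterative_scan = relaxed_order;
SET LOCAL hnsw.max_scan_tuples = 20000;
```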
RAG as a Tool, Not a Step
A senior on-call engineer’s investigation moves through phases, and each phase asks a different question of the same corpus.
| Phase | Question | Sections retrieved |
|---|---|---|
| Triage | “Have we seen this? Is it serious?” | summary + root_cause snippet |
| Diagnose | “What is broken? Match symptoms to root cause.” | timeline + root_cause (full) |
| Mitigate | “Stop the bleeding fast.” | resolution + timeline (mitigation portion) |
| Hand off | “Brief the human; suggest hypotheses.” | action_items, filtered by status=open |
A naive RAG implementation calls retrieval once, dumps the result into the agent’s context, and lets the model figure things out. An investigative implementation calls retrieval several times during one investigation, with different queries shaped by what previous retrievals revealed and which phase the agent is in.
Defining the retrieval tool for a Mastra agent is straightforward:
```typescript
import { createTool } from "@mastra/core";
import { z } from "zod";

export const searchPastIncidents = createTool({
  id: "searchPastIncidents",
  description: `Search the corpus of past incident postmortems by semantic
similarity. Use the 'sections' parameter to constrain which kinds of content
to retrieve (summary/root_cause for triage, timeline for diagnosis,
resolution for mitigation, action_items for prevention). Returns up to topK
chunks ranked by similarity.`,
  inputSchema: z.object({
    query: z.string().describe("natural-language search text"),
    sections: z.array(z.enum([
      "summary", "timeline", "root_cause", "resolution", "action_items",
    ])).default(["summary", "root_cause"]),
    topK: z.number().int().min(1).max(20).default(5),
  }),
  outputSchema: z.object({
    chunks: z.array(z.object({
      rca_id: z.string(),
      section: z.string(),
      text: z.string(),
      similarity: z.number(),
      status: z.string().nullable(),
      owner: z.string().nullable(),
    })),
  }),
  execute: async ({ context }) => {
    const embedding = await embedQuery(context.query);
    const chunks = await pgvectorSearch({
      embedding,
      sections: context.sections,
      limit: context.topK,
    });
    return { chunks };
  },
});
```

Two things matter about this tool definition. First, the description does heavy lifting: it tells the model when to use which sections, mapping investigative intent to retrieval parameters. A vague description forces guessing; a precise one teaches steering. Second, the inputs give the agent that steering wheel rather than a search box, so it can override sections and topK based on what it is trying to learn. Bound the agent’s tool loop with maxSteps: 8 and add a per-call statement_timeout = 800ms on the SQL side so a slow pgvector query cannot eat the 60-second SLO; on a second searchPastIncidents failure the agent should fall back to posting a degraded Slack message (“retrieval unavailable; here are raw alert + Datadog metrics”) rather than stall silently.
Two helpers are referenced and not shown. embedQuery is a one-call wrapper around bedrock.invokeModel against your chosen embedding model returning the embedding array. pgvectorSearch is a parameterized version of the SQL block from the previous section that takes (embedding, sections[], embedModelVersion, limit) and returns rows. Both are 20-line glue files; the architectural commitment is in the schema and the agent loop, not in those wrappers. One version-sensitive caveat: Mastra’s execute signature has shifted across versions ({ context } is recent-but-not-eternal). If your installed Mastra version disagrees, swap to whatever shape its docs show.
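For readers who want the shape anyway, here is a minimal sketch of both wrappers, assuming Titan Text Embeddings V2 (amazon.titan-embed-text-v2:0 at 1024 dimensions, matching the VECTOR(1024) column) and node-postgres; the model ID and env var names are stand-ins for whatever your stack pins:

```typescript
import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";
import { Pool } from "pg";

const bedrockRuntime = new BedrockRuntimeClient({ region: "us-east-1" });
const pool = new Pool(); // connection details come from PG* env vars

// Embed a query string. Titan V2 takes { inputText, dimensions } and
// returns { embedding: number[] } in the response body.
export async function embedQuery(query: string): Promise<number[]> {
  const res = await bedrockRuntime.send(new InvokeModelCommand({
    modelId: process.env.BEDROCK_EMBED_MODEL ?? "amazon.titan-embed-text-v2:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify({ inputText: query, dimensions: 1024 }),
  }));
  return JSON.parse(new TextDecoder().decode(res.body)).embedding;
}

// Parameterized version of the SQL from the chunking section. The model
// version defaults to whatever the current index was built with.
export async function pgvectorSearch(opts: {
  embedding: number[];
  sections: string[];
  limit: number;
  embedModelVersion?: string;
}) {
  const version = opts.embedModelVersion ?? process.env.EMBED_MODEL_VERSION!;
  const { rows } = await pool.query(
    `SELECT rca_id, section, text, status, owner,
            1 - (embedding <=> $1::vector) AS similarity
       FROM chunks
      WHERE section = ANY($2::text[]) AND embed_model_version = $3
      ORDER BY embedding <=> $1::vector
      LIMIT $4`,
    [JSON.stringify(opts.embedding), opts.sections, version, opts.limit],
  );
  return rows;
}
```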
A Trace from Alert to Slack
Here is what an actual run looks like end to end when a connection-pool alert comes in. In the trace, ▶ marks internal reasoning and →/← marks a tool call with its result.
```
═════════════════════════════════════════════════════════════════
ALERT (PagerDuty webhook → Mastra agent)
  service: checkout-svc
  description: "too many open connections to RDS, p99 climbing"
  severity: SEV2
  fired_at: 2026-04-25T02:14:08Z
═════════════════════════════════════════════════════════════════

▶ Phase = TRIAGE. Have we seen this pattern before?

→ searchPastIncidents({
    query: "checkout-svc connections RDS p99 climbing",
    sections: ["summary", "root_cause"],
    topK: 5,
  })
← 3 strong matches:
    RCA #142 (sim 0.89): "checkout-svc 504s during peak"
    RCA #207 (sim 0.86): "Postgres ran out of available connections"
    RCA #311 (sim 0.81): "transactions queueing in payments-db"

▶ Strong pattern match: pool-exhaustion family. Phase = DIAGNOSE.
  Compare current observations to past escalation timelines.

→ callTool("datadog.metrics", {
    query: "checkout-svc.db.connection_count, last 30m",
  })
← connection_count: 80 → 240 over the last 12 minutes, still climbing.

→ searchPastIncidents({
    query: "connection count climb pattern before pool exhaustion",
    sections: ["timeline", "root_cause"],
    topK: 3,
  })
← RCA #207 timeline: "p99 climbed at T-8min, conn count climbed at T-4min,
  pool exhausted at T-0. Cause: deploy doubled QPS without scaling DB
  connections."

▶ Pattern matches RCA #207 closely. Phase = MITIGATE. What worked last time?

→ searchPastIncidents({
    query: "fastest mitigation for connection pool saturation",
    sections: ["resolution", "timeline"],
    topK: 3,
  })
← RCA #207 resolution: "Rolling restart of checkout-svc API workers cleared
  the pool in 6 min. Long-term fix: install pgbouncer."

▶ Have a mitigation. Phase = HAND OFF. Check open action items.
  Is this recurrence preventable?

→ searchPastIncidents({
    query: "pgbouncer connection pool prevention",
    sections: ["action_items"],
    topK: 5,
  })
← RCA #207 action_items: "Install pgbouncer (status: OPEN, owner: jcarr)"
  RCA #311 action_items: "Add connection_count alert (status: COMPLETED)"

▶ Open action item from RCA #207 strongly correlates with this incident.
  Brief the human.

→ postSlack(...)

═════════════════════════════════════════════════════════════════
SLACK MESSAGE (auto-posted at 2026-04-25T02:14:53Z, 45 seconds after alert)

🤖 Investigator update: checkout-svc pool exhaustion (SEV2)

Pattern match: RCA #207 (Postgres ran out of available connections after a
deploy doubled QPS). Connection-count climb in this incident matches that
timeline almost exactly: p99 first, conn count after.

Suggested mitigation: rolling restart of checkout-svc API workers. In
RCA #207 this cleared the pool in 6 minutes.

⚠️ Action item from RCA #207 to install pgbouncer is still OPEN (owner:
jcarr). This recurrence is likely a direct consequence.

→ Past incidents: #142, #207, #311
→ Datadog: connection_count 80 → 240 over 12m
═════════════════════════════════════════════════════════════════
```

Retrieval was called four times in one investigation, with four different queries, against four different section combinations, each shaped by what the previous step revealed. That iterative loop is invisible to anyone who treats RAG as a preprocessing step and feeds the agent a single bag of chunks.
Most of the value in the final Slack message lives in the warning line about the open pgbouncer action item. That insight requires three things working together: per-section indexing so action items are retrievable separately, persistent action-item state so open is a real searchable signal, and an agent willing to make a connection between past unfinished work and the current incident. None of that emerges from naive RAG.
Ingestion as a Temporal Workflow
Every component above presumes an ingestion pipeline has populated the chunks table. Building that pipeline as a Temporal workflow rather than a one-shot script earns you retries, idempotency, observability, and graceful resume after failure, all of which matter once you have a few hundred documents.
```typescript
import { proxyActivities } from "@temporalio/workflow";
import type * as activities from "./activities";

const {
  extractText,
  detectSections,
  llmFallback,
  normalize,
  embedChunks,
  upsertChunks,
} = proxyActivities<typeof activities>({
  startToCloseTimeout: "5 minutes",
  retry: {
    maximumAttempts: 8,
    initialInterval: "2s",
    backoffCoefficient: 2,
    maximumInterval: "60s",
    nonRetryableErrorTypes: ["ValidationException", "CorruptDocxError"],
  },
});

export async function rcaIngestion(s3Key: string, s3VersionId: string): Promise<void> {
  const html = await extractText(s3Key);
  let sections = await detectSections(html);

  if (Object.keys(sections).length < 3) {
    sections = await llmFallback(html);
  }

  const normalized = await normalize(sections, s3Key);
  const embedded = await embedChunks(normalized);
  await upsertChunks(embedded, s3VersionId);
}
```

Workflow code stays intentionally bare. Each activity is the unit of retry and the unit of observability. If embedChunks hits a Bedrock ThrottlingException, Temporal retries with exponential backoff without re-running the cheap extraction step that came before it. If the worker process dies mid-ingestion, the workflow resumes from the last completed activity.
Three operational details turn this from a happy-path sketch into something you can actually leave running. Idempotency: derive the workflow ID from sha256(s3Key + s3VersionId) so re-delivering an SQS message starts the same workflow rather than a duplicate, and have upsertChunks do a DELETE WHERE rca_id=$1 AND embed_model_version=$2 then bulk insert inside one transaction. Throttle survival: Bedrock embedding models on-demand cap in the low thousands of RPM in us-east-1, so batch InvokeModel calls at 25 inputs each and front the activity with a token-bucket limiter. Poison messages: configure the SQS queue with RedrivePolicy: { maxReceiveCount: 5 } to a rca-ingest-dlq, and alarm on ApproximateNumberOfMessagesVisible > 0 so a corrupt .docx does not vanish silently.
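A minimal sketch of the replace-then-insert upsert, assuming node-postgres and the chunks schema above; the shape of the embedded payload is illustrative:

```typescript
import { Pool } from "pg";

const pool = new Pool();

interface EmbeddedDoc {
  rcaId: string;
  embedModelVersion: string;
  rows: { section: string; text: string; embedding: number[]; status?: string; owner?: string }[];
}

// Delete-then-insert inside one transaction so a retried activity run can
// never leave duplicate or half-written chunks behind. s3VersionId is kept
// in the signature to match the workflow; stamp it on a provenance column
// if you add one.
export async function upsertChunks(doc: EmbeddedDoc, s3VersionId: string): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query(
      "DELETE FROM chunks WHERE rca_id = $1 AND embed_model_version = $2",
      [doc.rcaId, doc.embedModelVersion],
    );
    for (const r of doc.rows) {
      await client.query(
        `INSERT INTO chunks (rca_id, section, text, embedding, embed_model_version, status, owner)
         VALUES ($1, $2, $3, $4::vector, $5, $6, $7)`,
        [doc.rcaId, r.section, r.text, JSON.stringify(r.embedding),
         doc.embedModelVersion, r.status ?? null, r.owner ?? null],
      );
    }
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```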
Trigger from S3 with an event notification on postmortems/*.docx to SQS, with a Temporal client consuming the queue and starting one workflow per file.
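A sketch of that consumer, assuming the Temporal TypeScript client and the standard S3 event notification payload in the SQS message body; the queue URL and task-queue name are placeholders:

```typescript
import { createHash } from "node:crypto";
import { Client, Connection } from "@temporalio/client";
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "us-east-1" });
const queueUrl = process.env.RCA_INGEST_QUEUE_URL!;

export async function consumeIngestQueue() {
  const temporal = new Client({ connection: await Connection.connect() });
  for (;;) {
    const { Messages = [] } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: queueUrl, WaitTimeSeconds: 20, MaxNumberOfMessages: 10,
    }));
    for (const msg of Messages) {
      const record = JSON.parse(msg.Body!).Records[0];
      const s3Key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
      const versionId = record.s3.object.versionId;
      // Deterministic workflow ID: a redelivered SQS message maps to the
      // same workflow run instead of starting a duplicate ingestion.
      const workflowId = "rca-ingest-" +
        createHash("sha256").update(s3Key + versionId).digest("hex").slice(0, 32);
      await temporal.workflow.start("rcaIngestion", {
        taskQueue: "rca-ingestion",
        workflowId,
        args: [s3Key, versionId],
      }).catch((err) => {
        // An already-started workflow is the idempotency working as intended.
        if (err.name !== "WorkflowExecutionAlreadyStartedError") throw err;
      });
      await sqs.send(new DeleteMessageCommand({
        QueueUrl: queueUrl, ReceiptHandle: msg.ReceiptHandle!,
      }));
    }
  }
}
```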
Why Not Bedrock Knowledge Bases?
AWS offers Bedrock Knowledge Bases, a managed service that bundles ingestion, chunking, embedding, and retrieval into a single configuration. It works, it is fast to set up, and for many use cases it is the right call.
It is not the right call here, and the reason matters. Knowledge Bases supports four chunking modes out of the box (fixed-size, semantic, hierarchical, and no-chunking) plus a custom-Lambda hook. None of those built-ins enforces per-section chunking with first-class metadata you can filter at query time. Your {section: "action_items", status: "open"} filter, which powers the most valuable insight in the Slack message above, is not a managed concept; it has to live inside the custom-Lambda escape hatch, at which point you have rebuilt the cold path inside someone else’s runtime.
For a generic FAQ bot, lean on Knowledge Bases. For an investigator pulling from structurally meaningful sections of expert-authored documents, build your own ingestion path against pgvector. You pay a few hundred lines of TypeScript and a Temporal workflow, and earn full control over what your retrieval can express.
Evaluating an Investigator
Building this system without an evaluation harness is malpractice. Once you have a corpus of past investigations and Slack messages, you have ground truth for two questions worth measuring continuously.
First, retrieval quality: precision@k against a held-out set. Building that set sounds expensive (“fifty alerts where you know the relevant historical RCAs by hand”) but the labels are sitting in your Slack history. Mine six months of resolved incident channels for messages where a human posted an RCA-\d+ link, and you have a free (alert_text, ground_truth_rca_id) corpus. Hold out 20%, sweep topK over {3, 5, 10, 20}, plot recall against latency, pick the knee. Re-label only when precision@5 drops more than five points week-over-week, since chasing every fluctuation will waste your weekend.
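A sketch of the measurement loop, reusing the embedQuery and pgvectorSearch helpers sketched earlier; with one labeled RCA per alert, this hit-rate doubles as recall@k:

```typescript
declare const heldOutLabels: { alertText: string; rcaId: string }[]; // the held-out 20%

// For each labeled alert, retrieve top-k summary/root_cause chunks and check
// whether the known RCA shows up among them.
async function precisionAtK(
  labels: { alertText: string; rcaId: string }[],
  k: number,
): Promise<number> {
  let hits = 0;
  for (const { alertText, rcaId } of labels) {
    const embedding = await embedQuery(alertText);
    const chunks = await pgvectorSearch({
      embedding,
      sections: ["summary", "root_cause"],
      limit: k,
    });
    if (chunks.some((c: { rca_id: string }) => c.rca_id === rcaId)) hits++;
  }
  return hits / labels.length;
}

// Sweep topK to find the knee described above.
for (const k of [3, 5, 10, 20]) {
  console.log(k, await precisionAtK(heldOutLabels, k));
}
```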
Second, end-to-end utility: did the investigator’s hypothesis match the actual root cause once the human resolved the incident? This is harder to automate but easier to check on a small sample. Pull ten resolved incidents: if the bot’s top hypothesis matched the resolved root cause in nearly all of them, the system is delivering value. Two out of ten means it is not yet, and the failure mode is more likely retrieval than reasoning.
Logging every tool call (query, sections, top-K, chunk IDs returned, similarity scores) into a separate table is the cheapest investment you will make. Six months of those traces is a goldmine for tuning. They tell you which sections the agent is asking for too often, which it ignores, and which kinds of queries return empty results. All of that is invisible without the log.
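A sketch of that log table, one row per tool call; column names are illustrative rather than prescriptive:

```sql
CREATE TABLE retrieval_log (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  investigation_id TEXT NOT NULL,      -- e.g. the PagerDuty incident ID
  query TEXT NOT NULL,
  sections TEXT[] NOT NULL,
  top_k INT NOT NULL,
  returned_chunk_ids UUID[] NOT NULL,
  similarities REAL[] NOT NULL,
  called_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX retrieval_log_investigation_idx ON retrieval_log (investigation_id, called_at);
```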
What This Post Did Not Cover
Reranking. A small reranker model after retrieval can dramatically improve precision@k by re-scoring the top-50 results against the query with cross-attention rather than dot-product similarity. Cohere Rerank, Voyage AI’s reranker, or a self-hosted bge-reranker all work, at the cost of one extra hop per retrieval call.
Hybrid search. Embeddings excel at semantic similarity but struggle with exact term matches (a service name, an error code, a specific commit hash). Combining BM25 with vector search and merging the result lists (reciprocal rank fusion is the standard) gives you the best of both. Aurora can run ts_rank and <=> ordering in a single SQL statement; rank fusion itself is hand-rolled with CTEs and ROW_NUMBER() rather than a built-in operator.
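A sketch of what that hand-rolled fusion can look like in one statement, assuming a tsvector column text_tsv on chunks and the conventional RRF constant of 60 ($1 is the query embedding, $2 the query text, $3 the current embedding-model version):

```sql
WITH vec AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rnk
  FROM chunks
  WHERE embed_model_version = $3
  ORDER BY embedding <=> $1::vector
  LIMIT 50
),
lex AS (
  SELECT id, ROW_NUMBER() OVER (
           ORDER BY ts_rank(text_tsv, plainto_tsquery('english', $2)) DESC
         ) AS rnk
  FROM chunks
  WHERE text_tsv @@ plainto_tsquery('english', $2)
  ORDER BY ts_rank(text_tsv, plainto_tsquery('english', $2)) DESC
  LIMIT 50
)
SELECT c.rca_id, c.section, c.text,
       COALESCE(1.0 / (60 + vec.rnk), 0) + COALESCE(1.0 / (60 + lex.rnk), 0) AS rrf_score
FROM chunks c
LEFT JOIN vec ON vec.id = c.id
LEFT JOIN lex ON lex.id = c.id
WHERE vec.id IS NOT NULL OR lex.id IS NOT NULL
ORDER BY rrf_score DESC
LIMIT 10;
```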
Guardrails. Postmortems sometimes contain customer data, credentials accidentally pasted into a comment, or sensitive incident details. A redaction pass during ingestion, plus output filtering on the agent’s Slack messages, is required before any production deployment.
Multi-tenant separation. If your platform serves multiple teams or customers, every chunk needs a tenant ID and every query needs a corresponding filter. Aurora pgvector handles this with a WHERE tenant_id = $1 clause; the discipline is making sure no query path can omit it.
Observability. OpenTelemetry traces from Temporal workflow → Mastra agent → Bedrock calls → pgvector queries give you a single end-to-end view of any investigation. Without it, debugging “why did the agent miss this RCA?” is guesswork.
Action-item state lives in Jira, not the docx. The “OPEN pgbouncer” callout above depends on status being current, but ingestion only refreshes when someone re-edits the postmortem. Wire a Jira or Linear webhook into a Temporal signal that updates an action_item_status table independently, and join at query time. Without that, the bot will warn on a closed item forever.
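A sketch of that query-time join, assuming a hypothetical action_item_status table keyed by chunk ID and kept current by the webhook signal:

```sql
-- Prefer the live Jira-sourced status over the snapshot captured at ingestion.
SELECT c.rca_id, c.text,
       COALESCE(s.status, c.status) AS status,
       COALESCE(s.owner, c.owner) AS owner
FROM chunks c
LEFT JOIN action_item_status s ON s.chunk_id = c.id
WHERE c.section = 'action_items'
  AND COALESCE(s.status, c.status) = 'open';
```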
Drift detection. A daily Temporal cron that runs SELECT section, embed_model_version, COUNT(*) FROM chunks GROUP BY 1, 2 and alarms if any section is empty or more than one model version coexists catches schema drift, partial reindexes, and corpus rot before retrieval gets weird.
What’s Next
Implementation work happens in a separate post. If you want the running version, three pieces of advice:
Start with the cold path. An ingestion pipeline that produces a clean chunks table is the foundation, and you can validate it independently of any agent by running plain SQL queries against pgvector and checking the results match your intuition.
Build the retrieval tool before the full agent. A standalone tool that takes a query and returns ranked chunks is testable in isolation; once it works, plugging it into a Mastra agent is mechanical.
Wire up logging from day one. Every tool call, every chunk returned, every Slack message sent. You will not regret the trace data; you will regret not having it the first time the bot misses an incident family.
Everything else is iteration. Watch the traces, fix the misses, expand the corpus. Six months in, your investigator will look back at six months of its own runs and tell you exactly which kinds of incidents it handles well and which it does not.