Lead Site Reliability Engineer / AI Platform Engineer
Feb 2022 — Present · Remote
- Designed and deployed a multi-agent system for 24/7 deployment support automation, built on Temporal for durable execution and Mastra for agent and workflow definition. Seven specialized agents orchestrate parallel investigation with verification-first RCA (live log search and code tracing across pipeline and customer configuration before any diagnostic narrative), confidence-gated live posting, and in-thread human feedback that triggers self-analysis runs. Consumes the company's AWS Bedrock-backed inference platform for the Anthropic Claude family (Opus, Sonnet, Haiku), with per-agent model selection and a deterministic guardrail layer that prevents agents from proposing workarounds to existing operational policies (guardrail and confidence gate sketched after this list). Handles thousands of investigations daily
- Designed an evaluation methodology for AI-generated code review at enterprise scale, enabling narrow-scope review skills (security, compliance, performance, maintainability, operational best practices) to be calibrated against thousands of real production pull requests before reaching developers (calibration loop sketched after this list). Retrospective evaluation against historical incidents demonstrated material preventive value across multiple categories of customer-impacting events. Now scaling from a test team to org-wide deployment
- Reduced the team's debugging time for operational issues from minutes to seconds by building a collection of reusable Claude Code skills encapsulating ~9,800 lines of codified operational knowledge (error registries, diagnostic strategies, troubleshooting workflows) previously locked in Slack threads and tribal knowledge
- Engineered a three-layer token optimization system reducing per-investigation LLM cost by 44-57% (~62K to ~30K tokens). Built a deterministic confidence gate that skips redundant agent calls on ~70% of investigations without quality loss, pipeline-aware config injection that eliminates ~20K tokens of irrelevant domain knowledge per call, and MCP tool result transformers that parse and compress raw log and code search payloads by 10-15x (transformer sketched after this list). Implemented per-run token observability with cost attribution, working around the LLM gateway's lack of a usage API
- Own reliability of an Internal Developer Platform supporting 300+ services across commercial and GovCloud (FedRAMP) environments, with a 99%+ deployment success rate on AWS (EKS, Lambda, Step Functions, DynamoDB, S3)
- Architected a Terraform-based deployment pipeline using AWS Step Functions as a distributed systems orchestrator, with child execution cancellation, error propagation, and execution-aware dependency resolution across multi-environment sequential pipelines
- Fixed critical GovCloud deployment failures in multi-environment pipelines (EKS, Lambda, Step Functions) by implementing execution-aware dependency resolution, preventing false rejections in concurrent runs (sketched after this list)
- Led organization-wide GitHub Enterprise migration enabling 300+ services to exit change moratorium constraints
- Established SLI/SLO dashboard standards with AI-calibrated error budget thresholds using 60-day baseline data, replacing intuition with statistical analysis (P90, P95, mean+3σ) for Prometheus/Grafana monitoring alerts (threshold calibration sketched after this list)
- Built interactive observability dashboards with one-click SLI drilldowns to root cause analysis using Grafana, OpenTelemetry, and Splunk, reducing MTTR for platform incidents
- Implemented pipeline failure classification metrics differentiating platform errors from service-team configuration issues, improving SLI accuracy from a noisy 42% error rate to actionable signal (instrumentation sketched after this list)
- Added error-count metric instrumentation for external service dependencies, closing observability gaps across the deployment platform
- Recurring presenter to executive leadership and the broader engineering organization on AI platform engineering, code review evaluation, and developer productivity. Drove adoption of multi-agent and AI-assisted reliability practices across teams
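
Illustrative sketches referenced in the bullets above follow. First, the per-agent model selection, deterministic guardrail, and confidence-gated posting from the multi-agent bullet. This is a minimal sketch: the type names, Bedrock-style model ids, and policy patterns are assumptions for illustration, not the production implementation, and the Temporal/Mastra wiring is omitted.

```typescript
// Sketch of per-agent model selection plus the deterministic guardrail
// and confidence gate. All names here are illustrative assumptions.

type ModelId =
  | "anthropic.claude-3-opus"      // assumed Bedrock-style model ids
  | "anthropic.claude-3-5-sonnet"
  | "anthropic.claude-3-haiku";

interface AgentSpec {
  name: string;
  model: ModelId; // per-agent model selection
  systemPrompt: string;
}

// Deterministic guardrail: reject any draft that proposes working
// around operational policy, before anything is posted live.
const POLICY_PATTERNS: RegExp[] = [
  /bypass (?:the )?change freeze/i,
  /disable (?:the )?approval gate/i,
  /skip (?:the )?canary/i,
];

const violatesPolicy = (draft: string): boolean =>
  POLICY_PATTERNS.some((p) => p.test(draft));

interface Finding {
  agent: string;
  text: string;
  confidence: number; // 0..1, produced upstream by the RCA pipeline
}

// Confidence-gated posting: only findings above the threshold that
// also pass the guardrail reach the live thread.
function gateFindings(findings: Finding[], threshold = 0.8): Finding[] {
  return findings.filter(
    (f) => f.confidence >= threshold && !violatesPolicy(f.text)
  );
}

// Two specialists on different models: a cheap model for search, a
// strong model for the final diagnostic narrative.
const agents: AgentSpec[] = [
  { name: "log-searcher", model: "anthropic.claude-3-haiku", systemPrompt: "…" },
  { name: "rca-writer", model: "anthropic.claude-3-opus", systemPrompt: "…" },
];

const posted = gateFindings([
  { agent: "rca-writer", text: "Config drift in the deploy stage", confidence: 0.91 },
  { agent: "rca-writer", text: "Bypass the change freeze and redeploy", confidence: 0.95 },
]);
console.log(agents.length, posted); // only the first finding survives the gate
```

Keeping the gate deterministic (plain predicates, no model call) is what makes it a guardrail rather than another opinion in the loop.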
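Next, the calibration loop behind the code-review evaluation methodology: replay a narrow-scope review skill over labeled historical pull requests and score precision/recall before it ever reaches developers. The types and the `ReviewSkill` signature are assumptions standing in for the real harness.

```typescript
// Sketch of retrospective calibration of a review skill against a
// labeled historical PR corpus. Shapes are illustrative assumptions.

interface LabeledPullRequest {
  id: string;
  diff: string;
  knownIssues: string[]; // issue ids confirmed by humans or incidents
}

interface ReviewResult {
  prId: string;
  flaggedIssues: string[]; // issue ids the skill reported
}

type ReviewSkill = (pr: LabeledPullRequest) => Promise<ReviewResult>;

async function calibrate(skill: ReviewSkill, corpus: LabeledPullRequest[]) {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;

  for (const pr of corpus) {
    const result = await skill(pr);
    const known = new Set(pr.knownIssues);
    for (const id of result.flaggedIssues) {
      if (known.has(id)) truePositives++;
      else falsePositives++; // noise that would have hit a developer
    }
    for (const id of pr.knownIssues) {
      if (!result.flaggedIssues.includes(id)) falseNegatives++; // missed issue
    }
  }

  const precision = truePositives / (truePositives + falsePositives || 1);
  const recall = truePositives / (truePositives + falseNegatives || 1);
  return { precision, recall };
}
```

Scoring against incidents and human-confirmed findings, rather than model self-grading, is what lets a skill be gated on measured precision before org-wide rollout.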
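One of the three token-optimization layers, the MCP tool-result transformer, as a sketch: parse a raw log-search payload and keep only what the agent needs. Field names and limits are assumptions; the 10-15x compression figure comes from production payloads, not this code.

```typescript
// Sketch of an MCP tool-result transformer: compress raw log hits
// before they enter the model context. Shapes are illustrative.

interface RawLogHit {
  timestamp: string;
  level: string;
  message: string;
  // Raw hits typically carry large envelopes the model never needs:
  kubernetes?: Record<string, unknown>;
  tracing?: Record<string, unknown>;
  raw?: string;
}

interface CompactHit {
  t: string;   // timestamp
  lvl: string; // level
  msg: string; // truncated message
}

function compressLogHits(hits: RawLogHit[], maxHits = 20, maxMsg = 300): CompactHit[] {
  return hits
    .filter((h) => h.level === "ERROR" || h.level === "WARN") // drop noise first
    .slice(0, maxHits)                                        // cap result count
    .map((h) => ({
      t: h.timestamp,
      lvl: h.level,
      msg: h.message.length > maxMsg ? h.message.slice(0, maxMsg) + "…" : h.message,
    }));
}

// Rough per-run token accounting (tokens ≈ chars / 4), attributed per
// investigation since the gateway itself exposes no usage API.
function estimateTokens(payload: unknown): number {
  return Math.ceil(JSON.stringify(payload).length / 4);
}
```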
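The execution-aware dependency resolution from the pipeline bullets, sketched over assumed record shapes: a stage's dependency only counts as satisfied by a deployment belonging to the same pipeline execution, which is what stops concurrent runs from unblocking or falsely rejecting each other.

```typescript
// Sketch of execution-aware dependency resolution. Record shapes are
// illustrative; the real ledger lives in the orchestrator's state.

interface DeploymentRecord {
  service: string;
  environment: string;   // e.g. "dev", "stage", "govcloud-prod"
  executionId: string;   // e.g. the Step Functions execution ARN
  status: "SUCCEEDED" | "FAILED" | "RUNNING";
}

function dependenciesSatisfied(
  deps: string[],
  environment: string,
  executionId: string,
  ledger: DeploymentRecord[]
): boolean {
  return deps.every((service) =>
    ledger.some(
      (d) =>
        d.service === service &&
        d.environment === environment &&
        d.executionId === executionId && // the execution-aware check
        d.status === "SUCCEEDED"
    )
  );
}

// Without the executionId comparison, a SUCCEEDED record from a
// concurrent run could unblock this run, and a FAILED one could block
// it: exactly the false rejections seen in multi-environment pipelines.
```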
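The baseline-driven alert calibration from the SLI/SLO bullet: compute P90, P95, and mean+3σ over a 60-day sample window and derive alert thresholds from statistics rather than intuition. A self-contained sketch; pulling the samples from Prometheus is out of scope here.

```typescript
// Sketch of statistical threshold calibration over baseline SLI samples.

function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function calibrateThreshold(samples: number[]) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mean = samples.reduce((s, x) => s + x, 0) / samples.length;
  const variance =
    samples.reduce((s, x) => s + (x - mean) ** 2, 0) / samples.length;
  const sigma = Math.sqrt(variance);
  return {
    p90: percentile(sorted, 90),
    p95: percentile(sorted, 95),
    meanPlus3Sigma: mean + 3 * sigma,
  };
}

// Example: daily error-rate samples (fractions) from the baseline window.
const baseline = [0.011, 0.009, 0.014, 0.01, 0.012 /* …60 days total */];
console.log(calibrateThreshold(baseline));
// A Prometheus/Grafana alert rule would then fire above the chosen statistic.
```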
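Finally, the failure-classification instrumentation from the pipeline-metrics bullet, sketched with prom-client. The metric name, label values, and error-code sets are illustrative assumptions standing in for the real error registry.

```typescript
// Sketch: label each pipeline failure as a platform error or a
// service-team configuration issue so the platform SLI only counts
// what the platform owns.

import client from "prom-client";

const pipelineFailures = new client.Counter({
  name: "pipeline_failures_total",
  help: "Deployment pipeline failures by responsible class",
  labelNames: ["failure_class"],
});

// Deterministic classification from the failure's error code; these
// sets are placeholders for the codified error registry.
const PLATFORM_ERRORS = new Set(["ORCHESTRATOR_TIMEOUT", "STATE_MACHINE_ABORT"]);
const CONFIG_ERRORS = new Set(["INVALID_TFVARS", "MISSING_IAM_ROLE"]);

function recordFailure(errorCode: string): void {
  const failureClass = PLATFORM_ERRORS.has(errorCode)
    ? "platform"
    : CONFIG_ERRORS.has(errorCode)
      ? "service_config"
      : "unclassified";
  pipelineFailures.inc({ failure_class: failureClass });
}

recordFailure("INVALID_TFVARS"); // counted against service config, not the platform SLI
```

Splitting the counter by responsible class is what turned the noisy aggregate error rate into a signal the platform team can actually alert on.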