Lead Site Reliability Engineer / AI Platform Engineer
Feb 2022 — Present · Remote
- Designed and deployed a multi-agent system for 24/7 deployment support automation, built on Temporal for durable execution and Mastra for agent and workflow definition. Seven specialized agents orchestrate parallel investigation with verification-first RCA (live log search and code tracing across pipeline and customer configuration before any diagnostic narrative), confidence-gated live posting, and in-thread human feedback that triggers self-analysis runs. Consumes the company's AWS Bedrock-backed inference platform for the Anthropic Claude family (Opus, Sonnet, Haiku), with per-agent model selection and a deterministic guardrail layer that prevents agents from proposing workarounds to existing operational policies (guardrail and confidence gate sketched after this list). Handles thousands of investigations daily
- Designed an evaluation methodology for AI-generated code review at enterprise scale, enabling narrow-scope review skills (security, compliance, performance, maintainability, operational best practices) to be calibrated against thousands of real production pull requests before reaching developers (calibration loop sketched after this list). Retrospective evaluation against historical incidents demonstrated material preventive value across multiple categories of customer-impacting events. Now scaling from a test team to org-wide deployment
- Reduced the team's debugging time for operational issues from minutes to seconds by building a collection of reusable Claude Code skills encapsulating ~9,800 lines of codified operational knowledge (error registries, diagnostic strategies, troubleshooting workflows) previously locked in Slack threads and tribal knowledge
- Engineered a three-layer token optimization system reducing per-investigation LLM cost by 44-57% (~62K to ~30K tokens). Built a deterministic confidence gate that skips redundant agent calls on ~70% of investigations without quality loss, pipeline-aware config injection that eliminates ~20K tokens of irrelevant domain knowledge per call, and MCP tool result transformers that parse and compress raw log and code search payloads by 10-15x (transformer sketched after this list). Implemented per-run token observability with cost attribution, working around the LLM gateway's lack of a usage API
- Own reliability of an Internal Developer Platform supporting 300+ services across commercial and GovCloud (FedRAMP) environments, with a 99%+ deployment success rate on AWS (EKS, Lambda, Step Functions, DynamoDB, S3)
- Architected a Terraform-based deployment pipeline using AWS Step Functions as a distributed systems orchestrator, with child execution cancellation, error propagation, and execution-aware dependency resolution across multi-environment sequential pipelines
- Fixed critical GovCloud deployment failures in multi-environment pipelines (EKS, Lambda, Step Functions) by implementing execution-aware dependency resolution, preventing false rejections in concurrent runs (sketched after this list)
- Led organization-wide GitHub Enterprise migration enabling 300+ services to exit change moratorium constraints
- Established SLI/SLO dashboard standards with AI-calibrated error budget thresholds using 60-day baseline data, replacing intuition with statistical analysis (P90, P95, mean+3σ) for Prometheus/Grafana monitoring alerts (threshold calibration sketched after this list)
- Built interactive observability dashboards with one-click SLI drilldowns to root cause analysis using Grafana, OpenTelemetry, and Splunk, reducing MTTR for platform incidents
- Implemented pipeline failure classification metrics differentiating platform errors from service-team configuration issues, improving SLI accuracy from a noisy 42% error rate to actionable signal (instrumentation sketched after this list)
- Added error-count metric instrumentation for external service dependencies, closing observability gaps across the deployment platform
- Recurring presenter to executive leadership and the broader engineering organization on AI platform engineering, code review evaluation, and developer productivity. Drove adoption of multi-agent and AI-assisted reliability practices across teams
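
Illustrative sketches referenced in the bullets above follow. First, the per-agent model selection, deterministic guardrail, and confidence-gated posting from the multi-agent bullet. This is a minimal sketch: the type names, Bedrock-style model ids, and policy patterns are assumptions for illustration, not the production implementation, and the Temporal/Mastra wiring is omitted.

```typescript
// Sketch of per-agent model selection plus the deterministic guardrail
// and confidence gate. All names here are illustrative assumptions.

type ModelId =
  | "anthropic.claude-3-opus"      // assumed Bedrock-style model ids
  | "anthropic.claude-3-5-sonnet"
  | "anthropic.claude-3-haiku";

interface AgentSpec {
  name: string;
  model: ModelId; // per-agent model selection
  systemPrompt: string;
}

// Deterministic guardrail: reject any draft that proposes working
// around operational policy, before anything is posted live.
const POLICY_PATTERNS: RegExp[] = [
  /bypass (?:the )?change freeze/i,
  /disable (?:the )?approval gate/i,
  /skip (?:the )?canary/i,
];

const violatesPolicy = (draft: string): boolean =>
  POLICY_PATTERNS.some((p) => p.test(draft));

interface Finding {
  agent: string;
  text: string;
  confidence: number; // 0..1, produced upstream by the RCA pipeline
}

// Confidence-gated posting: only findings above the threshold that
// also pass the guardrail reach the live thread.
function gateFindings(findings: Finding[], threshold = 0.8): Finding[] {
  return findings.filter(
    (f) => f.confidence >= threshold && !violatesPolicy(f.text)
  );
}

// Two specialists on different models: a cheap model for search, a
// strong model for the final diagnostic narrative.
const agents: AgentSpec[] = [
  { name: "log-searcher", model: "anthropic.claude-3-haiku", systemPrompt: "…" },
  { name: "rca-writer", model: "anthropic.claude-3-opus", systemPrompt: "…" },
];

const posted = gateFindings([
  { agent: "rca-writer", text: "Config drift in the deploy stage", confidence: 0.91 },
  { agent: "rca-writer", text: "Bypass the change freeze and redeploy", confidence: 0.95 },
]);
console.log(agents.length, posted); // only the first finding survives the gate
```

Keeping the gate deterministic (plain predicates, no model call) is what makes it a guardrail rather than another opinion in the loop.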
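Next, the calibration loop behind the code-review evaluation methodology: replay a narrow-scope review skill over labeled historical pull requests and score precision/recall before it ever reaches developers. The types and the `ReviewSkill` signature are assumptions standing in for the real harness.

```typescript
// Sketch of retrospective calibration of a review skill against a
// labeled historical PR corpus. Shapes are illustrative assumptions.

interface LabeledPullRequest {
  id: string;
  diff: string;
  knownIssues: string[]; // issue ids confirmed by humans or incidents
}

interface ReviewResult {
  prId: string;
  flaggedIssues: string[]; // issue ids the skill reported
}

type ReviewSkill = (pr: LabeledPullRequest) => Promise<ReviewResult>;

async function calibrate(skill: ReviewSkill, corpus: LabeledPullRequest[]) {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;

  for (const pr of corpus) {
    const result = await skill(pr);
    const known = new Set(pr.knownIssues);
    for (const id of result.flaggedIssues) {
      if (known.has(id)) truePositives++;
      else falsePositives++; // noise that would have hit a developer
    }
    for (const id of pr.knownIssues) {
      if (!result.flaggedIssues.includes(id)) falseNegatives++; // missed issue
    }
  }

  const precision = truePositives / (truePositives + falsePositives || 1);
  const recall = truePositives / (truePositives + falseNegatives || 1);
  return { precision, recall };
}
```

Scoring against incidents and human-confirmed findings, rather than model self-grading, is what lets a skill be gated on measured precision before org-wide rollout.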
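One of the three token-optimization layers, the MCP tool-result transformer, as a sketch: parse a raw log-search payload and keep only what the agent needs. Field names and limits are assumptions; the 10-15x compression figure comes from production payloads, not this code.

```typescript
// Sketch of an MCP tool-result transformer: compress raw log hits
// before they enter the model context. Shapes are illustrative.

interface RawLogHit {
  timestamp: string;
  level: string;
  message: string;
  // Raw hits typically carry large envelopes the model never needs:
  kubernetes?: Record<string, unknown>;
  tracing?: Record<string, unknown>;
  raw?: string;
}

interface CompactHit {
  t: string;   // timestamp
  lvl: string; // level
  msg: string; // truncated message
}

function compressLogHits(hits: RawLogHit[], maxHits = 20, maxMsg = 300): CompactHit[] {
  return hits
    .filter((h) => h.level === "ERROR" || h.level === "WARN") // drop noise first
    .slice(0, maxHits)                                        // cap result count
    .map((h) => ({
      t: h.timestamp,
      lvl: h.level,
      msg: h.message.length > maxMsg ? h.message.slice(0, maxMsg) + "…" : h.message,
    }));
}

// Rough per-run token accounting (tokens ≈ chars / 4), attributed per
// investigation since the gateway itself exposes no usage API.
function estimateTokens(payload: unknown): number {
  return Math.ceil(JSON.stringify(payload).length / 4);
}
```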
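The execution-aware dependency resolution from the pipeline bullets, sketched over assumed record shapes: a stage's dependency only counts as satisfied by a deployment belonging to the same pipeline execution, which is what stops concurrent runs from unblocking or falsely rejecting each other.

```typescript
// Sketch of execution-aware dependency resolution. Record shapes are
// illustrative; the real ledger lives in the orchestrator's state.

interface DeploymentRecord {
  service: string;
  environment: string;   // e.g. "dev", "stage", "govcloud-prod"
  executionId: string;   // e.g. the Step Functions execution ARN
  status: "SUCCEEDED" | "FAILED" | "RUNNING";
}

function dependenciesSatisfied(
  deps: string[],
  environment: string,
  executionId: string,
  ledger: DeploymentRecord[]
): boolean {
  return deps.every((service) =>
    ledger.some(
      (d) =>
        d.service === service &&
        d.environment === environment &&
        d.executionId === executionId && // the execution-aware check
        d.status === "SUCCEEDED"
    )
  );
}

// Without the executionId comparison, a SUCCEEDED record from a
// concurrent run could unblock this run, and a FAILED one could block
// it: exactly the false rejections seen in multi-environment pipelines.
```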
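The baseline-driven alert calibration from the SLI/SLO bullet: compute P90, P95, and mean+3σ over a 60-day sample window and derive alert thresholds from statistics rather than intuition. A self-contained sketch; pulling the samples from Prometheus is out of scope here.

```typescript
// Sketch of statistical threshold calibration over baseline SLI samples.

function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function calibrateThreshold(samples: number[]) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mean = samples.reduce((s, x) => s + x, 0) / samples.length;
  const variance =
    samples.reduce((s, x) => s + (x - mean) ** 2, 0) / samples.length;
  const sigma = Math.sqrt(variance);
  return {
    p90: percentile(sorted, 90),
    p95: percentile(sorted, 95),
    meanPlus3Sigma: mean + 3 * sigma,
  };
}

// Example: daily error-rate samples (fractions) from the baseline window.
const baseline = [0.011, 0.009, 0.014, 0.01, 0.012 /* …60 days total */];
console.log(calibrateThreshold(baseline));
// A Prometheus/Grafana alert rule would then fire above the chosen statistic.
```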
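Finally, the failure-classification instrumentation from the pipeline-metrics bullet, sketched with prom-client. The metric name, label values, and error-code sets are illustrative assumptions standing in for the real error registry.

```typescript
// Sketch: label each pipeline failure as a platform error or a
// service-team configuration issue so the platform SLI only counts
// what the platform owns.

import client from "prom-client";

const pipelineFailures = new client.Counter({
  name: "pipeline_failures_total",
  help: "Deployment pipeline failures by responsible class",
  labelNames: ["failure_class"],
});

// Deterministic classification from the failure's error code; these
// sets are placeholders for the codified error registry.
const PLATFORM_ERRORS = new Set(["ORCHESTRATOR_TIMEOUT", "STATE_MACHINE_ABORT"]);
const CONFIG_ERRORS = new Set(["INVALID_TFVARS", "MISSING_IAM_ROLE"]);

function recordFailure(errorCode: string): void {
  const failureClass = PLATFORM_ERRORS.has(errorCode)
    ? "platform"
    : CONFIG_ERRORS.has(errorCode)
      ? "service_config"
      : "unclassified";
  pipelineFailures.inc({ failure_class: failureClass });
}

recordFailure("INVALID_TFVARS"); // counted against service config, not the platform SLI
```

Splitting the counter by responsible class is what turned the noisy aggregate error rate into a signal the platform team can actually alert on.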