Bob Ulrich

Lead Site Reliability Engineer

Phoenix, Arizona

AI-native SRE and platform engineer with 15+ years at Salesforce, Oracle, and Microsoft, supporting services with 100M+ users. Built 17+ agentic AI skills for Claude Code and Cursor, developed parallel AI-powered code review agents, and prototyped LLM-driven support automation achieving 80-98% ticket resolution. Deep expertise in Kubernetes, Terraform, observability, and cloud-native Infrastructure as Code (IaC), plus hands-on prompt engineering, MCP integration, and AI agent development. Proven cross-functional technical leader who mentors engineering teams on AI-assisted developer experience and SRE best practices.

Government Security Clearance

Experience

4 yrs 1 mo

Lead Site Reliability Engineer

Feb 2022 — Present

Remote

  • Built agentic AI skill framework with 17+ specialized skills for Claude Code and Cursor IDE integration, enabling automated code review, deployment orchestration, and infrastructure management; adopted across engineering teams
  • Developed parallel AI-powered PR review system spawning 4 concurrent agentic workflows (code quality, security, test coverage, architecture) to address review bottleneck from AI-accelerated code generation volume
  • Created multi-dimensional code analysis tool for GitHub Enterprise repositories with automated severity classification (critical/warning/suggestion) and optional AI-generated PR comment posting
  • Prototyped AI agent integration to automate 80-98% of engineering support requests, authoring PRD and proof-of-concept for automated ticket resolution across support channels
  • Built AI-powered work item management skill enabling natural language queries, CRUD operations, sprint detection, and automated report formatting on internal tracking systems
  • Shipped AI cost tracking and FinOps dashboard providing team visibility into LLM API consumption patterns and optimization opportunities across engineering tools
  • Established SLI/SLO dashboard standards with AI-calibrated thresholds using 60-day baseline data, replacing intuition with statistical analysis (P90, P95, mean+3σ) for monitoring alerts
  • Built interactive observability dashboards with one-click SLI drilldowns to root cause analysis, reducing MTTR for platform incidents across Terraform deployment pipelines
  • Fixed critical AWS GovCloud deployment failure in multi-environment sequential pipelines (EKS, Lambda, Step Functions) by implementing execution-aware dependency resolution, preventing false rejections in concurrent runs
  • Added pipeline failure classification metrics enabling differentiation between platform errors and service-team configuration issues, improving SLI accuracy from a noisy 42% error rate
  • Supported 300+ services in production deployment pipeline with 99% commercial and 33% government cloud adoption; implemented FedRAMP security controls across technical environments
  • Published reusable AI skill framework adopted across engineering teams; mentored engineers on prompt engineering best practices, agentic workflow design patterns, and AI integration with enterprise tooling (GitHub, AWS, Grafana)
  • Building fully autonomous AI agent platform using Mastra.ai to automate operational issue analysis, remediation, and support requests end-to-end, reducing headcount requirements and enabling 24/7 cross-timezone coverage
  • Led organization-wide migration to new GitHub Enterprise organization, enabling services to exit change moratorium constraints

3 yrs 2 mos

Principal Site Reliability Engineer

Jul 2021 — Feb 2022

Phoenix, AZ

  • Promoted to Principal on Substrate Compute Data Plane team managing 10,000+ hypervisors with capacity planning for local storage fleet
  • Eliminated 90% of manual deployment toil by migrating tmux/script-based process to automated Infrastructure as Code (IaC) pipeline
  • Reduced deployment errors to near-zero, cutting rollback incidents by 75%; authored runbooks and blameless post-mortem processes for incident resolution

Senior Site Reliability Engineer

Dec 2018 — Jul 2021

Seattle, WA

  • Managed fleet of 15,000+ hypervisors on VM Infrastructure Data Plane team supporting Oracle Cloud compute with capacity planning and toil reduction initiatives
  • Reduced on-call escalations by 30% through trend analysis, runbook automation, and targeted toil elimination of repeat incidents
  • Operated large-scale QEMU/KVM infrastructure with iSCSI storage at 99.9%+ availability; led blameless post-mortems driving reliability improvements

Senior Site Reliability Engineer

Feb 2018 — Dec 2018

Redmond, WA

  • Led Kubernetes migration enabling 50% faster deployments and reducing infrastructure costs by 30%
  • Built CI/CD pipeline cutting release cycle from weeks to hours, increasing deployment frequency 10x
  • Implemented EFK logging stack reducing mean-time-to-detection (MTTD) for production issues by 60%

5 yrs 7 mos

Service Engineer II, OneDrive

Nov 2016 — Feb 2018

Redmond, WA

  • Operated one of the world's largest consumer storage services serving hundreds of millions of users
  • Contributed to Azure microservice migration reducing deployment complexity and improving scalability

Service Engineer II, SharePoint Online

Jul 2012 — Nov 2016

Redmond, WA

  • Maintained 99.99% SLA for 100M+ monthly active users as key contributor to 24x7 SRE team
  • Reduced catastrophic storage failures by 67% within 1 year through rigorous process improvements
  • Engineered SharePoint update process reducing farm-wide installation time from days to hours (3x faster)
  • Resolved HP miniport driver flaw eliminating storage communication failures across 10,000+ machines
  • Automated certificate lifecycle reducing manual effort by 80% and eliminating expiration incidents

Wimmer Solutions

1 yr 3 mos

Service Engineer (Microsoft Contractor)

Mar 2011 — Jun 2012

Redmond, WA

  • Executed load balancer migration for 4,000+ servers enabling hardware SSL offloading and 50% faster handshakes
  • Developed automation scripts reducing manual configuration time by 90% and eliminating human error

John L Scott Real Estate

3 yrs 8 mos

System Administrator

Oct 2008 — Mar 2011

Portland, OR

  • Led Active Directory and Exchange migration with zero data loss during M&A, maintaining 100% uptime
  • Reduced WAN costs by 25% and improved branch uptime to 99.5% through Cisco ASA upgrades
  • Cut software deployment time by 80% through SCCM implementation across 50+ branch offices

Field Technician

Jul 2007 — Oct 2008

Bellevue, WA

  • Supported 10+ branch offices and 500+ users across Washington with 95%+ ticket resolution rate
  • Improved network performance by 40% through Cisco Catalyst upgrade with VLAN segmentation

Skills

Cloud Platforms

AWS, Azure, Oracle Cloud, Salesforce

AWS Services

EKS, Lambda, DynamoDB, Step Functions, S3, EC2

Infrastructure

Kubernetes, Linux, QEMU/KVM, Docker, Windows Server

Languages & Scripting

Python, Go, Bash, Java, JSON/YAML, PowerShell

AI & Agentic Workflows

Claude Code, Claude API, MCP (Model Context Protocol), Prompt Engineering, LLM Integration, AI Agents, Agentic Workflows, Mastra.ai, Cursor, GitHub Copilot

SRE & Observability

Prometheus, Grafana, OpenTelemetry, Elasticsearch, Kibana, SLI/SLO, Incident Response, On-Call, Blameless Post-Mortems, Runbooks

CI/CD & Infrastructure as Code

GitHub Actions, Argo Workflows, Terraform, GitOps, Jenkins, Infrastructure as Code (IaC)

Networking

DNS, Load Balancing, API Gateway, PKI, SSL/TLS

Education & Certifications

Cascadia College

Certificate — Cisco Network Academy

FAA Part 107 sUAS Pilot

Federal Aviation Administration

May 2025

CCENT

Cisco

Mar 2011