Case study · 2025

Agentic DevOps Copilot for Incident Triage

↓ 20% MTTR | Subscription platform | AWS · EKS · ArgoCD | Claude + tool use | Databricks · Snowflake

An on-call rotation drowning in noisy alerts, runbooks scattered across wikis, and a backlog of small recurring incidents that ate engineering time. We built an LLM-powered DevOps copilot that triages incoming alerts, correlates them with logs and metrics, proposes a fix, and opens a PR with the patch attached. It cut MTTR by 20% in the first quarter.

The problem

The platform team supported a sprawling AWS estate (multiple EKS clusters, Aurora PostgreSQL, OpenSearch, Lambda, Kinesis) for a multi-tenant subscription platform with consumer-facing and partner-facing workloads. Alert volume was high, and the on-call engineer spent the first 20 minutes of every page on the same recurring detective work: which service, which deployment, what changed, where's the runbook.

Three patterns came up over and over:

The constraints

Three non-negotiables shaped the design:

Architecture

The copilot is a Claude-powered agent with a tightly scoped tool surface. When an alert fires from Prometheus or CloudWatch, an EventBridge rule invokes a Lambda that hands the alert to the agent loop. The agent then reasons over the alert and decides which tools to call.

Alertmanager / CloudWatch
     │
     ▼
EventBridge ──▶ Lambda (agent runtime)
                  │
                  ▼
             ┌────────────┐
             │   Claude   │  (reasoning + tool selection)
             └────┬───────┘
                  │
   ┌──────────────┼──────────────┬──────────────┬─────────────┐
   ▼              ▼              ▼              ▼             ▼
   read_logs      read_metrics   read_runbook   query_history open_pr
   (OpenSearch)   (Prometheus)   (Confluence)   (Databricks)  (Git/PR)
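The flow in the diagram can be sketched as a small dispatch loop. This is a minimal sketch, not the production runtime: the `decide` callable stands in for the Claude call, the tool bodies are stubs, and `TOOLS`, `run_agent`, `max_steps`, and the example alert are assumptions for illustration. Only the five tool names come from the diagram.

```python
from typing import Callable

# Tool names come from the architecture diagram; the bodies are placeholder stubs.
TOOLS: dict[str, Callable[[dict], str]] = {
    "read_logs": lambda args: f"logs for {args['service']}",
    "read_metrics": lambda args: f"metrics for {args['service']}",
    "read_runbook": lambda args: f"runbook for {args['service']}",
    "query_history": lambda args: f"history for {args['fingerprint']}",
    "open_pr": lambda args: f"PR opened: {args['title']}",
}


def run_agent(alert: dict, decide: Callable[[dict, list], dict], max_steps: int = 8) -> dict:
    """Run the tool loop until a final answer arrives or the step budget runs out.

    `decide` is a stand-in for the model call. It receives the alert plus the
    transcript so far and returns either {"tool": name, "args": {...}} or
    {"final": text}.
    """
    transcript: list[dict] = []
    for _ in range(max_steps):
        action = decide(alert, transcript)
        if "final" in action:
            return {"summary": action["final"], "transcript": transcript}
        name = action["tool"]
        if name not in TOOLS:
            # Anything outside the allowlist is refused and recorded, never executed.
            transcript.append({"tool": name, "error": "tool not allowed"})
            continue
        output = TOOLS[name](action["args"])
        transcript.append({"tool": name, "args": action["args"], "output": output})
    return {"summary": "step budget exhausted; escalating to on-call", "transcript": transcript}


def scripted_decide(alert: dict, transcript: list) -> dict:
    """A scripted stand-in for the model, used only to exercise the loop."""
    if not transcript:
        return {"tool": "read_logs", "args": {"service": alert["service"]}}
    return {"final": f"Investigated {alert['service']} using {len(transcript)} tool call(s)"}


# Hypothetical alert payload; real alerts would carry many more fields.
result = run_agent({"service": "billing-api", "name": "HighErrorRate"}, scripted_decide)
print(result["summary"])
```

Keeping the allowlist in one dictionary means that adding a capability is a reviewed code change, not something the model can talk its way into.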

The tools are deliberately narrow and read-mostly:

The on-call engineer gets a Slack message: "I think this is X because of Y, here's the runbook section, here's a PR if you want it." They can approve, edit, or reject. Every interaction is logged.
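One way to shape that Slack message is a Block Kit payload with three buttons. This is a sketch under assumptions: the function name, the `action_id` values, and the example strings and URLs are hypothetical, and the source does not say which Slack API the team used.

```python
import json


def build_triage_message(diagnosis: str, evidence: str, runbook_url: str, pr_url: str) -> dict:
    """Build a Slack Block Kit payload mirroring the approve / edit / reject choice."""

    def button(label: str, action_id: str, style: str = "") -> dict:
        element = {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            "action_id": action_id,  # hypothetical IDs, routed by the interaction handler
            "value": pr_url,
        }
        if style:
            element["style"] = style  # Slack accepts "primary" or "danger"
        return element

    text = (
        f"I think this is *{diagnosis}* because of {evidence}.\n"
        f"<{runbook_url}|Runbook section> · <{pr_url}|Proposed PR>"
    )
    return {
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": text}},
            {
                "type": "actions",
                "elements": [
                    button("Approve", "triage_approve", "primary"),
                    button("Edit", "triage_edit"),
                    button("Reject", "triage_reject", "danger"),
                ],
            },
        ]
    }


# Placeholder values for illustration only.
message = build_triage_message(
    diagnosis="connection pool exhaustion",
    evidence="a spike in pool wait time after the last deploy",
    runbook_url="https://example.com/runbook#pool",
    pr_url="https://example.com/pr/1",
)
print(json.dumps(message, indent=2))
```

Each button click arrives as an interaction event, which gives a natural place to write the audit log entry for that decision.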

Implementation notes

The PR is the unit of trust

The breakthrough wasn't the LLM — it was treating the version-controlled pull request as the contract between the agent and the team. A PR has reviewers, CI checks, deploy gates, rollback. The engineering team already trusts that workflow. We slotted the agent into that workflow rather than asking the team to trust a new one.
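To make the "PR as contract" idea concrete, the `open_pr` tool can be limited to producing a pull request against a protected branch, never a direct push. This is a minimal sketch assuming a GitHub-style host (the source only says Git/PR). The fields `title`, `head`, `base`, `body`, and `draft` follow GitHub's create-pull-request REST endpoint; the branch prefix, function name, alert ID, and the choice to open as a draft are hypothetical.

```python
import re

PROTECTED_BASE = "main"           # assumption: the branch guarded by reviews and CI
AGENT_BRANCH_PREFIX = "copilot/"  # hypothetical naming convention for agent branches


def build_pr_request(alert_id: str, summary: str, evidence: list[str]) -> dict:
    """Build the request body for a pull request; nothing is sent over the network here."""
    # Turn the summary into a short, branch-safe slug.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower())[:40].strip("-")
    body_lines = [f"Proposed by the triage copilot for alert `{alert_id}`.", "", "Evidence:"]
    body_lines += [f"- {item}" for item in evidence]
    return {
        "title": f"[copilot] {summary}",
        "head": f"{AGENT_BRANCH_PREFIX}{alert_id}-{slug}",
        "base": PROTECTED_BASE,
        "body": "\n".join(body_lines),
        "draft": True,  # assumption: a human marks it ready for review
    }


# Hypothetical alert and evidence strings.
request = build_pr_request(
    "ALERT-123",
    "Raise memory limit for billing-api",
    ["read_logs: OOMKilled events in the last hour", "read_metrics: memory at limit"],
)
print(request["head"])
```

Because the tool only ever returns a PR request, the existing reviewers, CI checks, and deploy gates stay in the path of every change the agent proposes.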

Ground every claim in a tool call

The agent is prompted to never make a claim without citing a tool call output. If it can't find evidence, it says so. If it found contradictory evidence, it surfaces both. This is dull, slow agent behaviour — and it's the only kind that engineers trust.
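Prompting alone does not guarantee grounding, so one option is to check the answer mechanically before it reaches Slack. This is a sketch under assumptions: the claim structure, `check_grounding`, and the call IDs are hypothetical, not the format used in the project.

```python
def check_grounding(claims: list[dict], tool_outputs: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every claim cites a real tool call."""
    problems = []
    for claim in claims:
        cited = claim.get("evidence", [])
        if not cited:
            problems.append(f"uncited claim: {claim['text']}")
            continue
        for call_id in cited:
            if call_id not in tool_outputs:
                problems.append(f"unknown tool call {call_id!r} cited by: {claim['text']}")
    return problems


# Hypothetical transcript: tool call IDs mapped to their raw outputs.
tool_outputs = {
    "call_1": "OOMKilled x14 in billing-api pods",
    "call_2": "memory usage at 100% of limit since 09:12",
}
claims = [
    {"text": "Pods are being OOM-killed", "evidence": ["call_1"]},
    {"text": "The last deploy caused it", "evidence": []},
    {"text": "Memory is pinned at the limit", "evidence": ["call_9"]},
]

for problem in check_grounding(claims, tool_outputs):
    print(problem)
```

A response with any problems can be sent back to the agent for another pass, or posted with the unsupported claims flagged as such.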

Past incidents are the killer dataset

Most LLM-for-DevOps demos focus on real-time signals. The bigger unlock was historical. Querying Databricks for "last 90 days of alerts matching this fingerprint, plus the resolution that closed them" turned the agent from a clever generalist into a colleague who'd been on-call here before.
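The historical lookup can be sketched as a stable alert fingerprint plus a parameterised query. This is illustrative only: the table and column names (`ops.alerts`, `ops.resolutions`, `fired_at`, and so on), the choice of labels, and the hashing scheme are assumptions, not the project's schema.

```python
import hashlib

# Assumption: these three labels identify "the same kind of alert".
FINGERPRINT_KEYS = ("alertname", "service", "cluster")

# Hypothetical schema; `:fingerprint` is a named parameter marker.
HISTORY_QUERY = """
SELECT a.fired_at, a.summary, r.resolution
FROM ops.alerts AS a
JOIN ops.resolutions AS r ON r.alert_id = a.alert_id
WHERE a.fingerprint = :fingerprint
  AND a.fired_at >= date_sub(current_date(), 90)
ORDER BY a.fired_at DESC
"""


def alert_fingerprint(labels: dict[str, str]) -> str:
    """Hash the identifying labels so recurring alerts map to the same key.

    Volatile labels such as pod names are ignored, and keys are sorted so the
    result does not depend on label order.
    """
    canonical = "|".join(f"{key}={labels.get(key, '')}" for key in sorted(FINGERPRINT_KEYS))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def build_history_query(labels: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Return the SQL text and its parameters, ready for a SQL client to execute."""
    return HISTORY_QUERY, {"fingerprint": alert_fingerprint(labels)}


# Hypothetical label sets: same alert, different pod and label order.
first = {"alertname": "HighErrorRate", "service": "billing-api", "cluster": "prod-1", "pod": "a"}
second = {"pod": "b", "cluster": "prod-1", "service": "billing-api", "alertname": "HighErrorRate"}
print(alert_fingerprint(first) == alert_fingerprint(second))
```

Passing the fingerprint as a parameter keeps model-influenced text out of the SQL string itself.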

What didn't work

Results

↓ 20% MTTR reduction (Q1)
~30% of recurring incidents auto-triaged
0 production changes without human PR review

Beyond the MTTR number, three softer wins:

What I'd do differently

Tools used

Claude · AWS Lambda · EventBridge · OpenSearch · Prometheus · Grafana · CloudWatch · Confluence API · Databricks Lakehouse · Snowflake · ArgoCD · EKS · Terraform · Slack · Python · Bash · Go.

Building something similar?

I work with platform and SRE teams introducing AI-augmented DevOps without breaking trust. Available outside IR35, inside IR35, permanent, or fractional.

Schedule a call →