Automated Incident Response with AI: A Governance-First Guide

The fastest way to kill your production environment is to give an LLM root access.

Most marketing for automated incident response with AI makes it sound like a magic fix-it button. It is not. If you ship an autonomous agent with uncontrolled access to your infrastructure, you are just waiting for a regression that takes down your entire stack. I have seen automated scripts interpret a standard database migration as a security breach and start killing active connections to isolate the threat. That is a self-inflicted incident, and it is usually the result of hype-driven development.

We do not need more magic. We need observability and deterministic guardrails. In this post, I will show you how to build a pipeline that uses AI to analyze incidents while keeping the actual remediation actions locked behind a strict policy layer. We are building a system where the AI proposes a solution, but a hardcoded allowlist and a human-in-the-loop verify it before any command hits a server.

What you will have at the end

By the end of this guide, you will have a functional incident triage pipeline. It will ingest alerts from your monitoring tools, use an LLM via OpenRouter to categorize the severity and suggest a fix, and then pass that fix through a validation script. If the fix involves a high-risk action like a rollback or a service restart, the system will pause and wait for a manual override. You will also have a cost-tracking mechanism to ensure your observability data does not burn through your budget in token costs.

Prerequisites

Before you start, you need a few things in place. I am assuming you are running a modern stack where you can programmatically interact with your infrastructure.

A Python 3.10+ environment.
An API key from OpenRouter to access multiple models without vendor lock-in. This is critical because models like Gemini or GPT-4o have different failure modes during an incident.
Access to Selzee for monitoring your site health and Shopify inventory. We will use their Slack alerts as our primary data source.
A basic understanding of JSON schemas. If you are new to this, check out my previous post on Generative UI Patterns: Building with Deterministic Schemas.

Step 1: Hooking into the Observability Source

You cannot respond to an incident you cannot see. While many people use raw logs, I prefer using structured alerts. We will use Selzee to monitor site health and inventory spikes. When Selzee detects a drop in site health, it sends a payload to our triage script.

Create a simple FastAPI endpoint to receive these alerts. We want to capture the raw error message, the affected service, and the timestamp. Do not try to clean the data yet. Let the raw data flow so the LLM has full context, but be careful with backpressure. If you get a flood of alerts, you do not want to trigger a thousand LLM calls at $0.05 each.

from fastapi import FastAPI, Request

app = FastAPI()

# is_rate_limited() and process_incident() are helpers you define elsewhere.

@app.post("/alerts/selzee")
async def handle_alert(request: Request):
    data = await request.json()
    # Simple backpressure check: skip services that are already flooding us
    if is_rate_limited(data.get("service_id", "unknown")):
        return {"status": "ignored", "reason": "rate_limit"}

    process_incident(data)
    return {"status": "received"}

Step 2: The LLM Analysis Loop

Now we send the alert data to OpenRouter. I use OpenRouter because it allows us to switch models if one provider is having an incident of its own. It happens more often than you think. We want the LLM to return a structured JSON response, not a paragraph of text. If the output is free-form prose, your guardrails have nothing stable to validate and they will fail.

We need to force the model to categorize the incident into a pre-defined set of types: latency, connectivity, resource_exhaustion, or unknown. This prevents the AI from getting creative with its diagnosis. It is worth noting that for complex infra, running local LLMs for coding can be a better choice for privacy, but for quick triage, OpenRouter is faster.

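import json
import os

import requests

OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]  # assumes the key is exported in the environment

def get_ai_triage(alert_text):
    headers = {"Authorization": f"Bearer {OPENROUTER_API_KEY}"}
    prompt = (
        f"Analyze this incident: {alert_text}. "
        "Return JSON with fields: category (one of latency, connectivity, "
        "resource_exhaustion, unknown), severity, root_cause, and suggested_action."
    )

    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers=headers,
        json={
            "model": "google/gemini-pro-1.5",
            "messages": [{"role": "user", "content": prompt}],
            "response_format": {"type": "json_object"},
        },
        timeout=30,
    )
    response.raise_for_status()
    # The model returns JSON as a string; parse it so callers get a dict.
    return json.loads(response.json()["choices"][0]["message"]["content"])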
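Switching models when a provider is down can be done on the client side. The sketch below tries each model in order and stops at the first success. The call_model parameter is a hypothetical callable that I am introducing for illustration; in practice it would wrap the OpenRouter request above with the model name passed in.

```python
def triage_with_fallback(alert_text, call_model, models):
    """Try each model in order and return (model, result) from the first that succeeds.

    call_model is a hypothetical callable taking (model, alert_text).
    """
    errors = []
    for model in models:
        try:
            return model, call_model(model, alert_text)
        except Exception as exc:  # provider outage, timeout, malformed response
            errors.append(f"{model}: {exc}")
    # Every model failed: surface that instead of guessing a diagnosis.
    raise RuntimeError("all triage models failed: " + "; ".join(errors))
```

Raising when every model fails is deliberate. A triage pipeline that cannot reach any model should page a human, not invent an answer.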

Step 3: Implementing the Governance Guardrails

This is the most important part. Never let the output of get_ai_triage go directly to a shell. You must implement an allowlist. If the AI suggests a rollback, it must match a specific version pattern. If it suggests a restart, it must be for a service in the non-critical tier.

We use a policy-driven verification layer. This layer checks the AI suggestion against your company's compliance and safety rules. For example, in highly regulated industries, any autonomous action on a production database is a legal liability.
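The two rules above (rollbacks must match a version pattern, restarts only on the non-critical tier) can be expressed as a small pure function. This is a sketch under assumptions: the semantic-version pattern, the service names, and the check_policy helper are all hypothetical and stand in for your own conventions.

```python
import re

# Assumed conventions for illustration: vMAJOR.MINOR.PATCH tags and these service tiers.
VERSION_PATTERN = re.compile(r"v\d+\.\d+\.\d+")
NON_CRITICAL_SERVICES = {"image-resizer", "recommendations"}


def check_policy(action, service_id, target_version=None):
    """Return (allowed, reason) for a proposed remediation."""
    if action == "rollback":
        # fullmatch rejects anything appended to a valid-looking version string.
        if not isinstance(target_version, str) or not VERSION_PATTERN.fullmatch(target_version):
            return False, "rollback target does not match the version pattern"
        return True, "rollback target is well-formed"
    if action == "restart":
        if service_id not in NON_CRITICAL_SERVICES:
            return False, "restart is only permitted for non-critical services"
        return True, "service is in the non-critical tier"
    # Anything without an explicit rule is denied by default.
    return False, f"no policy rule covers action {action!r}"
```

Returning a reason alongside the decision makes the block visible to whoever reviews the incident later.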

Metric	Manual Response	AI with Guardrails
Time to Identify	15 to 30 mins	< 1 min
Initial Triage Cost	$150 (Engineer time)	$0.05 (Tokens)
Risk of Human Error	High (Fatigue)	Low (Deterministic checks)
Risk of System Error	Low	Medium (Non-deterministic)
Setup Cost	$0	$5,000+ (Dev time)

Here is how you implement the guardrail in Python:

ALLOWED_ACTIONS = ["clear_cache", "restart_canary", "notify_on_call"]
CRITICAL_SERVICES = ["payment-gateway", "auth-service"]

def verify_and_execute(ai_suggestion, service_id):
    # Anything that is not a parsed JSON object is treated as "no action"
    if not isinstance(ai_suggestion, dict):
        return "Manual intervention required"

    # A missing key yields None, which is not in the allowlist, so we fail safe
    action = ai_suggestion.get("suggested_action")

    if action not in ALLOWED_ACTIONS:
        print(f"Blocked unauthorized action: {action}")
        return "Manual intervention required"

    if service_id in CRITICAL_SERVICES:
        print("Action requested on critical service. Waiting for human approval.")
        send_to_slack_for_approval(ai_suggestion)
        return "Pending approval"

    # Only now do we execute
    execute_remediation(action)
    return "Executed"
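The "Pending approval" branch needs somewhere to park the proposal until a human responds. Here is a rough sketch of that hand-off. The ticket store, request_approval, and resolve_approval are hypothetical names, and the in-memory dict is only for illustration; the Slack side is left out entirely.

```python
import uuid

# In-memory store for illustration only; a real system would persist this.
_pending = {}


def request_approval(action, service_id):
    """Park a proposed action and return the ticket ID an approver must reference."""
    ticket_id = uuid.uuid4().hex
    _pending[ticket_id] = {"action": action, "service_id": service_id}
    return ticket_id


def resolve_approval(ticket_id, approved, approver, execute):
    """Run the parked action only if a named human approved it; a ticket resolves once."""
    proposal = _pending.pop(ticket_id, None)
    if proposal is None:
        return "unknown or already resolved ticket"
    if not approved:
        return f"rejected by {approver}"
    execute(proposal["action"])
    return f"executed after approval by {approver}"
```

Popping the ticket before doing anything else means a double click on the approve button cannot run the same remediation twice.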

Troubleshooting

Automated incident response with AI is often flaky. Here are the common failures I see in production environments:

Adversarial AI Risks: An attacker who understands your automation might intentionally trigger a pattern of alerts to force the AI into a self-inflicted denial of service. For example, if they know a 403 spike triggers an IP block, they can spoof a partner's IP range to get your own partners blocked. Always keep a human in the loop for security-related alerts.
Automation Bias: Junior analysts might stop questioning the AI. If the dashboard says "AI says restart the DB," they might just click okay without looking at the logs. You must force the UI to show the raw observability data before the approval button is enabled. You can use tools like v0 to quickly build these internal triage UIs.
Non-Deterministic Outputs: Even with JSON mode, LLMs sometimes hallucinate field names. Your code must handle missing keys gracefully and default to a safe state (no action).
Legacy Infrastructure: If you are working with old on-premise systems that lack standardized logging, the AI will struggle. You will spend more time engineering the data than building the AI. Sometimes a simple bash script is better than an LLM.
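For the non-deterministic output problem, a defensive parser helps. The sketch below accepts only the fields it recognizes and falls back to a no-action result on anything else. The category field name and the parse_triage helper are my assumptions; adjust them to whatever schema your prompt actually asks for.

```python
import json

VALID_CATEGORIES = {"latency", "connectivity", "resource_exhaustion", "unknown"}
# The safe state: no category claimed, no action proposed.
SAFE_DEFAULT = {"category": "unknown", "severity": "unknown", "suggested_action": None}


def parse_triage(raw):
    """Parse model output, falling back to a no-action result on anything unexpected."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return dict(SAFE_DEFAULT)
    if not isinstance(data, dict):
        return dict(SAFE_DEFAULT)

    result = dict(SAFE_DEFAULT)
    category = data.get("category")
    if isinstance(category, str) and category in VALID_CATEGORIES:
        result["category"] = category
    severity = data.get("severity")
    if isinstance(severity, str):
        result["severity"] = severity
    # A misspelled key (e.g. "sugested_action") simply leaves the action as None.
    action = data.get("suggested_action")
    if isinstance(action, str):
        result["suggested_action"] = action
    return result
```

Because the default action is None, a hallucinated field name turns into "do nothing" instead of a crash or a surprise command.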

Next steps

Once your basic triage is stable, you can look into more advanced orchestration. Tools like Devin are starting to show promise for autonomous software engineering, which could eventually handle complex code-level regressions. However, for now, focus on the low-hanging fruit: automated triage and safe, pre-approved remediation.

Your goal should be to reduce the cognitive load on your on-call engineers, not to replace them. Every automated action should be logged in a post-mortem document for review. If an automated action causes a rollback, that is an incident in itself and needs to be analyzed with the same rigor as a manual mistake.
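Logging every automated action can be as simple as an append-only file with one JSON object per line. This is a sketch, not a full audit system; the field names and the audit_record and append_audit helpers are hypothetical.

```python
import json
import time


def audit_record(action, service_id, outcome, source="automation", now=None):
    """Build one JSON line describing an automated action for later post-mortem review."""
    record = {
        "timestamp": time.time() if now is None else now,
        "source": source,
        "action": action,
        "service_id": service_id,
        "outcome": outcome,
    }
    return json.dumps(record, sort_keys=True)


def append_audit(path, line):
    """Append a record to a log file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
```

Recording the outcome for blocked and rejected actions, not just executed ones, gives the post-mortem a full picture of what the AI tried to do.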

For more on how to manage the costs of these systems, read my analysis on AI for SEO Content at Scale: Managing the $0.14 Unit Cost of Decay. The unit economics of AI incident response are better, but only if you prevent the infinite loop of automated failures.
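One blunt way to stop a feedback loop from running up the bill is a hard spend cap that is checked before every LLM call. The sketch below is an assumption-heavy illustration: TokenBudget is a hypothetical class, the per-call cost is a flat estimate, and amounts are kept in whole cents to avoid floating-point drift.

```python
class TokenBudget:
    """Track estimated LLM spend in cents and refuse calls once a cap is reached."""

    def __init__(self, daily_cap_cents, cost_per_call_cents):
        self.daily_cap_cents = daily_cap_cents
        self.cost_per_call_cents = cost_per_call_cents
        self.spent_cents = 0

    def try_spend(self):
        """Reserve the cost of one call; return False if it would exceed the cap."""
        if self.spent_cents + self.cost_per_call_cents > self.daily_cap_cents:
            return False
        self.spent_cents += self.cost_per_call_cents
        return True

    def reset(self):
        """Call at the start of each day (scheduling is left out of this sketch)."""
        self.spent_cents = 0
```

When try_spend returns False, skip the LLM call and notify the on-call engineer directly. At 5 cents per call, a cap of 500 cents allows 100 triage calls per day before the pipeline falls back to humans.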

To test your setup, trigger a non-critical alert in your staging environment. Verify that the AI identifies it correctly but is blocked by your guardrail when it tries to perform an action on a protected service. If it bypasses your guardrail, you have a bug in your logic, not the AI. Fix the logic before you ship to production.

For further reading on standard incident procedures, refer to the NIST Computer Security Incident Handling Guide (SP 800-61) and RFC 9110, which obsoletes RFC 7231, for standard HTTP status code interpretations.
