Why Ad Hoc Chaos Programs Stall Out
Chaos engineering works when it's systematic. Ad hoc execution means experiments cluster around incidents (when everyone is already stretched) or quiet Fridays (when no one wants to trigger a page). Manual blast radius analysis, which means mapping Datadog service dependencies before each experiment, is the work that gets cut when time is short. Without it, experiments get cancelled or scoped down until they reveal nothing useful. The program survives on paper but doesn't build real resilience knowledge.
How an AI Agent Makes Chaos Engineering a Scheduled Practice
An AI Labor Company agent mines your Gremlin experiment history and Datadog service dependency maps to understand your environment's topology and historical failure patterns. The agent then proposes a quarterly chaos experiment schedule mapped to your actual service graph, and generates an automated blast radius analysis from Datadog dependency data before each experiment. Each proposed experiment routes to the principal SRE for approval before it runs. Kubernetes namespace scoping, PagerDuty on-call schedules, and Slack routing are all incorporated into the scheduling logic. Teams running this workflow typically see a 55–75% reduction in the manual work associated with each experiment, with the agent live in about 6 weeks.
The Business Case: Resilience Coverage as Risk Reduction
Doubling your chaos experiment cadence without adding SRE effort means covering roughly twice as many services, failure modes, and dependency paths in a given quarter. For a $100M–$600M ARR SaaS company, unplanned downtime carries real revenue and churn risk. More importantly, the failure modes that cause major incidents are usually the ones no one tested. A systematic chaos program that actually runs on schedule is a risk-reduction investment: the more failure modes you test in advance, the fewer are left to surface as incidents. The agent doesn't prevent incidents by itself, but it builds the organizational knowledge that does.
How does the agent determine blast radius for a proposed experiment?
The agent uses Datadog service dependency maps to trace the upstream and downstream dependencies of each targeted service or Kubernetes namespace, then cross-references Gremlin experiment history to identify services that have previously shown a wider-than-expected blast radius under similar experiments. The output is a structured blast radius assessment routed to the principal SRE for sign-off before the experiment runs.
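To make the idea concrete, here is a minimal sketch of how a blast radius assessment could be derived from a dependency map. It is an illustration only, not the agent's actual implementation: the `calls` dictionary stands in for dependency data already exported from Datadog, `history_flags` stands in for services flagged from past Gremlin experiments, and every function and service name is hypothetical.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from `start` (excluding `start`) via breadth-first search."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(graph):
    """Flip edge direction so that callers can be traced from a callee."""
    inverted = {}
    for src, dsts in graph.items():
        for dst in dsts:
            inverted.setdefault(dst, []).append(src)
    return inverted


def blast_radius(calls, target, history_flags=frozenset()):
    """Build a structured assessment for one target service.

    calls: hypothetical map of service -> services it calls.
    history_flags: hypothetical set of services that surprised us in past experiments.
    """
    downstream = reachable(calls, target)          # what the target depends on
    upstream = reachable(invert(calls), target)    # what depends on the target
    in_scope = upstream | downstream | {target}
    return {
        "target": target,
        "upstream": sorted(upstream),
        "downstream": sorted(downstream),
        "previously_surprising": sorted(in_scope & set(history_flags)),
    }


# Hypothetical service graph used purely for illustration.
calls = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "payments": ["ledger"],
    "search": ["inventory"],
}
assessment = blast_radius(calls, "checkout", history_flags={"ledger"})
print(assessment)
```

In this sketch the assessment for `checkout` lists `web` as upstream, `inventory`, `ledger`, and `payments` as downstream, and flags `ledger` for extra attention. A dictionary like this is the kind of structured output that could be attached to the approval request sent to the principal SRE.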
Can the agent automatically schedule around PagerDuty on-call rotations and deployment windows?
Yes. The scheduling logic incorporates PagerDuty on-call calendars and can be configured to avoid deployment freeze windows, major release periods, or other blackout times. If a scheduling conflict arises for a proposed experiment, the agent surfaces it in the approval request rather than scheduling around it silently.
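As a rough sketch of the conflict check described above, the following assumes on-call shifts and blackout windows have already been fetched (for example from PagerDuty and a release calendar) and reduced to simple time windows. The `Window` type, the `find_conflicts` function, and the rule that a single on-call shift must cover the whole experiment are all assumptions made for illustration, not the agent's documented behavior.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Window:
    """A labeled time range, e.g. a deployment freeze or an on-call shift (hypothetical type)."""
    label: str
    start: datetime
    end: datetime


def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) share any time."""
    return a_start < b_end and b_start < a_end


def find_conflicts(exp_start, exp_end, blackouts, on_call_shifts):
    """Return human-readable conflicts to surface in the approval request."""
    conflicts = [
        f"overlaps blackout: {w.label}"
        for w in blackouts
        if overlaps(exp_start, exp_end, w.start, w.end)
    ]
    # Assumption: the experiment should sit entirely inside one on-call shift.
    covered = any(s.start <= exp_start and exp_end <= s.end for s in on_call_shifts)
    if not covered:
        conflicts.append("no single on-call shift covers the full experiment window")
    return conflicts


# Hypothetical windows used purely for illustration.
freeze = Window("release freeze", datetime(2024, 3, 14), datetime(2024, 3, 16))
shift = Window("primary on-call", datetime(2024, 3, 11, 9), datetime(2024, 3, 18, 9))

during_freeze = find_conflicts(
    datetime(2024, 3, 15, 10), datetime(2024, 3, 15, 11), [freeze], [shift]
)
clear_slot = find_conflicts(
    datetime(2024, 3, 12, 10), datetime(2024, 3, 12, 11), [freeze], [shift]
)
print(during_freeze)
print(clear_slot)
```

Returning the conflicts as a list, rather than silently moving the experiment, mirrors the behavior described above: the approver sees exactly why a slot is problematic and decides what to do.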