Chaos Engineering

Introduction¶

This guide provides comprehensive standards for implementing chaos engineering experiments and synthetic monitoring to validate system resilience. It covers experiment design, safety measures, and proactive monitoring strategies.

Table of Contents¶

Chaos Engineering Philosophy
Chaos Mesh for Kubernetes
Litmus Chaos
Chaos Monkey Patterns
Experiment Design
Safety and Blast Radius
Synthetic Monitoring Overview
Datadog Synthetics
Checkly
Uptime and SLO Standards
CI/CD Integration
Best Practices

Chaos Engineering Philosophy¶

Core Principles¶

┌─────────────────────────────────────────────────────────────────┐
│                  Chaos Engineering Workflow                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│   │   HYPOTHESIZE │ → │   EXPERIMENT │ → │    ANALYZE   │     │
│   │              │    │              │    │              │      │
│   │  Define      │    │  Inject      │    │  Measure     │      │
│   │  steady      │    │  controlled  │    │  impact on   │      │
│   │  state       │    │  failures    │    │  steady      │      │
│   │              │    │              │    │  state       │      │
│   └──────────────┘    └──────────────┘    └──────────────┘      │
│          │                   │                   │               │
│          └───────────────────┼───────────────────┘               │
│                              │                                   │
│                       ┌──────▼──────┐                           │
│                       │   IMPROVE   │                           │
│                       │   SYSTEM    │                           │
│                       └─────────────┘                           │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Chaos Maturity Model¶

# chaos-maturity-assessment.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-maturity-levels
data:
  levels: |
    Level 0 - Ad Hoc:
      - No formal chaos practices
      - Reactive incident response only
      - No resilience testing

    Level 1 - Initial:
      - Manual failure injection in staging
      - Basic health checks exist
      - Some runbooks documented

    Level 2 - Managed:
      - Scheduled chaos experiments
      - Automated rollback mechanisms
      - Incident correlation with experiments

    Level 3 - Defined:
      - Chaos experiments in CI/CD
      - Hypothesis-driven experiments
      - Blast radius controls enforced

    Level 4 - Measured:
      - Continuous chaos in production
      - SLO-based experiment triggers
      - Automated experiment analysis

    Level 5 - Optimized:
      - AI-driven chaos selection
      - Self-healing systems validated
      - Chaos as culture embedded

Chaos Mesh for Kubernetes¶

Chaos Mesh Installation¶

# Install Chaos Mesh using Helm
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Create namespace
kubectl create namespace chaos-mesh

# Install with recommended settings
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set dashboard.securityMode=true \
  --version 2.6.3

# Verify installation
kubectl get pods -n chaos-mesh

Pod Chaos Experiments¶

# Good - Pod failure experiment with safety controls
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-api-service
  namespace: chaos-testing
  labels:
    experiment-type: pod-failure
    team: platform
    severity: medium
spec:
  action: pod-failure
  mode: one
  value: "1"
  duration: "30s"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
      chaos-enabled: "true"
    expressionSelectors:
      - key: environment
        operator: In
        values:
          - production
          - staging
  scheduler:
    cron: "@every 4h"

# Good - Pod kill experiment with percentage mode
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-worker-nodes
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: fixed-percent
  value: "25"
  duration: "1m"
  gracePeriod: 30
  selector:
    namespaces:
      - production
    labelSelectors:
      app: worker
      chaos-enabled: "true"
    podPhaseSelectors:
      - Running

# Good - Container kill targeting specific container
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: container-kill-sidecar
  namespace: chaos-testing
spec:
  action: container-kill
  mode: one
  containerNames:
    - envoy-sidecar
  duration: "45s"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-gateway

Network Chaos Experiments¶

# Good - Network delay injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-database
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
  delay:
    latency: "100ms"
    jitter: "20ms"
    correlation: "25"
  direction: to
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: postgres
    mode: all
  duration: "2m"

# Good - Network partition simulation
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition-zones
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      zone: us-east-1a
  direction: both
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        zone: us-east-1b
    mode: all
  duration: "30s"

# Good - Packet loss injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss-external
  namespace: chaos-testing
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  loss:
    loss: "10"
    correlation: "50"
  direction: to
  externalTargets:
    - stripe.com
    - api.stripe.com
  duration: "5m"

# Good - Bandwidth limitation
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-bandwidth-limit
  namespace: chaos-testing
spec:
  action: bandwidth
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: file-upload-service
  bandwidth:
    rate: "1mbps"
    limit: 100
    buffer: 10000
  direction: to
  duration: "10m"

Stress Chaos Experiments¶

# Good - CPU stress test
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-api
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: "5m"
  containerNames:
    - api

# Good - Memory stress test
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-cache
  namespace: chaos-testing
spec:
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: cache-service
  stressors:
    memory:
      workers: 4
      size: "256Mi"
      oomScoreAdj: -1000
  duration: "3m"

IO Chaos Experiments¶

# Good - IO delay injection
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-delay-database
  namespace: chaos-testing
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgres
  volumePath: /var/lib/postgresql/data
  path: "**/*"
  delay: "100ms"
  percent: 50
  duration: "2m"

# Good - IO fault injection
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-fault-storage
  namespace: chaos-testing
spec:
  action: fault
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: object-storage
  volumePath: /data
  path: "/data/uploads/*"
  errno: 5
  percent: 10
  duration: "1m"

HTTP Chaos Experiments¶

# Good - HTTP abort injection
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: http-abort-external-api
  namespace: chaos-testing
spec:
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: integration-service
  target: Request
  port: 80
  method: GET
  path: /api/external/*
  abort: true
  percent: 50
  duration: "2m"

# Good - HTTP delay injection
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: http-delay-upstream
  namespace: chaos-testing
spec:
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-gateway
  target: Response
  port: 8080
  delay: "2s"
  percent: 25
  code: 200
  duration: "5m"

# Good - HTTP response replacement
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: http-error-injection
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: user-service
  target: Response
  port: 8080
  path: /api/users/*
  replace:
    code: 503
    body: '{"error": "Service temporarily unavailable"}'
    headers:
      Retry-After: "30"
  percent: 10
  duration: "3m"

DNS Chaos Experiments¶

# Good - DNS error injection
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-error-external
  namespace: chaos-testing
spec:
  action: error
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: notification-service
  patterns:
    - "smtp.sendgrid.net"
    - "*.sendgrid.net"
  duration: "1m"

# Good - DNS random response
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-random-database
  namespace: chaos-testing
spec:
  action: random
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
  patterns:
    - "postgres.internal"
  duration: "30s"

Chaos Mesh Workflow¶

# Good - Sequential chaos workflow
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: comprehensive-resilience-test
  namespace: chaos-testing
spec:
  entry: entry-point
  templates:
    - name: entry-point
      templateType: Serial
      deadline: 30m
      children:
        - network-delay-phase
        - pod-failure-phase
        - stress-test-phase
        - cleanup-phase

    - name: network-delay-phase
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces:
            - production
          labelSelectors:
            app: api-service
        delay:
          latency: "200ms"
        duration: "3m"

    - name: pod-failure-phase
      templateType: PodChaos
      deadline: 5m
      podChaos:
        action: pod-failure
        mode: one
        selector:
          namespaces:
            - production
          labelSelectors:
            app: api-service
        duration: "2m"

    - name: stress-test-phase
      templateType: StressChaos
      deadline: 10m
      stressChaos:
        mode: one
        selector:
          namespaces:
            - production
          labelSelectors:
            app: api-service
        stressors:
          cpu:
            workers: 2
            load: 70
        duration: "5m"

    - name: cleanup-phase
      templateType: Suspend
      deadline: 1m
      suspend:
        duration: "30s"

# Good - Parallel chaos workflow
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: multi-service-chaos
  namespace: chaos-testing
spec:
  entry: parallel-experiments
  templates:
    - name: parallel-experiments
      templateType: Parallel
      deadline: 15m
      children:
        - api-chaos
        - database-chaos
        - cache-chaos

    - name: api-chaos
      templateType: PodChaos
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces:
            - production
          labelSelectors:
            app: api-service
        duration: "5m"

    - name: database-chaos
      templateType: NetworkChaos
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces:
            - production
          labelSelectors:
            app: postgres
        delay:
          latency: "50ms"
        duration: "5m"

    - name: cache-chaos
      templateType: StressChaos
      stressChaos:
        mode: one
        selector:
          namespaces:
            - production
          labelSelectors:
            app: redis
        stressors:
          memory:
            workers: 1
            size: "100Mi"
        duration: "5m"

Litmus Chaos¶

Litmus Installation¶

# Install Litmus using Helm
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Create namespace
kubectl create namespace litmus

# Install Litmus ChaosCenter
helm install chaos litmuschaos/litmus \
  --namespace litmus \
  --set portal.frontend.service.type=LoadBalancer \
  --set mongodb.persistence.enabled=true \
  --set mongodb.persistence.size=20Gi

# Verify installation
kubectl get pods -n litmus

# Get access URL
kubectl get svc -n litmus

ChaosEngine Configuration¶

# Good - Basic ChaosEngine
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos-engine
  namespace: default
  labels:
    context: chaos-testing
spec:
  engineState: active
  annotationCheck: "true"
  appinfo:
    appns: default
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
            - name: PODS_AFFECTED_PERC
              value: "50"

# Good - ChaosEngine with probes
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-resilience-test
  namespace: production
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: "app=api-service"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: health-check
            type: httpProbe
            mode: Continuous
            runProperties:
              probeTimeout: 5s
              retry: 3
              interval: 2s
            httpProbe/inputs:
              url: http://api-service.production.svc:8080/health
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==
                  responseCode: "200"
          - name: prometheus-check
            type: promProbe
            mode: Edge
            runProperties:
              probeTimeout: 5s
              retry: 2
            promProbe/inputs:
              endpoint: http://prometheus.monitoring.svc:9090
              query: sum(rate(http_requests_total{status=~"5.."}[1m]))
              comparator:
                type: float
                criteria: "<="
                value: "0.1"
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: CHAOS_INTERVAL
              value: "15"

Litmus Workflows¶

# Good - Comprehensive Litmus workflow
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: resilience-workflow
  namespace: litmus
spec:
  entrypoint: resilience-test
  serviceAccountName: argo-chaos
  securityContext:
    runAsUser: 1000
    runAsNonRoot: true
  arguments:
    parameters:
      - name: adminModeNamespace
        value: litmus
  templates:
    - name: resilience-test
      steps:
        - - name: install-experiment
            template: install-experiment

        - - name: run-chaos
            template: run-chaos

        - - name: verify-results
            template: verify-results

    - name: install-experiment
      inputs:
        artifacts:
          - name: pod-delete-experiment
            path: /tmp/pod-delete.yaml
            raw:
              data: |
                apiVersion: litmuschaos.io/v1alpha1
                kind: ChaosExperiment
                metadata:
                  name: pod-delete
                  namespace: litmus
                spec:
                  definition:
                    scope: Namespaced
                    permissions:
                      - apiGroups: [""]
                        resources: ["pods"]
                        verbs: ["delete", "get", "list"]
                    image: litmuschaos/go-runner:latest
                    imagePullPolicy: Always
                    args:
                      - -c
                      - ./experiments -name pod-delete
                    command:
                      - /bin/bash
                    env:
                      - name: TOTAL_CHAOS_DURATION
                        value: "30"
                      - name: CHAOS_INTERVAL
                        value: "10"
      container:
        name: install
        image: litmuschaos/k8s:latest
        command: [kubectl, apply, -f, /tmp/pod-delete.yaml]

    - name: run-chaos
      container:
        name: run-chaos
        image: litmuschaos/litmus-checker:latest
        args:
          - -file=/tmp/chaosengine.yaml
        env:
          - name: APP_NAMESPACE
            value: production
          - name: APP_LABEL
            value: app=api-service
          - name: EXPERIMENT_NAME
            value: pod-delete

    - name: verify-results
      container:
        name: verify
        image: curlimages/curl:latest
        command:
          - /bin/sh
          - -c
          - |
            # Check application health
            status=$(curl -s -o /dev/null -w "%{http_code}" http://api-service.production.svc:8080/health)
            if [ "$status" != "200" ]; then
              echo "Health check failed with status: $status"
              exit 1
            fi
            echo "Application recovered successfully"

Litmus Experiments Library¶

# Good - Container kill experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: container-kill-engine
  namespace: production
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: "app=api-service"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: container-kill
      spec:
        components:
          env:
            - name: TARGET_CONTAINER
              value: "api"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: SIGNAL
              value: "SIGKILL"

# Good - Network loss experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-loss-engine
  namespace: production
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: "app=api-service"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: NETWORK_INTERFACE
              value: "eth0"
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: "50"
            - name: CONTAINER_RUNTIME
              value: "containerd"
            - name: DESTINATION_IPS
              value: "10.0.0.0/8"

# Good - Disk fill experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: disk-fill-engine
  namespace: production
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: "app=database"
    appkind: statefulset
  chaosServiceAccount: litmus-admin
  experiments:
    - name: disk-fill
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: FILL_PERCENTAGE
              value: "80"
            - name: EPHEMERAL_STORAGE_MEBIBYTES
              value: "500"
            - name: DATA_BLOCK_SIZE
              value: "1024"

Chaos Monkey Patterns¶

AWS Chaos Monkey Configuration¶

#!/bin/bash
# chaos-monkey-setup.sh

# Install Chaos Monkey (Simian Army)
git clone https://github.com/Netflix/chaosmonkey.git
cd chaosmonkey

# Configure Chaos Monkey
cat > chaosmonkey.toml << 'EOF'
[chaosmonkey]
enabled = true
schedule_enabled = true
leashed = false
accounts = ["production"]
start_hour = 9
end_hour = 17
time_zone = "America/New_York"

[chaosmonkey.decryptor]
type = "aws.kms"

[chaosmonkey.outage_checker]
type = "pagerduty"
api_key_decrypt = "encrypted:AQECAHg..."

[chaosmonkey.terminator]
type = "aws.asg"

[chaosmonkey.tracker]
type = "dynamodb"
table_name = "chaosmonkey-state"
EOF

# Start Chaos Monkey
./chaosmonkey migrate
./chaosmonkey schedule

# chaos_monkey_controller.py
"""
@module chaos_monkey_controller
@description Chaos Monkey management and safety controls
@version 1.0.0
@author Tyler Dukes
@last_updated 2025-01-31
@status stable
"""

import boto3
from datetime import datetime, timedelta
from typing import Optional
import logging

logger = logging.getLogger(__name__)


class ChaosMonkeyController:
    """Controller for Chaos Monkey experiments with safety controls."""

    def __init__(
        self,
        region: str = "us-east-1",
        dry_run: bool = False,
    ) -> None:
        self.ec2 = boto3.client("ec2", region_name=region)
        self.asg = boto3.client("autoscaling", region_name=region)
        self.cloudwatch = boto3.client("cloudwatch", region_name=region)
        self.dry_run = dry_run
        self.protected_tags = ["chaos-protected", "production-critical"]

    def is_safe_to_terminate(self, instance_id: str) -> bool:
        """Check if instance can be safely terminated."""
        response = self.ec2.describe_instances(InstanceIds=[instance_id])
        instance = response["Reservations"][0]["Instances"][0]
        tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}

        if any(tag in tags for tag in self.protected_tags):
            logger.warning(f"Instance {instance_id} is protected")
            return False

        if tags.get("chaos-enabled") != "true":
            logger.warning(f"Instance {instance_id} not opted-in")
            return False

        return True

    def check_health_metrics(self, asg_name: str) -> bool:
        """Verify ASG health before chaos injection."""
        response = self.asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name]
        )
        asg = response["AutoScalingGroups"][0]

        healthy_count = sum(
            1 for i in asg["Instances"]
            if i["HealthStatus"] == "Healthy"
        )
        min_healthy = asg["MinSize"] + 1

        if healthy_count < min_healthy:
            logger.warning(
                f"ASG {asg_name} has insufficient healthy instances: "
                f"{healthy_count} < {min_healthy}"
            )
            return False

        return True

    def terminate_random_instance(
        self,
        asg_name: str,
        probability: float = 1.0,
    ) -> Optional[str]:
        """Terminate random instance with safety checks."""
        import random

        if random.random() > probability:
            logger.info("Chaos skipped due to probability")
            return None

        if not self.check_health_metrics(asg_name):
            logger.warning("Aborting: health check failed")
            return None

        response = self.asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name]
        )
        instances = response["AutoScalingGroups"][0]["Instances"]
        eligible = [
            i for i in instances
            if i["HealthStatus"] == "Healthy"
            and self.is_safe_to_terminate(i["InstanceId"])
        ]

        if not eligible:
            logger.warning("No eligible instances for termination")
            return None

        victim = random.choice(eligible)
        instance_id = victim["InstanceId"]

        if self.dry_run:
            logger.info(f"DRY RUN: Would terminate {instance_id}")
            return instance_id

        self.ec2.terminate_instances(InstanceIds=[instance_id])
        logger.info(f"Terminated instance: {instance_id}")
        return instance_id

    def run_scheduled_chaos(
        self,
        asg_name: str,
        schedule: str = "0 10 * * MON-FRI",
        probability: float = 0.5,
    ) -> None:
        """Run chaos on schedule with business hours check."""
        now = datetime.now()

        if not (9 <= now.hour < 17):
            logger.info("Outside business hours, skipping chaos")
            return

        if now.weekday() >= 5:
            logger.info("Weekend, skipping chaos")
            return

        self.terminate_random_instance(asg_name, probability)

Kubernetes Chaos Monkey¶

# Good - Kube-monkey configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-monkey-config
  namespace: kube-system
data:
  config.toml: |
    [kubemonkey]
    run_hour = 10
    start_hour = 10
    end_hour = 16
    time_zone = "America/New_York"
    blacklisted_namespaces = ["kube-system", "monitoring", "istio-system"]
    whitelisted_namespaces = ["production", "staging"]

    [debug]
    enabled = true
    schedule_immediate_kill = false

    [notifications]
    enabled = true
    endpoint = "https://hooks.slack.com/services/xxx"

# Good - Deployment with chaos opt-in
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: production
  labels:
    app: api-service
    kube-monkey/enabled: "true"
    kube-monkey/identifier: "api-service"
    kube-monkey/mtbf: "2"
    kube-monkey/kill-mode: "fixed"
    kube-monkey/kill-value: "1"
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
        kube-monkey/enabled: "true"
        kube-monkey/identifier: "api-service"
    spec:
      containers:
        - name: api
          image: api-service:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi

Experiment Design¶

Hypothesis Template¶

# Good - Experiment hypothesis document
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-experiment-hypothesis
  namespace: chaos-testing
data:
  experiment.yaml: |
    experiment:
      name: "API Service Pod Failure Resilience"
      id: "EXP-2025-001"
      date: "2025-01-31"
      owner: "platform-team"

    hypothesis:
      steady_state:
        description: "API responds to 99.9% of requests within 500ms"
        metrics:
          - name: "success_rate"
            query: "sum(rate(http_requests_total{status=~'2..'}[5m])) / sum(rate(http_requests_total[5m]))"
            expected: ">= 0.999"
          - name: "p99_latency"
            query: "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
            expected: "<= 0.5"

      prediction: |
        When 25% of API pods are terminated, the system will:
        1. Continue serving requests without errors
        2. Maintain p99 latency under 750ms during recovery
        3. Recover to steady state within 60 seconds

    experiment:
      type: "pod-failure"
      target:
        namespace: "production"
        selector: "app=api-service"
      parameters:
        mode: "fixed-percent"
        value: "25"
        duration: "2m"

    safety:
      abort_conditions:
        - "error_rate > 0.05"
        - "p99_latency > 2s"
        - "available_pods < 2"
      rollback_procedure: |
        1. Delete ChaosExperiment CR
        2. Wait for pod recovery
        3. Verify health endpoints
        4. Check dependent services

    observation:
      dashboards:
        - "https://grafana.internal/d/api-service"
      alerts:
        - "APIHighErrorRate"
        - "APIHighLatency"
      logs:
        - "kubectl logs -l app=api-service -n production --since=10m"

Experiment Runbook¶

# chaos_experiment_runner.py
"""
@module chaos_experiment_runner
@description Automated chaos experiment execution with safety controls
@version 1.0.0
@author Tyler Dukes
@last_updated 2025-01-31
@status stable
"""

import asyncio
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Callable
import httpx
from prometheus_api_client import PrometheusConnect


class ExperimentState(Enum):
    """Experiment lifecycle states."""

    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    ABORTED = "aborted"
    FAILED = "failed"


@dataclass
class SteadyStateMetric:
    """Definition of a steady state metric."""

    name: str
    query: str
    threshold: float
    comparator: str = ">="


@dataclass
class ExperimentConfig:
    """Chaos experiment configuration."""

    name: str
    hypothesis: str
    duration_seconds: int
    steady_state_metrics: list[SteadyStateMetric]
    abort_threshold_seconds: int = 30
    cooldown_seconds: int = 60


class ChaosExperimentRunner:
    """Execute chaos experiments with safety controls."""

    def __init__(
        self,
        prometheus_url: str,
        slack_webhook: str | None = None,
    ) -> None:
        self.prom = PrometheusConnect(url=prometheus_url)
        self.slack_webhook = slack_webhook
        self.state = ExperimentState.PENDING

    async def check_steady_state(
        self,
        metrics: list[SteadyStateMetric],
    ) -> tuple[bool, dict]:
        """Verify all steady state metrics are within thresholds."""
        results = {}
        all_pass = True

        for metric in metrics:
            result = self.prom.custom_query(metric.query)
            if not result:
                results[metric.name] = {"value": None, "pass": False}
                all_pass = False
                continue

            value = float(result[0]["value"][1])
            passes = self._evaluate_threshold(
                value,
                metric.threshold,
                metric.comparator,
            )
            results[metric.name] = {"value": value, "pass": passes}
            if not passes:
                all_pass = False

        return all_pass, results

    def _evaluate_threshold(
        self,
        value: float,
        threshold: float,
        comparator: str,
    ) -> bool:
        """Evaluate value against threshold."""
        ops = {
            ">=": lambda v, t: v >= t,
            "<=": lambda v, t: v <= t,
            ">": lambda v, t: v > t,
            "<": lambda v, t: v < t,
            "==": lambda v, t: v == t,
        }
        return ops.get(comparator, lambda v, t: False)(value, threshold)

    async def run_experiment(
        self,
        config: ExperimentConfig,
        inject_chaos: Callable[[], None],
        stop_chaos: Callable[[], None],
    ) -> dict:
        """Execute chaos experiment with full lifecycle."""
        start_time = datetime.now()
        experiment_log = {
            "name": config.name,
            "start_time": start_time.isoformat(),
            "hypothesis": config.hypothesis,
            "events": [],
        }

        try:
            self.state = ExperimentState.PENDING
            await self._notify(f"Starting experiment: {config.name}")

            passes, results = await self.check_steady_state(
                config.steady_state_metrics
            )
            experiment_log["pre_steady_state"] = results

            if not passes:
                experiment_log["events"].append({
                    "time": datetime.now().isoformat(),
                    "event": "abort",
                    "reason": "Pre-experiment steady state check failed",
                })
                self.state = ExperimentState.ABORTED
                return experiment_log

            self.state = ExperimentState.RUNNING
            inject_chaos()
            experiment_log["events"].append({
                "time": datetime.now().isoformat(),
                "event": "chaos_injected",
            })

            abort_start = None
            for _ in range(config.duration_seconds):
                await asyncio.sleep(1)

                passes, results = await self.check_steady_state(
                    config.steady_state_metrics
                )
                experiment_log["events"].append({
                    "time": datetime.now().isoformat(),
                    "event": "metric_check",
                    "results": results,
                })

                if not passes:
                    if abort_start is None:
                        abort_start = datetime.now()
                    elif (datetime.now() - abort_start).seconds > config.abort_threshold_seconds:
                        stop_chaos()
                        experiment_log["events"].append({
                            "time": datetime.now().isoformat(),
                            "event": "abort",
                            "reason": "Steady state violation exceeded threshold",
                        })
                        self.state = ExperimentState.ABORTED
                        return experiment_log
                else:
                    abort_start = None

            stop_chaos()
            experiment_log["events"].append({
                "time": datetime.now().isoformat(),
                "event": "chaos_stopped",
            })

            await asyncio.sleep(config.cooldown_seconds)
            passes, results = await self.check_steady_state(
                config.steady_state_metrics
            )
            experiment_log["post_steady_state"] = results

            self.state = (
                ExperimentState.COMPLETED
                if passes
                else ExperimentState.FAILED
            )

        except Exception as e:
            stop_chaos()
            experiment_log["events"].append({
                "time": datetime.now().isoformat(),
                "event": "error",
                "error": str(e),
            })
            self.state = ExperimentState.FAILED

        experiment_log["end_time"] = datetime.now().isoformat()
        experiment_log["state"] = self.state.value
        await self._notify(
            f"Experiment {config.name} completed: {self.state.value}"
        )

        return experiment_log

    async def _notify(self, message: str) -> None:
        """Send notification to Slack."""
        if not self.slack_webhook:
            return

        async with httpx.AsyncClient() as client:
            await client.post(
                self.slack_webhook,
                json={"text": message},
            )

Safety and Blast Radius¶

Blast Radius Controls¶

# Good - Namespace isolation for chaos
apiVersion: v1
kind: Namespace
metadata:
  name: chaos-sandbox
  labels:
    chaos-mesh.org/inject: enabled
    environment: sandbox
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-isolation
  namespace: chaos-sandbox
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              chaos-mesh.org/inject: enabled
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              chaos-mesh.org/inject: enabled

# Good - RBAC for chaos experiments
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-experimenter
  namespace: production
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["podchaos", "networkchaos"]
    verbs: ["create", "delete", "get", "list"]
    resourceNames: []
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-experimenter-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: chaos-runner
    namespace: chaos-testing
roleRef:
  kind: Role
  name: chaos-experimenter
  apiGroup: rbac.authorization.k8s.io

Emergency Stop Mechanism¶

# Good - Emergency stop ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-circuit-breaker
  namespace: chaos-testing
data:
  enabled: "true"
  emergency_stop: "false"
  max_concurrent_experiments: "3"
  excluded_namespaces: |
    kube-system
    monitoring
    istio-system
    cert-manager

# circuit_breaker.py
"""
@module circuit_breaker
@description Chaos experiment circuit breaker with emergency stop
@version 1.0.0
@author Tyler Dukes
@last_updated 2025-01-31
@status stable
"""

from kubernetes import client, config
from typing import NamedTuple
import logging

logger = logging.getLogger(__name__)


class CircuitState(NamedTuple):
    """Circuit breaker state."""

    enabled: bool
    emergency_stop: bool
    max_concurrent: int
    excluded_namespaces: list[str]


class ChaosCircuitBreaker:
    """Circuit breaker for chaos experiments."""

    def __init__(self, namespace: str = "chaos-testing") -> None:
        config.load_incluster_config()
        self.v1 = client.CoreV1Api()
        self.namespace = namespace
        self.config_name = "chaos-circuit-breaker"

    def get_state(self) -> CircuitState:
        """Get current circuit breaker state."""
        cm = self.v1.read_namespaced_config_map(
            name=self.config_name,
            namespace=self.namespace,
        )
        return CircuitState(
            enabled=cm.data.get("enabled", "true").lower() == "true",
            emergency_stop=cm.data.get("emergency_stop", "false").lower() == "true",
            max_concurrent=int(cm.data.get("max_concurrent_experiments", "3")),
            excluded_namespaces=cm.data.get("excluded_namespaces", "").strip().split("\n"),
        )

    def is_experiment_allowed(
        self,
        target_namespace: str,
        current_experiments: int,
    ) -> tuple[bool, str]:
        """Check if experiment is allowed to run."""
        state = self.get_state()

        if not state.enabled:
            return False, "Chaos experiments are disabled"

        if state.emergency_stop:
            return False, "Emergency stop is active"

        if target_namespace in state.excluded_namespaces:
            return False, f"Namespace {target_namespace} is excluded"

        if current_experiments >= state.max_concurrent:
            return False, f"Max concurrent experiments ({state.max_concurrent}) reached"

        return True, "Experiment allowed"

    def trigger_emergency_stop(self, reason: str) -> None:
        """Trigger emergency stop for all experiments."""
        logger.critical(f"EMERGENCY STOP triggered: {reason}")

        cm = self.v1.read_namespaced_config_map(
            name=self.config_name,
            namespace=self.namespace,
        )
        cm.data["emergency_stop"] = "true"
        cm.data["emergency_stop_reason"] = reason

        self.v1.patch_namespaced_config_map(
            name=self.config_name,
            namespace=self.namespace,
            body=cm,
        )

        self._cleanup_all_experiments()

    def _cleanup_all_experiments(self) -> None:
        """Delete all running chaos experiments."""
        custom_api = client.CustomObjectsApi()

        chaos_types = [
            ("chaos-mesh.org", "v1alpha1", "podchaos"),
            ("chaos-mesh.org", "v1alpha1", "networkchaos"),
            ("chaos-mesh.org", "v1alpha1", "stresschaos"),
            ("chaos-mesh.org", "v1alpha1", "iochaos"),
        ]

        for group, version, plural in chaos_types:
            try:
                experiments = custom_api.list_cluster_custom_object(
                    group=group,
                    version=version,
                    plural=plural,
                )
                for exp in experiments.get("items", []):
                    custom_api.delete_namespaced_custom_object(
                        group=group,
                        version=version,
                        namespace=exp["metadata"]["namespace"],
                        plural=plural,
                        name=exp["metadata"]["name"],
                    )
                    logger.info(
                        f"Deleted {plural}/{exp['metadata']['name']} "
                        f"in {exp['metadata']['namespace']}"
                    )
            except client.ApiException as e:
                logger.error(f"Failed to cleanup {plural}: {e}")

    def reset_emergency_stop(self) -> None:
        """Reset emergency stop flag."""
        cm = self.v1.read_namespaced_config_map(
            name=self.config_name,
            namespace=self.namespace,
        )
        cm.data["emergency_stop"] = "false"
        cm.data.pop("emergency_stop_reason", None)

        self.v1.patch_namespaced_config_map(
            name=self.config_name,
            namespace=self.namespace,
            body=cm,
        )
        logger.info("Emergency stop has been reset")

Synthetic Monitoring Overview¶

Monitoring Architecture¶

┌─────────────────────────────────────────────────────────────────┐
│                  Synthetic Monitoring Stack                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│   │   API       │    │  BROWSER    │    │   SSL/DNS   │        │
│   │   CHECKS    │    │   TESTS     │    │   MONITORS  │        │
│   │             │    │             │    │             │         │
│   │  HTTP/HTTPS │    │  User       │    │  Certificate│        │
│   │  endpoints  │    │  journeys   │    │  expiry     │        │
│   │  health     │    │  flows      │    │  DNS health │        │
│   └──────┬──────┘    └──────┬──────┘    └──────┬──────┘        │
│          │                  │                  │                 │
│          └──────────────────┼──────────────────┘                │
│                             │                                    │
│                      ┌──────▼──────┐                            │
│                      │   ALERTING  │                            │
│                      │   & SLOs    │                            │
│                      └─────────────┘                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Datadog Synthetics¶

API Tests¶

# Good - Datadog API synthetic test
apiVersion: datadoghq.com/v1alpha1
kind: DatadogSynthetic
metadata:
  name: api-health-check
  namespace: monitoring
spec:
  name: "API Health Check"
  type: api
  subtype: http
  status: live
  message: "API health check failed - {{#is_alert}}ALERT{{/is_alert}}"
  tags:
    - "env:production"
    - "team:platform"
    - "service:api"
  request:
    method: GET
    url: "https://api.example.com/health"
    timeout: 30
    headers:
      Accept: "application/json"
      X-Request-ID: "synthetic-{{$randomUUID}}"
  assertions:
    - type: statusCode
      operator: is
      target: 200
    - type: responseTime
      operator: lessThan
      target: 500
    - type: header
      property: content-type
      operator: contains
      target: "application/json"
    - type: body
      operator: validatesJSONPath
      target: "$.status"
      targetValue: "healthy"
  locations:
    - aws:us-east-1
    - aws:us-west-2
    - aws:eu-west-1
    - aws:ap-northeast-1
  options:
    tick_every: 60
    min_failure_duration: 120
    min_location_failed: 2
    retry:
      count: 2
      interval: 500
    monitor_options:
      renotify_interval: 120
      escalation_message: "API still failing after 2 hours"
      include_tags: true

# Good - Multi-step API test
apiVersion: datadoghq.com/v1alpha1
kind: DatadogSynthetic
metadata:
  name: api-auth-flow
  namespace: monitoring
spec:
  name: "API Authentication Flow"
  type: api
  subtype: multi
  status: live
  steps:
    - name: "Login"
      subtype: http
      request:
        method: POST
        url: "https://api.example.com/auth/login"
        body: |
          {
            "email": "synthetic@example.com",
            "password": "{{SYNTHETIC_PASSWORD}}"
          }
        headers:
          Content-Type: "application/json"
      assertions:
        - type: statusCode
          operator: is
          target: 200
        - type: body
          operator: validatesJSONPath
          target: "$.token"
      extractedValues:
        - name: AUTH_TOKEN
          type: http_body
          field: "$.token"

    - name: "Get User Profile"
      subtype: http
      request:
        method: GET
        url: "https://api.example.com/users/me"
        headers:
          Authorization: "Bearer {{AUTH_TOKEN}}"
      assertions:
        - type: statusCode
          operator: is
          target: 200
        - type: body
          operator: validatesJSONPath
          target: "$.email"
          targetValue: "synthetic@example.com"

    - name: "Logout"
      subtype: http
      request:
        method: POST
        url: "https://api.example.com/auth/logout"
        headers:
          Authorization: "Bearer {{AUTH_TOKEN}}"
      assertions:
        - type: statusCode
          operator: is
          target: 204
  locations:
    - aws:us-east-1
    - aws:eu-west-1
  options:
    tick_every: 300

Browser Tests¶

// Good - Datadog browser test
const { synthetics } = require("@datadog/datadog-ci");

module.exports = {
  name: "User Login Journey",
  type: "browser",
  status: "live",
  message: "Login flow failed - investigate immediately",
  tags: ["env:production", "team:frontend", "journey:login"],

  locations: ["aws:us-east-1", "aws:eu-west-1", "aws:ap-southeast-1"],

  options: {
    tick_every: 300,
    min_failure_duration: 180,
    min_location_failed: 2,
    device_ids: ["chrome.laptop_large", "firefox.laptop_large"],
    retry: {
      count: 1,
      interval: 1000,
    },
    ci: {
      executionRule: "blocking",
    },
  },

  steps: [
    {
      name: "Navigate to login page",
      type: "goToUrl",
      params: {
        url: "https://app.example.com/login",
      },
      assertions: [
        {
          type: "currentUrl",
          operator: "contains",
          target: "/login",
        },
      ],
    },
    {
      name: "Enter email",
      type: "typeText",
      params: {
        element: '[data-testid="email-input"]',
        value: "synthetic@example.com",
      },
    },
    {
      name: "Enter password",
      type: "typeText",
      params: {
        element: '[data-testid="password-input"]',
        value: "{{ SYNTHETIC_PASSWORD }}",
      },
    },
    {
      name: "Click login button",
      type: "click",
      params: {
        element: '[data-testid="login-button"]',
      },
    },
    {
      name: "Wait for dashboard",
      type: "waitForElement",
      params: {
        element: '[data-testid="dashboard"]',
        timeout: 10000,
      },
    },
    {
      name: "Verify user is logged in",
      type: "assertElementContent",
      params: {
        element: '[data-testid="user-email"]',
        value: "synthetic@example.com",
      },
    },
    {
      name: "Check page performance",
      type: "customJavascript",
      params: {
        code: `
          const timing = performance.timing;
          const loadTime = timing.loadEventEnd - timing.navigationStart;
          if (loadTime > 3000) {
            throw new Error('Page load time exceeded 3s: ' + loadTime + 'ms');
          }
          return loadTime;
        `,
      },
    },
  ],
};

SSL and DNS Monitoring¶

# Good - SSL certificate monitoring
apiVersion: datadoghq.com/v1alpha1
kind: DatadogSynthetic
metadata:
  name: ssl-certificate-monitor
  namespace: monitoring
spec:
  name: "SSL Certificate Monitor - api.example.com"
  type: api
  subtype: ssl
  status: live
  message: |
    SSL certificate issue detected for api.example.com
    {{#is_alert}}Certificate expires in less than 30 days{{/is_alert}}
    {{#is_warning}}Certificate expires in less than 60 days{{/is_warning}}
  tags:
    - "env:production"
    - "monitor:ssl"
  request:
    host: api.example.com
    port: 443
  assertions:
    - type: certificate
      operator: isInMoreThan
      target: 30
  locations:
    - aws:us-east-1
    - aws:eu-west-1
  options:
    tick_every: 3600
    accept_self_signed: false
---
# Good - DNS monitoring
apiVersion: datadoghq.com/v1alpha1
kind: DatadogSynthetic
metadata:
  name: dns-resolution-monitor
  namespace: monitoring
spec:
  name: "DNS Resolution Monitor - api.example.com"
  type: api
  subtype: dns
  status: live
  message: "DNS resolution failed for api.example.com"
  tags:
    - "env:production"
    - "monitor:dns"
  request:
    host: api.example.com
    dnsServer: 8.8.8.8
    dnsServerPort: 53
  assertions:
    - type: recordSome
      operator: is
      property: A
      target: "10.0.1.100"
    - type: responseTime
      operator: lessThan
      target: 100
  locations:
    - aws:us-east-1
    - aws:us-west-2
    - aws:eu-west-1
  options:
    tick_every: 300

Checkly¶

API Checks¶

// Good - Checkly API check with assertions
const { ApiCheck, AssertionBuilder } = require("checkly/constructs");

new ApiCheck("api-health-check", {
  name: "API Health Check",
  activated: true,
  frequency: 5,
  frequencyOffset: 1,
  degradedResponseTime: 3000,
  maxResponseTime: 10000,

  locations: ["us-east-1", "eu-west-1", "ap-northeast-1"],

  tags: ["api", "production", "critical"],

  request: {
    method: "GET",
    url: "https://api.example.com/health",
    headers: [
      {
        key: "Accept",
        value: "application/json",
      },
      {
        key: "X-Request-Source",
        value: "checkly-synthetic",
      },
    ],
    followRedirects: true,
    skipSSL: false,
    assertions: [
      AssertionBuilder.statusCode().equals(200),
      AssertionBuilder.responseTime().lessThan(500),
      AssertionBuilder.jsonBody("$.status").equals("healthy"),
      AssertionBuilder.jsonBody("$.version").isNotEmpty(),
      AssertionBuilder.header("content-type").contains("application/json"),
    ],
  },

  alertChannels: [
    {
      type: "SLACK",
      config: {
        url: process.env.SLACK_WEBHOOK_URL,
        channel: "#alerts-production",
      },
    },
    {
      type: "PAGERDUTY",
      config: {
        serviceKey: process.env.PAGERDUTY_SERVICE_KEY,
        severity: "critical",
      },
    },
  ],

  doubleCheck: true,
  shouldFail: false,
  useGlobalAlertSettings: false,
  alertSettings: {
    escalationType: "RUN_BASED",
    runBasedEscalation: {
      failedRunThreshold: 2,
    },
    parallelRunFailureThreshold: {
      enabled: true,
      percentage: 50,
    },
    reminders: {
      amount: 3,
      interval: 10,
    },
    sslCertificates: {
      enabled: true,
      alertThreshold: 30,
    },
  },
});

// Good - Multi-step API check
const { ApiCheck, AssertionBuilder } = require("checkly/constructs");

new ApiCheck("api-order-flow", {
  name: "Order Creation Flow",
  activated: true,
  frequency: 15,
  locations: ["us-east-1", "eu-west-1"],

  request: {
    method: "POST",
    url: "https://api.example.com/orders",
    headers: [
      {
        key: "Content-Type",
        value: "application/json",
      },
      {
        key: "Authorization",
        value: "Bearer {{CHECKLY_API_TOKEN}}",
      },
    ],
    body: JSON.stringify({
      items: [
        {
          sku: "TEST-001",
          quantity: 1,
        },
      ],
      customer_id: "synthetic-customer",
    }),
    assertions: [
      AssertionBuilder.statusCode().equals(201),
      AssertionBuilder.jsonBody("$.order_id").isNotEmpty(),
      AssertionBuilder.jsonBody("$.status").equals("pending"),
    ],
  },

  setupScript: `
    const crypto = require('crypto');
    request.headers['X-Idempotency-Key'] = crypto.randomUUID();
    request.headers['X-Request-ID'] = 'checkly-' + Date.now();
  `,

  teardownScript: `
    // Clean up test order
    const orderId = response.body.order_id;
    if (orderId) {
      await fetch('https://api.example.com/orders/' + orderId, {
        method: 'DELETE',
        headers: {
          'Authorization': 'Bearer ' + process.env.CHECKLY_API_TOKEN
        }
      });
    }
  `,
});

Browser Checks¶

// Good - Checkly browser check with Playwright
const { BrowserCheck } = require("checkly/constructs");
const { test, expect } = require("@playwright/test");

new BrowserCheck("checkout-flow", {
  name: "E-commerce Checkout Flow",
  activated: true,
  frequency: 30,
  locations: ["us-east-1", "eu-west-1"],

  tags: ["e2e", "checkout", "critical"],

  code: {
    entrypoint: "./checks/checkout-flow.spec.ts",
  },

  alertChannels: [
    {
      type: "SLACK",
      config: {
        url: process.env.SLACK_WEBHOOK_URL,
      },
    },
  ],
});

// checks/checkout-flow.spec.ts
test("complete checkout flow", async ({ page }) => {
  await page.goto("https://shop.example.com");

  await page.click('[data-testid="product-card"]:first-child');
  await expect(page.locator('[data-testid="product-title"]')).toBeVisible();

  await page.click('[data-testid="add-to-cart"]');
  await expect(page.locator('[data-testid="cart-count"]')).toHaveText("1");

  await page.click('[data-testid="cart-icon"]');
  await page.click('[data-testid="checkout-button"]');

  await page.fill('[data-testid="email"]', "synthetic@example.com");
  await page.fill('[data-testid="card-number"]', "4242424242424242");
  await page.fill('[data-testid="card-expiry"]', "12/30");
  await page.fill('[data-testid="card-cvc"]', "123");

  await page.click('[data-testid="place-order"]');

  await expect(page.locator('[data-testid="order-confirmation"]')).toBeVisible({
    timeout: 30000,
  });

  const orderId = await page
    .locator('[data-testid="order-id"]')
    .textContent();
  console.log(`Order created: ${orderId}`);
});

Checkly CLI Configuration¶

// checkly.config.js
const { defineConfig } = require("checkly");

module.exports = defineConfig({
  projectName: "Production Monitoring",
  logicalId: "production-monitoring",
  repoUrl: "https://github.com/example/app",

  checks: {
    activated: true,
    muted: false,
    runtimeId: "2024.02",
    frequency: 5,
    locations: ["us-east-1", "eu-west-1"],

    tags: ["production"],

    checkMatch: "**/__checks__/**/*.check.{js,ts}",
    browserChecks: {
      testMatch: "**/__checks__/**/*.spec.{js,ts}",
    },
  },

  cli: {
    runLocation: "us-east-1",
    privateRunLocation: "private-dc-1",
  },
});

# Good - Checkly as Code with GitHub Actions
name: Checkly Monitoring

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test-and-deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"

      - name: Install dependencies
        run: npm ci

      - name: Run Checkly tests
        uses: checkly/checkly-action@v1
        with:
          apiKey: ${{ secrets.CHECKLY_API_KEY }}
          accountId: ${{ secrets.CHECKLY_ACCOUNT_ID }}

      - name: Deploy checks
        if: github.ref == 'refs/heads/main'
        run: npx checkly deploy --force
        env:
          CHECKLY_API_KEY: ${{ secrets.CHECKLY_API_KEY }}
          CHECKLY_ACCOUNT_ID: ${{ secrets.CHECKLY_ACCOUNT_ID }}

Uptime and SLO Standards¶

SLO Definitions¶

# Good - SLO configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: service-slos
  namespace: monitoring
data:
  slos.yaml: |
    services:
      api-service:
        availability:
          target: 99.9
          window: 30d
          burn_rate_alerts:
            - severity: critical
              long_window: 1h
              short_window: 5m
              burn_rate: 14.4
            - severity: warning
              long_window: 6h
              short_window: 30m
              burn_rate: 6

        latency:
          target: 99.0
          threshold_ms: 500
          window: 30d

        error_rate:
          target: 99.5
          window: 30d

      payment-service:
        availability:
          target: 99.99
          window: 30d
          burn_rate_alerts:
            - severity: critical
              long_window: 1h
              short_window: 5m
              burn_rate: 14.4

        latency:
          target: 99.5
          threshold_ms: 200
          window: 30d

# Good - Prometheus SLO recording rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-recording-rules
  namespace: monitoring
spec:
  groups:
    - name: slo.rules
      interval: 30s
      rules:
        - record: slo:api_availability:ratio
          expr: |
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))

        - record: slo:api_latency:ratio
          expr: |
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
            /
            sum(rate(http_request_duration_seconds_count[5m]))

        - record: slo:error_budget:remaining
          expr: |
            1 - (
              (1 - slo:api_availability:ratio)
              /
              (1 - 0.999)
            )

    - name: slo.alerts
      rules:
        - alert: SLOBurnRateCritical
          expr: |
            (
              slo:api_availability:ratio < 0.999
              and
              (1 - slo:api_availability:ratio) / (1 - 0.999) > 14.4
            )
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Critical SLO burn rate for API service"
            description: "Error budget is being consumed 14.4x faster than sustainable"

        - alert: ErrorBudgetExhausted
          expr: slo:error_budget:remaining < 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Error budget exhausted for API service"
            description: "Monthly error budget has been fully consumed"

Multi-Region Check Strategy¶

# Good - Multi-region synthetic check configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: synthetic-check-strategy
  namespace: monitoring
data:
  strategy.yaml: |
    global_checks:
      frequency_seconds: 60
      locations:
        primary:
          - us-east-1
          - eu-west-1
          - ap-northeast-1
        secondary:
          - us-west-2
          - eu-central-1
          - ap-southeast-1

      failure_policy:
        min_locations_failed: 2
        consecutive_failures: 3
        alert_delay_seconds: 120

    regional_checks:
      us:
        endpoints:
          - https://api-us.example.com/health
        locations:
          - us-east-1
          - us-west-2
        slo_target: 99.95

      eu:
        endpoints:
          - https://api-eu.example.com/health
        locations:
          - eu-west-1
          - eu-central-1
        slo_target: 99.9

      apac:
        endpoints:
          - https://api-apac.example.com/health
        locations:
          - ap-northeast-1
          - ap-southeast-1
        slo_target: 99.9

CI/CD Integration¶

Chaos in CI/CD Pipeline¶

# Good - GitHub Actions chaos testing
name: Chaos Testing Pipeline

on:
  schedule:
    - cron: "0 10 * * 1-5"
  workflow_dispatch:
    inputs:
      experiment_type:
        description: "Type of chaos experiment"
        required: true
        default: "pod-failure"
        type: choice
        options:
          - pod-failure
          - network-delay
          - cpu-stress
          - memory-stress

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    environment: staging

    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v4
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}

      - name: Verify steady state
        run: |
          echo "Checking pre-experiment steady state..."

          # Check application health
          health=$(kubectl exec -n staging deploy/api-service -- \
            curl -s localhost:8080/health | jq -r '.status')
          if [ "$health" != "healthy" ]; then
            echo "Pre-experiment health check failed"
            exit 1
          fi

          # Check error rate
          error_rate=$(curl -s "http://prometheus:9090/api/v1/query" \
            --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m]))/sum(rate(http_requests_total[5m]))' \
            | jq -r '.data.result[0].value[1]')
          if (( $(echo "$error_rate > 0.01" | bc -l) )); then
            echo "Pre-experiment error rate too high: $error_rate"
            exit 1
          fi

      - name: Apply chaos experiment
        run: |
          cat << EOF | kubectl apply -f -
          apiVersion: chaos-mesh.org/v1alpha1
          kind: PodChaos
          metadata:
            name: ci-chaos-${{ github.run_id }}
            namespace: chaos-testing
          spec:
            action: ${{ inputs.experiment_type || 'pod-failure' }}
            mode: one
            duration: '60s'
            selector:
              namespaces:
                - staging
              labelSelectors:
                app: api-service
          EOF

      - name: Monitor experiment
        run: |
          echo "Monitoring chaos experiment for 90 seconds..."
          for i in $(seq 1 18); do
            sleep 5

            # Check if experiment is still running
            status=$(kubectl get podchaos ci-chaos-${{ github.run_id }} \
              -n chaos-testing -o jsonpath='{.status.experiment.phase}')
            echo "Experiment phase: $status"

            # Check application availability
            available=$(kubectl get deploy api-service -n staging \
              -o jsonpath='{.status.availableReplicas}')
            echo "Available replicas: $available"

            if [ "$available" -lt 1 ]; then
              echo "WARNING: No available replicas"
            fi
          done

      - name: Cleanup experiment
        if: always()
        run: |
          kubectl delete podchaos ci-chaos-${{ github.run_id }} \
            -n chaos-testing --ignore-not-found

      - name: Verify recovery
        run: |
          echo "Waiting 30 seconds for recovery..."
          sleep 30

          # Verify steady state restored
          health=$(kubectl exec -n staging deploy/api-service -- \
            curl -s localhost:8080/health | jq -r '.status')
          if [ "$health" != "healthy" ]; then
            echo "Post-experiment health check failed"
            exit 1
          fi

          echo "System recovered successfully"

      - name: Upload experiment results
        uses: actions/upload-artifact@v4
        with:
          name: chaos-experiment-results
          path: results/

Synthetic Tests in CI/CD¶

# Good - Checkly deployment verification
name: Deploy with Synthetic Verification

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Deploy to production
        run: |
          kubectl apply -f k8s/
          kubectl rollout status deployment/api-service -n production

      - name: Wait for deployment stabilization
        run: sleep 60

      - name: Run Checkly synthetic tests
        uses: checkly/checkly-action@v1
        id: checkly
        with:
          apiKey: ${{ secrets.CHECKLY_API_KEY }}
          accountId: ${{ secrets.CHECKLY_ACCOUNT_ID }}
          filterTags: "deployment-verification"

      - name: Rollback on failure
        if: failure() && steps.checkly.outcome == 'failure'
        run: |
          echo "Synthetic tests failed, rolling back..."
          kubectl rollout undo deployment/api-service -n production
          kubectl rollout status deployment/api-service -n production

      - name: Notify on success
        if: success()
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-Type: application/json' \
            -d '{"text": "Deployment verified successfully with synthetic tests"}'

Best Practices¶

Chaos Engineering Checklist¶

# Good - Pre-experiment checklist
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-experiment-checklist
  namespace: chaos-testing
data:
  checklist.md: |
    # Chaos Experiment Checklist

    ## Pre-Experiment
    - [ ] Hypothesis documented and reviewed
    - [ ] Steady state metrics defined
    - [ ] Blast radius limited and understood
    - [ ] Rollback procedure tested
    - [ ] Stakeholders notified
    - [ ] Observability dashboards ready
    - [ ] On-call engineer aware
    - [ ] Circuit breaker enabled

    ## During Experiment
    - [ ] Steady state being monitored
    - [ ] Abort conditions being checked
    - [ ] Impact being documented
    - [ ] Duration limit enforced

    ## Post-Experiment
    - [ ] Chaos stopped cleanly
    - [ ] System recovered to steady state
    - [ ] Results documented
    - [ ] Findings shared with team
    - [ ] Improvements identified
    - [ ] Follow-up actions created

Synthetic Monitoring Checklist¶

# Good - Synthetic monitoring standards
apiVersion: v1
kind: ConfigMap
metadata:
  name: synthetic-monitoring-standards
  namespace: monitoring
data:
  standards.md: |
    # Synthetic Monitoring Standards

    ## Check Frequency
    | Check Type        | Minimum Frequency | Recommended |
    |-------------------|-------------------|-------------|
    | Health/Heartbeat  | 1 minute          | 30 seconds  |
    | API Endpoints     | 5 minutes         | 1 minute    |
    | Browser Tests     | 15 minutes        | 5 minutes   |
    | SSL Certificates  | 1 hour            | 15 minutes  |
    | DNS Resolution    | 5 minutes         | 1 minute    |

    ## Location Requirements
    - Minimum 3 geographic locations
    - At least 2 locations per major region
    - Include locations closest to user base

    ## Alert Thresholds
    - Single location failure: Warning
    - 2+ location failures: Alert
    - 3+ consecutive failures: Page on-call

    ## Response Time Standards
    - API health checks: < 500ms
    - Page load time: < 3s
    - First contentful paint: < 1.5s
    - Time to interactive: < 5s

    ## SLO Requirements
    - Availability: 99.9% (43.8 min/month downtime)
    - Latency p99: < 500ms
    - Error rate: < 0.1%

Observability During Chaos¶

# Good - Chaos observability dashboard
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-grafana-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  chaos-dashboard.json: |
    {
      "dashboard": {
        "title": "Chaos Engineering Dashboard",
        "panels": [
          {
            "title": "Active Chaos Experiments",
            "type": "stat",
            "targets": [
              {
                "expr": "count(chaos_mesh_experiments{phase='Running'})"
              }
            ]
          },
          {
            "title": "Error Rate During Chaos",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{status=~'5..'}[1m])) / sum(rate(http_requests_total[1m]))",
                "legendFormat": "Error Rate"
              }
            ],
            "alert": {
              "conditions": [
                {
                  "evaluator": { "type": "gt", "params": [0.05] }
                }
              ]
            }
          },
          {
            "title": "P99 Latency During Chaos",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))",
                "legendFormat": "P99 Latency"
              }
            ]
          },
          {
            "title": "Pod Availability",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(kube_deployment_status_replicas_available) by (deployment)",
                "legendFormat": "{{ deployment }}"
              }
            ]
          },
          {
            "title": "Experiment Timeline",
            "type": "annotations",
            "targets": [
              {
                "expr": "changes(chaos_mesh_experiments{phase='Running'}[1m])",
                "legendFormat": "Chaos Events"
              }
            ]
          }
        ]
      }
    }

Quick Reference¶

Chaos Mesh Commands¶

# Install Chaos Mesh
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh

# Apply experiment
kubectl apply -f pod-chaos.yaml

# Check experiment status
kubectl get podchaos -A

# Delete experiment (emergency stop)
kubectl delete podchaos --all -A

# View Chaos Mesh dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333

Litmus Commands¶

# Install Litmus
helm install chaos litmuschaos/litmus -n litmus

# Apply chaos engine
kubectl apply -f chaos-engine.yaml

# Check experiment status
kubectl get chaosengine -A
kubectl get chaosresult -A

# Delete all experiments
kubectl delete chaosengine --all -A

Checkly CLI Commands¶

# Install Checkly CLI
npm install -g checkly

# Login
npx checkly login

# Test checks locally
npx checkly test

# Deploy checks
npx checkly deploy

# Trigger check run
npx checkly trigger --check api-health-check

Datadog CLI Commands¶

# Install Datadog CI
npm install -g @datadog/datadog-ci

# Run synthetic tests
datadog-ci synthetics run-tests --search 'tag:production'

# Upload test results
datadog-ci synthetics upload-application --app-key $DD_APP_KEY

# Trigger specific test
datadog-ci synthetics trigger-ci --public-id abc-123-xyz