Autonomous Infrastructure Runbooks

memory_pressure_oom

Prometheus K8sEvents Datadog CloudWatch Alertmanager GrafanaAlerting

Linux host Critical GA v0.1.23

Detects when

Detects OOM-kill events and sustained memory pressure (≥90% RSS), classifies root cause across 5 classes (memory_leak / traffic_spike / undersized_limit / noisy_neighbor / runtime_misconfig), and autonomously remediates within hard guardrails — max 3 restarts/hour, max 2× limit bump, never touch stateful workloads without allowlist.

Auto-fixes by

Diagnose in parallel: cgroup memory.stat, /proc/meminfo, top-N processes by RSS, container memory.limit vs. usage, recent OOM-kill dmesg entries, k8s events, JVM jcmd GC.heap_info, Node.js heap snapshot endpoint
Classify root cause (first match): memory_leak (RSS climb >2h, no traffic correlation), traffic_spike (RSS climb correlates with rps), undersized_limit (p99 RSS >80% of limit, steady-state), noisy_neighbor (node-level pressure, victim well-behaved), runtime_misconfig (JVM/Node heap > container limit)
memory_leak: capture heap snapshot via jcmd GC.heap_dump or SIGUSR1 for artifact; restart with exponential backoff (30s × 2^N, cap 300s); kubectl rollout restart for k8s, systemctl restart for bare-metal
traffic_spike: if HPA available, annotate deployment to trigger scale-out; else patch memory.limit upward (max 2× current, max policy-allowed); dry-run=server validation before applying
undersized_limit: propose limit increase via PR to manifest repo with suggested value (p99 usage × 1.5); never auto-apply
noisy_neighbor: cordon node, drain non-critical pods (--ignore-daemonsets --delete-emptydir-data --grace-period=60); page if drain fails
runtime_misconfig: propose corrected -Xmx / --max-old-space-size with 25% cgroup headroom; never auto-apply

Safety gates

Max 3 auto-restarts per service per hour (in-process rolling counter)
Memory limit bump capped at 2× current value — never exceed MAX_LIMIT_BUMP_MULTIPLIER
Never touch StatefulSet workloads without explicit BYTEPORT_STATEFUL_ALLOWLIST entry
undersized_limit and runtime_misconfig classifications always result in PR proposal, never auto-apply
Classify confidence must be ≥0.70 before acting autonomously — lower confidence escalates
noisy_neighbor: drain fails → page immediately, never force-delete
Verify: RSS must drop below 70% of limit within 10 min, no new OOM-kills, error rate normalized

Escalates with

Postmark alert with service, classification, confidence score, heap snapshot path (if captured), cgroup limit vs. usage, top-5 RSS consumers, k8s node state, and whether action was taken or blocked by guardrail. GitHub issue includes dmesg OOM log, full diagnostic bundle, and classification rationale. 1h dedup per service.

RBAC permissions

core/pods (read pod spec, memory limits, kubectl top pod) core/pods/exec (kubectl top requires exec for Metrics Server) core/namespaces (read byteport.io annotations) core/nodes (read MemoryPressure condition, cordon/drain) apps/deployments (rollout restart, annotate HPA trigger) autoscaling/horizontalpodautoscalers (check HPA existence)

Slack example

🔴 *memory_pressure_oom* | service: `payments-api` | class: `memory_leak` | confidence: 0.82 | action: heap snapshot captured → pod restarted (backoff 30s) | RSS: 93% → 48% | resolved in 4m12s | 🔗 /runbooks/memory_pressure_oom

Use this runbook →

shell

# 1. Check cgroup memory usage
cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/memory.max

# 2. Top RSS consumers
ps aux --sort=-%mem | head -10

# 3. OOM-kill log
dmesg --level=err,crit | grep -i 'oom\|out of memory' | tail -5

# 4. K8s pod memory
kubectl top pod POD -n NAMESPACE
kubectl get events -n NAMESPACE --field-selector reason=OOMKilling

# 5. JVM heap info
jcmd PID GC.heap_info

# 6. Restart with backoff (systemctl)
systemctl restart SERVICE

# 7. K8s rolling restart
kubectl rollout restart deployment/DEPLOY -n NAMESPACE

disk_space_auto_reclaim

Prometheus CloudWatch GCP K8sEvents AlertmanagerWebhook

Linux host Critical GA v0.1.21

Detects when

Autonomous full disk-space reclaim: parallel diagnose (df/inode/du/docker/journald), rotate+compress logs (logrotate+gzip>100MB+journal vacuum 7d), evict container layers (docker system prune or crictl rmi --prune on K8s nodes), drop expired tmp+apt/yum/pip/npm caches, and optionally grow cloud-managed volumes (EBS ModifyVolume / GCP disks.resize +25%, capped at policy.maxVolumeGb). Stops as soon as disk usage drops below 80%. Escalates with +N GB recommendation based on current excess and top-5 largest dirs if usage still >85%.

Auto-fixes by

Diagnose in parallel: df -h + df -i (inodes), top-10 space consumers via du -sh, docker system df, journald --disk-usage
Rotate+compress logs: force logrotate, gzip files >100MB older than 24h in /var/log, journalctl --vacuum-time=7d
Evict container layers: docker system prune -af --filter until=72h + unused volumes (bare-metal/VMs); crictl rmi --prune on K8s nodes
Drop expired caches: /tmp + /var/tmp files older than 7 days, apt/yum/dnf package cache, pip cache purge, npm cache clean --force
Grow cloud volume (opt-in, policy.allowVolumeGrow=true): EBS ModifyVolume / GCP disks.resize +25% capped at policy.maxVolumeGb, then growpart + resize2fs/xfs_growfs

Safety gates

Stop-on-floor: re-reads disk % after each step; exits early when usage drops below freeSpaceFloorPct (default 80%)
Volume grow cap: policy.maxVolumeGb (default 500 GB) — never grows beyond the configured ceiling
Volume grow opt-in: policy.allowVolumeGrow defaults to false; cloud grow only fires when explicitly enabled
BYTEPORT_RUNBOOK_DRY_RUN=true logs all intended actions without executing any shell commands
Escalates with sized recommendation (+N GB) when usage remains above escalateAbovePct (default 85%) after all steps

Escalates with

Postmark + Slack escalation when usage exceeds escalateAbovePct (default 85%) after all reclaim steps. Escalation includes: current disk % per mount, total bytes reclaimed per step, top-5 largest directories, estimated GB needed based on current excess (+N GB recommendation), and whether volume grow was skipped (and why).

RBAC permissions

filesystem read (df, df -i, du — read-only diagnostics) docker socket (docker system prune — container runtime, opt-in) crictl (crictl rmi --prune — K8s containerd, opt-in) package manager (apt-get clean / yum clean all / pip cache purge / npm cache clean — write, opt-in) AWS EC2 ModifyVolume API (EBS grow, requires ec2:ModifyVolume IAM permission, opt-in) GCP Compute Engine disks.resize API (GCP grow, requires compute.disks.resize permission, opt-in)

Slack example

🚨 *disk_space_auto_reclaim* | host: `prod-api-03` | disk: `92% → 71%` on `/` | reclaimed: 18.3 GB (logs: 4.2 GB, containers: 11.1 GB, caches: 3.0 GB) | 🔗 https://app.byteport.polsia.app/runbooks/disk_space_auto_reclaim

Use this runbook → ▶ Watch it work

typescript

import { DiskSpaceAutoReclaimRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createPrometheusAdapter } from "byteport-agent/adapters/prometheus";

const adapter = createPrometheusAdapter({
  url: process.env.PROMETHEUS_URL!,
  // Fires disk_pressure, inode_exhaustion, volume_full, disk_almost_full
});

const registry = new RunbookRegistry();
registry.register(new DiskSpaceAutoReclaimRunbook({
  freeSpaceFloorPct: 80,
  escalateAbovePct: 85,
  isK8sNode: false,
  policy: {
    allowVolumeGrow: true,   // set false to disable cloud volume expansion
    maxVolumeGb: 500,        // safety cap
  },
}));

const agent = createAgent({
  adapters: [adapter],
  runners: [registry.asRunner("runbook")!],
  dryRun: false,
});

await agent.run();

container_restart_loop_breaker

K8sEvents Datadog Prometheus AlertmanagerWebhook

Containers Critical GA v0.1.19

Detects when

Breaks container restart loops by diagnosing root cause from logs, exit codes, and image SHA — then applying the correct class-specific fix: escalate on config/secret errors, chain on dependency failures, bump memory on OOM, rollback on port-bind or image-pull failures.

Auto-fixes by

Fetch last 200 log lines + last 5 exit codes + restart count + image SHA for diagnosis
Classify failure: config_error | dependency_down | oom_at_startup | port_conflict | image_pull_failure | unknown
Config error: check secret existence in orchestrator, surface exact diff — escalate, never auto-restore
Dependency down: TCP health-check named upstreams, defer and chain incident if dependency is itself down
OOM at startup: bump memory +25% (capped 2× original) via kubectl patch, redeploy, verify rollout
Port bind / image pull: kubectl rollout undo (K8s) or Render Deploys API rollback to last-green SHA
Unknown: escalate with structured incident report (logs, exit codes, image SHA, restart count)

Safety gates

Hard cap: 1 auto-remediation per container per hour — rejects all subsequent triggers within the window
OOM memory cap: never bumps beyond 2× original limit (configurable oomMemoryCapMultiplier)
Config error: never auto-restores secrets — always surfaces diff and escalates
Dependency chain: defers remediation when the upstream dependency is itself down (avoids masking)
BYTEPORT_RUNBOOK_DRY_RUN=true logs all intended actions without executing
Every action emits a structured audit event with failure class, action taken, and outcome

Escalates with

Postmark + Slack escalation with full diagnostic bundle: logs excerpt, exit codes, image SHA, failure class, secret diff (config_error), upstream health results (dependency_down), or structured incident report (unknown). PagerDuty Events v2 if remediation fails or hourly cap is hit. 24h dedup by container key (name@host).

RBAC permissions

core/pods/log (read previous container logs) core/pods (describe for exit code and restart count) apps/deployments (patch memory limits, rollout undo, rollout status) core/secrets (read to verify secret existence — never write) Render: API key with Deploy + Read access (RENDER_API_KEY + RENDER_SERVICE_ID)

Slack example

🚨 *container_restart_loop* | container: `api-server` | namespace: `production` | failure_class: `oom_at_startup` | exit_code: `137 (OOMKilled)` | restarts: 6 | action: memory bumped 256Mi → 320Mi, rollout verified | 🔗 https://byteport.polsia.app/runbooks/container_restart_loop_breaker

Use this runbook →

typescript

import { ContainerRestartLoopBreakerRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createK8sEventsAdapter } from "byteport-agent/adapters/k8s-events";

const adapter = createK8sEventsAdapter({
  namespace: "production",
  // Fires container_restart_loop when restartCount > 5 in 10 min
});

const registry = new RunbookRegistry();
registry.register(new ContainerRestartLoopBreakerRunbook({
  // Post-remediation health check
  healthUrl: process.env.BYTEPORT_HEALTH_URL,
  healthWaitSec: 300, // hold for 5 min
  // OOM memory bump config
  oomMemoryBumpPct: 25,       // +25% on OOM
  oomMemoryCapMultiplier: 2.0, // never exceed 2× original
  // Render rollback credentials (used for port/image-pull class)
  renderServiceId: process.env.RENDER_SERVICE_ID,
  renderApiKey: process.env.RENDER_API_KEY,
}));

const agent = createAgent({
  adapters: [adapter],
  runners: [registry.asRunner("runbook")!],
  dryRun: false,
});

await agent.run();

cpu_saturation

Datadog CloudWatch GCP

Linux host Warning GA v0.1.3

Detects when

Detects sustained CPU saturation on Linux hosts (>85% for 90 continuous seconds) and safely restarts non-critical high-CPU processes — with oom_score_adj tuning as an intermediate step before full restart.

Auto-fixes by

Read /proc/stat to confirm sustained high CPU (distinguish spike from sustained load)
Identify top CPU consumers via ps aux --sort=-%cpu | head -5
Check for byteport.restart=safe label on identified processes — only restarts opted-in services
Apply oom_score_adj reduction of -200 to -600 on non-critical high-CPU processes to reduce scheduling priority
For memory-leak-driven CPU: identify via vmstat and suggest memory limit reduction
Restart via systemd/service only if CPU does not drop below 80% within 2 minutes of oom_score_adj change

Safety gates

Requires byteport.io/auto-remediate: "cpu" annotation on the host
Never restarts a process without the byteport.restart=safe label
Requires sustained 90s threshold — one-time CPU spike does not trigger
oom_score_adj reduction is tried first (non-disruptive) before restart
Skipped if priorityclass.kubernetes.io/no-autoremediation: "true" on the process namespace
5-minute global cooldown per host between restart attempts

Escalates with

Postmark alert with: top CPU consumers, vmstat summary, oom_score_adj changes made, restart outcome, and suggested long-term fix (resource limits, hpa configuration, or process replacement). GitHub issue for repeated CPU saturation events on the same host.

RBAC permissions

core/pods (read byteport.restart=safe label and priority class on container) core/namespaces (read byteport.io/auto-remediate annotation) sysadmin: /proc/stat (read CPU metrics) sysadmin: systemctl (restart services with safe label)

Slack example

⚠️ *cpu_saturation* | host: `prod-api-03` | avg CPU: 91% for 3 min | top process: `node /app/server.js` (PID 18841) | action: oom_score_adj -300 applied | 🔗 /runbooks/cpu_saturation

Use this runbook →

shell

# 1. Confirm sustained CPU saturation
vmstat 1 5
# Check %cpu column — must stay >85 across all 5 samples

# 2. Identify top CPU consumers
ps aux --sort=-%cpu | head -10

# 3. Check safe-restart label on processes
kubectl get pods -A -l 'byteport.restart=safe' -o wide

# 4. Reduce scheduling priority with oom_score_adj (non-disruptive)
echo -400 | sudo tee /proc/PID/oom_score_adj

# 5. Restart safe process if CPU stays critical
systemctl restart critical-app.service

# 6. Verify CPU recovery
sleep 120 && vmstat 1 3

disk_pressure

Datadog CloudWatch GCP

Linux host Warning GA v0.1.1

Detects when

Auto-remediates disk pressure on Linux hosts: log rotation, Docker prune, /tmp cleanup. Threshold: disk_usage > 85% warning, > 95% critical.

Auto-fixes by

Rotate journal logs: journalctl --vacuum-size=50MB
Delete rotated logs /var/log/*.gz older than 7 days
Run docker system prune -af to remove dangling images and stopped containers
Clear /tmp entries unused for over 48 hours
Verify disk drops below 80% threshold within 2 minutes — otherwise escalates

Safety gates

Requires byteport.io/auto-remediate: "disk" annotation on the host/node resource
Dry-run mode: --dry-run logs what would be pruned without executing
Global cooldown: 5-minute minimum between remediation cycles on the same host
Never deletes /var/log/lastlog, /var/log/wtmp, or active audit logs

Escalates with

GitHub issue (labeled byteport + severity:warning) + Postmark alert with host, mount point, current %, trend over last 5 min, full action audit log, and 3 recommended next steps.

RBAC permissions

sysadmin: journalctl --vacuum-size (root or journal group) sysadmin: rm (delete rotated logs in /var/log) sysadmin: docker (docker system prune)

Slack example

⚠️ *disk_pressure* | host: `prod-api-03` | `/var/log` at 91% | action: journal vacuum + docker prune | freed: 4.2 GB | now at 67% | 🔗 /runbooks/disk_pressure

Use this runbook →

shell

# 1. Inspect disk pressure
df -h | grep -v tmpfs | awk '$5+0 > 85 {print}'

# 2. Rotate journal logs
journalctl --vacuum-size=50MB

# 3. Prune Docker artifacts
docker system prune -af

# 4. Clear stale /tmp
find /tmp -type f -atime +2 -delete

# 5. Verify resolution
sleep 120 && df -h | grep -v tmpfs

disk-pressure-auto-cleanup

Datadog CloudWatch GrafanaAlerting PagerDuty

Linux host Critical GA v0.1.16

Detects when

End-to-end autonomous disk cleanup: diagnoses current disk usage and top space consumers, then runs ordered remediation steps — log rotation, journald vacuum, /tmp sweep, optional Docker prune, optional package cache prune — stopping as soon as usage drops below the configurable free-space floor (default 80%). Escalates with a per-step bytes-reclaimed summary and residual top-consumer paths if the floor is not restored after all safe steps.

Auto-fixes by

Diagnose: read mount usage % and top 10 space consumers via du on known-safe roots
Rotate + gzip logs older than N days (default: 7d) in /var/log — configurable allowlist; hard safe-list protects /var/log/lastlog, /var/log/wtmp, active audit logs
journalctl --vacuum-size=200M (configurable) to trim systemd journal
Clear /tmp and /var/tmp files older than N hours (default: 24h)
Docker system prune -af (opt-in: dockerPruneEnabled=true) — prunes dangling images and build cache if Docker socket present
apt-get clean / yum clean all / dnf clean all (opt-in: packageCachePruneEnabled=true) — clears OS package manager caches
Stop-on-floor: re-reads disk usage after each step; stops as soon as usage drops below freeSpaceFloorPct (default: 80%)
Escalate with per-step bytes-reclaimed, total freed, and residual top-consumer paths if floor not restored

Safety gates

Hard safe-list: /var/log/lastlog, /var/log/wtmp, /var/log/btmp, /var/log/audit/audit.log, active access.log/error.log/auth.log — never touched
Docker prune disabled by default (dockerPruneEnabled=false) — operator must explicitly opt in
Package cache prune disabled by default (packageCachePruneEnabled=false) — operator must explicitly opt in
Dry-run mode: BYTEPORT_RUNBOOK_DRY_RUN=true logs all intended actions without executing any writes
Stop-on-floor: never runs later steps if earlier steps already resolved the pressure
24h per-host cooldown enforced by RunbookCooldown layer — prevents runbook storm on repeated signals

Escalates with

Postmark + GitHub Issue (labeled byteport + severity:critical + disk_pressure) with: host, mount point, pre/post disk %, per-step bytes reclaimed, total freed, residual top consumers (du output), and suggested next steps. 24h dedup per host. Escalates only after all configured safe steps are exhausted.

RBAC permissions

sysadmin: journalctl --vacuum-size (root or journal group) sysadmin: gzip + rm (rotate logs in /var/log — root or log group) sysadmin: find/rm on /tmp and /var/tmp sysadmin: docker system prune (optional, requires Docker socket access) sysadmin: apt-get/yum/dnf clean (optional, requires package manager access)

Slack example

✅ *disk-pressure-auto-cleanup* | host: `prod-api-03` | `/` at 92% → 71% | freed: 6.8 GB (logs: 4.1 GB, journal: 2.1 GB, /tmp: 600 MB) | steps: rotate_logs + journal_vacuum | floor restored | 🔗 /runbooks/disk-pressure-auto-cleanup

Use this runbook →

typescript

import { DiskPressureAutoCleanupRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createDatadogMetricsAdapter } from "byteport-agent/adapters/datadog-metrics";

const adapter = createDatadogMetricsAdapter();

const registry = new RunbookRegistry();
registry.register(new DiskPressureAutoCleanupRunbook({
  freeSpaceFloorPct: 80,          // stop when usage drops below 80%
  logRotationAgeDays: 7,          // rotate logs older than 7 days
  journalVacuumSize: "200M",      // trim journal to 200MB
  tmpMaxAgeHours: 24,             // sweep /tmp files older than 24h
  dockerPruneEnabled: false,      // set true to opt in
  packageCachePruneEnabled: false, // set true to opt in
}));

const agent = createAgent({
  adapters: [adapter],
  runners: [registry.asRunner("runbook")!],
  dryRun: true,
  thresholds: { disk_pressure: 85 },
});

await agent.run();

memory_pressure

Datadog CloudWatch GCP

Linux host Critical GA v0.1.1

Detects when

Detects host memory saturation (>90% for 90 seconds) and distinguishes page-cache pressure from actual memory exhaustion before safely restarting leaking processes.

Auto-fixes by

Inspect /proc/meminfo and cgroup stats to identify top resident-memory processes
Check for byteport.restart=safe label on processes — only restarts opted-in services
Attempt oom_score_adj reduction on non-critical heavy consumers to buy headroom
Restart identified leaking process via its configured init system (systemd/service)
Log post-restart memory baseline for comparison in auto-generated postmortem

Safety gates

Requires byteport.io/auto-remediate: "memory" annotation on the host
Refuses to restart any process without the byteport.restart=safe label
Pre-restart audit captures process tree, open FD count, and network connections
Post-restart verification waits 30s then checks memory returned below 85%
Escalates if restart fails or memory does not recover within 2 minutes

Escalates with

Postmark alert with host, top memory consumers, swap pressure, restart attempt result, and pre/post audit snapshot. GitHub issue includes cgroup path, RSS, and suggested memory.request / memory.limit adjustments.

RBAC permissions

sysadmin: /proc/meminfo (read memory stats) sysadmin: /proc/PID/oom_score_adj (adjust scheduling priority) sysadmin: systemctl (restart services with safe label)

Slack example

🔴 *memory_pressure* | host: `prod-api-01` | memory: 94% for 90s | leak: `python3 worker.py` (RSS 2.1 GB) | action: oom_score_adj -600 → memory recovered to 71% | 🔗 /runbooks/memory_pressure

Use this runbook →

shell

# 1. Check memory pressure
cat /proc/meminfo | grep -E 'MemAvailable|MemFree|SwapFree'

# 2. Find top RSS processes
ps aux --sort=-%mem | head -10

# 3. Read cgroup memory stats
cat /sys/fs/cgroup/memory/memory.usage_in_bytes

# 4. Reduce oom_score_adj (value between -1000 and 1000)
echo -600 | sudo tee /proc/PID/oom_score_adj

# 5. Restart via systemd (requires label)
systemctl restart target.service

memory_pressure_auto_recover

Prometheus Datadog CloudWatch GrafanaAlerting K8sEvents

Linux host Critical GA v0.1.18

Detects when

Detects runaway memory consumers and autonomously restarts the offending service with deploy-correlation hand-off and crash-loop guard. Works across Render, Kubernetes, and bare-metal/systemd.

Auto-fixes by

Identify top 5 memory consumers via ps aux --sort=-%mem + cgroup memory.current
Classify consumer: managed service (known to BytePort) vs unmanaged/system process
Deploy correlation: if release shipped <60 min ago on the top consumer → hand off to failed-deploy-auto-rollback instead
Crash-loop guard: abort and escalate if service restarted ≥2× in last 30 min
Graceful restart via best-available method: Render Deploys API / kubectl rollout restart / systemctl restart
Post-restart memory verify: re-sample after 3 s stabilization, confirm below recoveryThresholdPct (default 80%)

Safety gates

Crash-loop guard: if same service has been auto-restarted ≥2× in 30 min → STOP and escalate with restart history
Deploy hand-off: if a release fired within 60 min → escalate to FailedDeployAutoRollbackRunbook, not restart
BYTEPORT_RUNBOOK_DRY_RUN=true logs all intended actions without executing any restart
Render restart only executes when renderServiceId + renderApiKey are configured
Falls back to structured escalation when restart method cannot be determined (unknown env)
recoveryThresholdPct configurable (default 80%) — escalates if post-restart memory remains elevated

Escalates with

Postmark + Slack escalation with full top-5 memory attribution table, restart history and crash-loop count, recent deploy correlation evidence, suspected leak signature, and runtime env classification. PagerDuty Events v2 if crash-loop guard triggered or post-restart memory still elevated. 24h dedup by service key (name@host).

RBAC permissions

systemd: systemctl restart on the target service unit Kubernetes: apps/deployments/rollout (kubectl rollout restart) Render: API key with Deploy + Read access (RENDER_API_KEY + RENDER_SERVICE_ID)

Slack example

🔴 *memory_pressure_auto_recover* | host: `prod-api-01` | usage: 94% → 61% | top consumer: `api-server` (1.2 GB RSS) | action: systemctl restart api-server | TTR: 28 s | 🔗 https://byteport.polsia.app/runbooks/memory_pressure_auto_recover

Use this runbook →

typescript

import { MemoryPressureAutoRecoverRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createPrometheusAdapter } from "byteport-agent/adapters/prometheus";

const adapter = createPrometheusAdapter({
  endpoint: "http://prometheus:9090",
  // memory_pressure signal fires when MemAvailable < 10%
});

const registry = new RunbookRegistry();
registry.register(new MemoryPressureAutoRecoverRunbook({
  // Render-native restart
  renderServiceId: process.env.RENDER_SERVICE_ID,
  renderApiKey: process.env.RENDER_API_KEY,
  // Safety: stop if restarted >=2x in 30 min
  crashLoopMaxRestarts: 2,
  crashLoopWindowMs: 30 * 60 * 1000,
  // Hand off to rollback if deploy in last 60 min
  deployLookbackMinutes: 60,
  // Confirm memory below 80% before declaring resolved
  recoveryThresholdPct: 80,
}));

const agent = createAgent({
  adapters: [adapter],
  runners: [registry.asRunner("runbook")!],
  dryRun: false,
});

await agent.run();

crashloopbackoff_v2

K8sEvents Prometheus Alertmanager DatadogMonitors GrafanaAlerting PagerDuty

Kubernetes Critical GA v0.1.30

Detects when

Detects CrashLoopBackOff and pod restart-loop incidents (10th pager class). 5-branch classifier: config_error_missing_env, image_pull_failure, liveness_probe_misconfigured, oom_at_startup, dependency_not_ready. Autonomous actions: rollback bad image tag, bump memory limit 1.5×, loosen probe timing, hold pod and wait for dependency healthcheck. Escalates on unknown classification or when guardrail ceilings are hit.

Auto-fixes by

Detect: pod status.containerStatuses[].state.waiting.reason == CrashLoopBackOff; restart count ≥3 (warn), ≥10 (high), ≥30 (critical); exit code 137 for pod_not_ready_oom; Unhealthy events for probe failures
Diagnose: kubectl logs --previous (last 2000 chars), kubectl describe pod events (ImagePullBackOff, configmap/secret not found, Unhealthy), liveness/readiness probe spec (initialDelaySeconds, failureThreshold), ConfigMap/Secret existence, node MemoryPressure condition
Classify (first-match, confidence ≥0.70): oom_at_startup (exit 137, conf=0.93 → bump memory 1.5×); image_pull_failure (ImagePullBackOff in events, conf=0.94 → rollback); config_error_missing_env (ConfigMap/Secret not found, conf=0.91 → audit + surface); liveness_probe_misconfigured (probe too tight + healthy startup logs, conf=0.86 → loosen probe); dependency_not_ready (exit 0/1 + connection-refused patterns, conf=0.88 → hold + healthcheck)
Remediate: oom_at_startup → kubectl patch Deployment memory limit 1.5× (ceiling 2×, max 1/pod/24h); image_pull_failure → kubectl rollout undo Deployment (max 1/service/hour); config_error_missing_env → kubectl get configmaps/secrets audit, surface missing names (never auto-creates Secrets); liveness_probe_misconfigured → kubectl patch Deployment livenessProbe.initialDelaySeconds=max(current×2,30) failureThreshold=5; dependency_not_ready → kubectl annotate pod with dependency-hold-since, probe dependency endpoints
Verify: pod reaches Ready within 5-min stabilisation window; restart count must not increment; no Unhealthy probe events
Escalate: unknown classification after 2 passes; memory ceiling hit; dependency healthcheck timeout; rollback rate limit reached

Safety gates

Max 1 rollback per service per 1-hour window (_rollback_history in-process dict)
Max 1 memory limit bump per pod per 24h (_memory_bump_history in-process dict)
Memory bump ceiling: never exceed 2× original limit (OOM_LIMIT_BUMP_CEILING_MULTIPLIER=2.0)
Never auto-create Secrets — surfaces missing key names to operator, requires manual action
byteport.io/skip-crashloop-remediation=true annotation skips all actions
Minimum confidence 0.70 before acting autonomously
Never delete PersistentVolumeClaims or PersistentVolumes
BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run: all kubectl patch/rollout/annotate calls are no-ops

Escalates with

Postmark + Slack escalation bundle: pod/namespace/node, signal, classification, confidence, restart count, exit code, previous logs snippet. GitHub issue with full kubectl describe pod + event timeline + classification rationale. 24h dedup per pod + classification.

RBAC permissions

core/pods (get, list — read container status, waiting reason, exit code, restart count) core/pods (annotate — dependency-hold annotation) apps/deployments (get, patch — memory limit bump, probe patch, rollout undo) core/events (list — image pull errors, configmap/secret not found, Unhealthy events) core/nodes (get — MemoryPressure condition check) core/configmaps (list — audit existing ConfigMaps for config_error class) core/secrets (list — audit secret names only, never values, for config_error class)

Slack example

🔁 *crashloopbackoff_v2* | pod: `production/api-7d9f-xk2p9` | signal: `crashloopbackoff` | restarts: 4 | class: `dependency_not_ready` | conf: 0.88 | action: dependency-hold annotated, waiting for postgres healthcheck | 🔗 /runbooks/crashloopbackoff_v2

Use this runbook →

kubectl

# 1. Check CrashLoopBackOff state
kubectl get pod POD -n NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
kubectl get pod POD -n NAMESPACE -o jsonpath='{.status.containerStatuses[0].restartCount}'

# 2. Get previous container logs
kubectl logs POD -n NAMESPACE --previous --tail=100

# 3. Check events for image pull errors and missing config
kubectl get events -n NAMESPACE --field-selector involvedObject.name=POD

# 4. Check liveness probe settings
kubectl get pod POD -n NAMESPACE -o jsonpath='{.spec.containers[0].livenessProbe}'

# 5. OOM at startup: bump memory limit
kubectl patch deployment DEPLOYMENT -n NAMESPACE \
  --type=strategic \
  --patch='{"spec":{"template":{"spec":{"containers":[{"name":"CONTAINER","resources":{"limits":{"memory":"384Mi"}}}]}}}}'

# 6. Image pull failure: rollback
kubectl rollout undo deployment/DEPLOYMENT -n NAMESPACE

# 7. Probe misconfigured: loosen initialDelay
kubectl patch deployment DEPLOYMENT -n NAMESPACE \
  --type=strategic \
  --patch='{"spec":{"template":{"spec":{"containers":[{"name":"CONTAINER","livenessProbe":{"initialDelaySeconds":30,"failureThreshold":5}}]}}}}'

# 8. Dependency hold: annotate pod
kubectl annotate pod POD -n NAMESPACE byteport.io/dependency-hold-since=$(date -u +%Y-%m-%dT%H:%M:%SZ)

oomkilled_memory_pressure

K8sEvents Prometheus Datadog Alertmanager GrafanaAlerting PagerDuty

Kubernetes Critical GA v0.1.29

Detects when

Detects OOMKilled pods and pod memory pressure across 6 signals, classifies root cause across 5 classes (genuine_memory_leak, undersized_limit, traffic_spike, node_pressure, runtime_misconfig), and autonomously bumps memory limits within safe bounds, triggers HPA scale-out, cordons/drains memory-pressured nodes, or surfaces JVM/Go/Node heap misconfig as a PR proposal. Max 1 limit bump per pod per 24h. Never auto-bumps on genuine memory leaks — escalates with heap profile instead.

Auto-fixes by

Detect: kubectl exit reason OOMKilled; NodeCondition MemoryPressure; JVM/Go OOM markers in previous logs; metrics-server working-set vs limit
Diagnose: kubectl describe pod (restart count, memory limit/request, exit reason), HPA discovery, recent deploy timestamp (<30min = deploy regression candidate), runtime hints (JVM -Xmx from JAVA_TOOL_OPTIONS, Go GOMEMLIMIT, Node --max-old-space-size from env), Prometheus series for trend and RPS correlation (BYTEPORT_PROMETHEUS_URL)
Classify (first match, confidence ≥0.70): runtime_misconfig (heap config > container limit, conf=0.93 → PR proposal); node_pressure (NodeCondition MemoryPressure + pod healthy <70%, conf=0.88 → cordon+drain); genuine_memory_leak (rising RSS across 3+ restarts, no traffic correlation, conf=0.85 → escalate, never auto-bump); traffic_spike (RSS correlates with RPS, conf=0.86 → HPA then limit bump); undersized_limit (RSS plateau ≥85% of limit, no trend, conf=0.84 → bump limit)
Remediate: undersized_limit → kubectl patch Deployment/StatefulSet memory limit (≤2× current, ≤80% of node allocatable, 1 bump/pod/24h); traffic_spike → annotate HPA to trigger scale-out, fallback to limit bump; node_pressure → kubectl cordon then kubectl drain (--ignore-daemonsets, PDB-aware); runtime_misconfig → generate -Xmx/GOMEMLIMIT/NODE_OPTIONS fix as PR description (never auto-applied); genuine_memory_leak → emit escalation bundle, open ticket
Verify: pod reaches Ready; working-set < 80% of new limit; 0 OOMKill events in last 5 min; no cascading evictions on same node; retry once at 15s; escalate on double-fail
Post-mortem: structured record — pod, namespace, node, signal, classification, old/new limit, actions_taken, limit_bumped, hpa_scaled, node_cordoned, escalated bool, time_to_resolve

Safety gates

Max 1 autonomous memory limit bump per pod per 24h (_limit_bump_history in-process dict) — refuses second bump with escalation
Never exceed 2× the original memory limit
Never set limit above 80% of node allocatable memory — prevents cascading OOMs on the node
byteport.io/skip-memory-remediation=true annotation on pod skips all autonomous actions
genuine_memory_leak classification always escalates — never auto-bumps to avoid masking real bugs
runtime_misconfig always surfaces as PR proposal — never auto-applies env var changes
Max 1 concurrent node drain for node_pressure class (_active_node_drains lock)
Minimum confidence 0.70 before acting autonomously
BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run: all kubectl patch/cordon/drain calls are no-ops

Escalates with

Postmark + Slack escalation bundle: pod/namespace/node, signal, classification, confidence, restart count, memory limit vs usage, actions taken or blocked, heap profile path (if genuine_memory_leak), Prometheus series summary. GitHub issue with full kubectl describe pod output + exit reason timeline + classification rationale. 24h dedup per pod + classification.

RBAC permissions

core/pods (get, list — read pod spec, restart count, exit reason, metrics) apps/deployments (get, patch — memory limit bump on Deployment workloads) apps/statefulsets (get, patch — memory limit bump on StatefulSet workloads) autoscaling/horizontalpodautoscalers (get, list, annotate — HPA scale-out trigger) core/nodes (get — read NodeCondition MemoryPressure, allocatable memory; cordon) core/nodes (drain — eviction API; respects PodDisruptionBudgets) core/events (list — OOMKill events, eviction events for verify step)

Slack example

💥 *oomkilled_memory_pressure* | pod: `production/api-7d9f-xk2p9` | signal: `pod_oom_killed` | class: `undersized_limit` | conf: 0.84 | action: memory limit 256Mi → 512Mi on Deployment/api | pod Ready in 1m43s | working-set 42% of new limit ✓ | 🔗 /runbooks/oomkilled_memory_pressure

Use this runbook →

kubectl

# 1. Check pod OOMKill history
kubectl describe pod POD -n NAMESPACE | grep -A5 'Last State'
kubectl get events -n NAMESPACE --field-selector involvedObject.name=POD

# 2. Check memory limits
kubectl get pod POD -n NAMESPACE -o jsonpath='{.spec.containers[0].resources}'

# 3. Check working-set vs limit (requires metrics-server)
kubectl top pod POD -n NAMESPACE --containers

# 4. Check HPA
kubectl get hpa -n NAMESPACE

# 5. Bump memory limit (Deployment)
kubectl patch deployment DEPLOYMENT -n NAMESPACE \
  --type=strategic \
  --patch='{"spec":{"template":{"spec":{"containers":[{"name":"CONTAINER","resources":{"limits":{"memory":"512Mi"}}}]}}}}'

# 6. Trigger HPA scale-out
kubectl annotate hpa HPA_NAME -n NAMESPACE \
  byteport.io/scale-triggered=$(date -u +%Y-%m-%dT%H:%M:%SZ) --overwrite

# 7. Node pressure: cordon + drain
kubectl cordon NODE
kubectl drain NODE --ignore-daemonsets --delete-emptydir-data --grace-period=60

# 8. Verify pod stabilises
kubectl get pod POD -n NAMESPACE --watch
kubectl top pod POD -n NAMESPACE --containers

node_not_ready

K8sEvents Prometheus Datadog Alertmanager GrafanaAlerting PagerDuty

Kubernetes Critical GA v0.1.28

Detects when

Detects Kubernetes Node NotReady and kubelet failure across 6 signals (node_not_ready / node_disk_pressure / node_memory_pressure / node_pid_pressure / kubelet_unhealthy / node_network_unavailable), classifies root cause across 6 classes (transient_kubelet_hang / runtime_crash / resource_pressure_disk / resource_pressure_mem_pid / network_plugin_failure / hardware_cloud_failure), and autonomously cordons, drains (respecting PDBs), restarts kubelet or container runtime, prunes images for disk pressure, or restarts CNI daemonset pods. Cloud provider node replacement (AWS ASG, GCP MIG, Azure VMSS) is gated behind operator confirmation or BYTEPORT_AUTO_REPLACE_NODES=true. Hard guardrail: never drain more than 1 node concurrently, never touch control-plane nodes.

Auto-fixes by

Detect: poll NodeCondition transitions via kubectl; pull kubelet systemd status (systemctl is-active/is-failed); read recent kernel messages (journalctl -k) for OOM/hardware errors
Diagnose: kubectl describe node (conditions, allocatable vs. capacity, taints, events), journalctl -u kubelet --since '30 min ago' (last 200 lines), container runtime status (systemctl status containerd/crio), inode exhaustion (df -i), fd exhaustion (/proc/sys/fs/file-nr), image disk usage (crictl/docker images)
Classify (first match, confidence ≥0.70): hardware_cloud_failure (kernel OOM/MCE/panic, conf=0.94, always escalate); network_plugin_failure (NetworkUnavailable + CNI errors, conf=0.88); runtime_crash (containerd/crio unit failed, conf=0.91); resource_pressure_disk (DiskPressure condition or disk ≥85% or inode_free ≤5%, conf=0.90); resource_pressure_mem_pid (MemoryPressure/PIDPressure condition, conf=0.88); transient_kubelet_hang (inactive but not failed, restart_count ≤2, conf=0.82); unknown (<0.70 confidence, always escalate)
Remediate — cordon node first, then per class: transient_kubelet_hang → systemctl restart kubelet; runtime_crash → drain → systemctl restart containerd/crio then kubelet; resource_pressure_disk → drain → crictl/docker image prune + delete evicted pods + restart kubelet; resource_pressure_mem_pid → drain → restart kubelet (node replacement if memory ≥95%); network_plugin_failure → delete CNI daemonset pod on node (flannel/calico/cilium/weave); hardware_cloud_failure → escalate with full bundle, cloud-provider node replacement if BYTEPORT_AUTO_REPLACE_NODES=true
Verify: node returns to Ready within 5 min window; no new eviction events for 5 min; no PDB violations; pods rescheduled and Running/Ready; retry once at 15s; escalate on double-fail
Post-mortem: structured record — node_name, signal, classification, confidence, root_cause, actions_taken, cordon/drain/restart timeline, pods_evicted, pods_rescheduled, time_to_recovery_seconds, escalated bool

Safety gates

Never drain more than 1 non-control-plane node concurrently (in-process _active_drains lock) — refuses with escalation if limit reached
Never touch nodes with label node-role.kubernetes.io/control-plane or node-role.kubernetes.io/master — always escalate
Respect PodDisruptionBudgets: kubectl drain with --disable-eviction=false; if PDB violation returned → surface error verbatim and escalate
Hardware/cloud failures always escalate — no autonomous node replacement unless BYTEPORT_AUTO_REPLACE_NODES=true explicitly set
Minimum confidence 0.70 before acting autonomously — lower confidence escalates with full diagnostic bundle
Image prune only (crictl/docker rmi --prune) — never deletes PersistentVolumes or PersistentVolumeClaims
BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run: all kubectl/systemctl/crictl/docker calls are no-ops

Escalates with

Postmark + Slack escalation with: node name, signal, classification, confidence, kernel error excerpt (if hardware), kubelet journal tail (last 10 lines), runtime status, disk usage, actions taken or blocked, PDB status, drain outcome. GitHub issue with full kubectl describe node output + NodeCondition timeline + classification rationale. 1h dedup per node.

RBAC permissions

core/nodes (get, describe — conditions, allocatable, taints; cordon via kubectl cordon) core/nodes (drain — eviction API; respects PodDisruptionBudgets) core/pods (list — find evicted pods for cleanup; get — rescheduling verification) core/pods (delete — remove evicted/Failed phase pods) policy/poddisruptionbudgets (get — PDB violation guard during drain) apps/daemonsets (get — locate CNI daemonset for pod restart) core/events (list — recent node events, eviction event detection)

Slack example

🔴 *node_not_ready* | node: `worker-3` | signal: `kubelet_unhealthy` | class: `runtime_crash` | conf: 0.91 | action: cordon → drain (4 pods evicted) → systemctl restart containerd → systemctl restart kubelet | node Ready in 2m18s | 🔗 /runbooks/node_not_ready

Use this runbook →

kubectl

# 1. Check node conditions
kubectl get node NODENAME -o wide
kubectl describe node NODENAME | grep -A 10 Conditions:

# 2. Check kubelet systemd status (on node)
systemctl status kubelet
journalctl -u kubelet --since '30 min ago' --no-pager | tail -50

# 3. Check container runtime (containerd)
systemctl status containerd
journalctl -u containerd --since '10 min ago' --no-pager | tail -30

# 4. Disk and inode usage (on node)
df -h /
df -i /

# 5. Cordon node
kubectl cordon NODENAME

# 6. Drain node (respecting PDBs)
kubectl drain NODENAME --ignore-daemonsets --delete-emptydir-data --grace-period=60

# 7. Restart kubelet
systemctl restart kubelet

# 8. Verify node returns Ready
kubectl get node NODENAME --watch

# 9. Uncordon when healthy
kubectl uncordon NODENAME

crash_loop_backoff

K8sEvents Prometheus Datadog Alertmanager GrafanaAlerting

Kubernetes Critical GA v0.1.24

Detects when

Detects Kubernetes pods in CrashLoopBackOff and container restart loops, classifies root cause across 7 failure classes (oom_at_startup / bad_image_or_tag / failed_init_container / probe_misconfigured / missing_configmap_or_secret / app_panic_on_boot / node_pressure_eviction), and autonomously remediates within hard guardrails — max 1 rollback/service/hour, max 1 resource bump/pod/day, never delete PersistentVolumes.

Auto-fixes by

Diagnose: kubectl describe pod, previous + current container logs (last 2000 chars), exit code analysis (137 OOM / 139 segfault / 127 missing binary / 1 app error), event timeline, image pull status, probe definitions, resource requests vs node capacity, recent ConfigMap/Secret changes
Classify (first match): node_pressure_eviction (node MemoryPressure/DiskPressure condition); bad_image_or_tag (ImagePullBackOff/ErrImagePull in events); missing_configmap_or_secret (missing K8s resource in events); failed_init_container (init container exit_code ≠ 0); oom_at_startup (exit code 137); probe_misconfigured (liveness/readiness probe fires before app ready, app logs show healthy startup); app_panic_on_boot (exit 139 or panic/fatal/exception in logs + restart_count ≥3)
oom_at_startup: bump memory request/limit 1.5× (max 1 bump/pod/day), patch deployment, rolling restart
bad_image_or_tag: kubectl rollout undo deployment (max 1 rollback/service/hour)
failed_init_container: delete pod to re-trigger init if non-external dep; escalate if external dep connectivity failure
probe_misconfigured: patch deployment with doubled initialDelaySeconds and failureThreshold=5
app_panic_on_boot: extract stack trace from previous logs; rollback if restart_count ≥3
node_pressure_eviction: kubectl cordon node, delete pod to reschedule onto healthy node
Verify: pod reaches Ready, restart delta=0 over 5 min, probe Unhealthy events gone; retry once after 60s
Post-mortem: emit structured event (classification, action, time-to-resolve, rollback-needed) to /demo timeline

Safety gates

Max 1 kubectl rollout undo per service per hour (in-process rolling counter)
Max 1 memory resource bump per pod per day (in-process rolling counter)
Never delete PersistentVolumeClaims or PersistentVolumes under any circumstance
missing_configmap_or_secret: audit only — never auto-create Secrets (operator action required)
failed_init_container: escalate on external dependency failure, do not loop-retry
Minimum confidence 0.70 before acting autonomously — lower confidence escalates
Verify must confirm pod Ready + zero new restarts before marking resolved; escalate on double-fail

Escalates with

Postmark alert with pod name, namespace, exit code, classification, confidence score, previous log excerpt (last 500 chars), actions taken or blocked, and guardrail hit (if any). GitHub issue with byteport + severity:critical + crash_loop labels, full diagnostic bundle. 24h dedup: same pod + same exit code → adds comment, not new issue.

RBAC permissions

core/pods (describe pod, read spec, resource limits, restart counts) core/pods/log (read previous and current container logs) core/pods/exec (kubectl top pod for live memory usage) core/namespaces (read byteport.io/allow-remediation annotation) core/events (read BackOff/CrashLoopBackOff/ImagePullBackOff events) core/nodes (read MemoryPressure/DiskPressure conditions, cordon) apps/deployments (rollout undo, patch memory limits, patch probe settings)

Slack example

🔴 *crash_loop_backoff* | namespace: `backend` | pod: `api-7d9f8-xk2p9` | class: `oom_at_startup` | exit: 137 | restarts: 8 | action: memory limit 256Mi → 384Mi, rolling restart | pod ready in 2m14s | 🔗 /runbooks/crash_loop_backoff

Use this runbook →

kubectl

# 1. Check pod status and restart count
kubectl get pod -n NAMESPACE POD -o wide

# 2. Fetch previous container logs
kubectl logs POD -n NAMESPACE --previous --tail=100

# 3. Describe for events and exit code
kubectl describe pod POD -n NAMESPACE

# 4. Exit code classification
#   137 → OOMKilled (bump memory limit)
#   139 → Segfault (rollback or investigate binary)
#   127 → Missing binary/library (image or entrypoint issue)
#   1   → App error (check logs for panic/fatal)

# 5. Rollback to last-good ReplicaSet
kubectl rollout undo deployment/DEPLOY -n NAMESPACE

# 6. Bump memory limit (OOM at startup)
kubectl patch deployment DEPLOY -n NAMESPACE -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"384Mi"}}}]}}}}

# 7. Loosen liveness probe (probe misconfigured)
kubectl patch deployment DEPLOY -n NAMESPACE -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","livenessProbe":{"initialDelaySeconds":30,"failureThreshold":5}}]}}}}'

pod_crashloop

K8sEvents Datadog Prometheus

Kubernetes Critical GA v0.1.4

Detects when

Detects Kubernetes pods in crashloop and classifies the exit code (OOMKilled, missing binary, image auth, config error) before safely restarting with back-off guard.

Auto-fixes by

Fetch kubectl logs --previous to capture crash payload
Classify exit code: 137=OOMKilled, 127=missing binary, 1=app error, 0=image auth failure
For OOMKilled: inspect memory.limit vs usage, suggest adjustment if limit is within 20% of actual
For transient errors: apply exponential back-off restart policy via kubectl rollout restart
Safe-restart only if byteport.io/allow-remediation: "true" annotation present on namespace

Safety gates

Requires byteport.io/allow-remediation: "true" on the namespace — pod-level annotation insufficient
Never force-deletes a pod in Terminating state
Requires crash classification before acting — raw restart count alone does not trigger
Max 1 restart attempt per 3-minute window
Skipped if priorityclass.kubernetes.io/no-autoremediation: "true" present

Escalates with

Postmark alert with pod name, namespace, exit code, previous log excerpt (last 500 chars), cluster, and 3 suggested steps. GitHub issue with byteport + severity:{level} + pod_crashloop labels. 24h dedup: same pod + same exit code → adds comment, not new issue.

RBAC permissions

core/pods/log (read previous container logs) core/pods (describe pods for exit code and events) core/namespaces (read byteport.io annotations) core/events (read namespace events for BackOff reason)

Slack example

🚨 *pod_crashloop* | namespace: `payment-api` | pod: `api-7d9f8-xk2p9` | exit code: `137 (OOMKilled)` | restarts: 5 in last 10 min | action: memory limit patched + pod restarted | 🔗 https://app.byteport.polsia.app/runbooks/pod_crashloop

Use this runbook →

kubectl

# 1. Check pod status and restart count
kubectl get pod -n NAMESPACE POD -o wide

# 2. Fetch previous container logs
kubectl logs POD -n NAMESPACE --previous

# 3. Describe for events and exit code
kubectl describe pod POD -n NAMESPACE

# 4. Classify via exit code
#   137 → OOMKilled (memory limit exceeded)
#   127 → Missing binary or library
#   1   → Application error (check logs)
#   0   → ImagePullBackOff or ConfigError

# 5. Safe-restart with back-off
kubectl rollout restart deployment/DEPLOY -n NAMESPACE

oom_killed_pod

K8sEvents Datadog Prometheus

Kubernetes Critical GA v0.1.5

Detects when

Detects pods killed by the Kubernetes OOM killer (exit code 137) and safely patches memory limits upward, restarts the pod, and enables debug logging to capture future leak patterns.

Auto-fixes by

Read current memory.request and memory.limit from pod spec
Compare against actual usage from kubectl top pod (Metrics Server required)
If usage >80% of limit: suggest increasing memory.limit to usage × 1.5
If leak signature detected: patch memory.limit upward and file postmortem
If limits are appropriately sized: flag for app-level investigation and escalate
Restart pod with DEBUG=true env var injection to capture future leak patterns

Safety gates

Requires byteport.io/auto-remediate: "oom" annotation on the namespace
Limit patch capped at 2× current value to avoid unbounded resource consumption
kubectl patch --dry-run=server first — rejected if it would exceed node allocatable memory
Only the OOM'd container is restarted; sidecar containers are preserved
Post-restart verification: pod must reach Running state within 60 seconds

Escalates with

Postmark alert with pod name, namespace, OOM count in last 24h, memory limit vs. actual usage comparison, suggested limit adjustment, and whether fix was applied or escalated. GitHub issue includes node allocatable headroom and top consumers on the node.

RBAC permissions

core/pods (read pod spec for memory limits) core/pods/exec (kubectl top pod requires exec for Metrics Server) core/namespaces (read byteport.io/auto-remediate annotation) core/nodes (read node allocatable for dry-run validation)

Slack example

🔴 *oom_killed_pod* | namespace: `backend` | pod: `worker-5f9b-x1p3q` | limit: 256Mi → patched to 512Mi | restart: initiated | 3 prior OOMs on this pod — consider increasing HPA memory floor | 🔗 /runbooks/oom_killed_pod

Use this runbook →

kubectl

# 1. Check OOM event in pod events
kubectl get events -n NAMESPACE --field-selector reason=OOMKilled

# 2. Get current memory limits
kubectl get pod POD -n NAMESPACE -o jsonpath='{.spec.containers[0].resources}'

# 3. Compare with actual usage
kubectl top pod POD -n NAMESPACE

# 4. Patch memory limit upward (dry-run)
kubectl patch pod POD -n NAMESPACE --dry-run=server -p '{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"512Mi"}}}]}}'

# 5. Apply patch and restart
kubectl patch pod POD -n NAMESPACE -p '{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"512Mi"}}}]}}'
kubectl delete pod POD -n NAMESPACE

image_pull_backoff

K8sEvents

Kubernetes Warning GA v0.1.6

Detects when

Detects Kubernetes pods failing to pull their container image (ErrImagePull, ImagePullBackOff) and surfaces the root cause — auth failure, bad tag, or network misconfiguration — with specific remediation steps.

Auto-fixes by

Classify failure type: auth (401/403), missing tag (404), network unreachable, or registry timeout
For auth failures: list imagePullSecrets on the pod and identify the missing or misconfigured secret
For missing tag: suggest the correct image tag from the previous deployment's image
For network unreachable: check node DNS resolution and registry reachability
For registry timeout: check if registry rate limit has been hit (Docker Hub unauthenticated)
Suggest kubectl set image or kubectl patch secret commands for the specific failure type

Safety gates

Auto-remediation limited to diagnostics and suggestions — no destructive actions taken
Requires byteport.io/allow-remediation: "true" annotation to include secret patch suggestion
Never modifies imagePullSecrets directly — only surfaces the correct patch command
Skipped if pod is already terminating or in a Completed state
Max 1 alert per 10-minute window per pod to avoid alert storms during image updates

Escalates with

Postmark alert with failure classification (auth/tag/network/ratelimit), the specific image URI, registry configuration state, and the exact kubectl commands to fix it. GitHub issue with image_pull_backoff label and registry diagnostic output.

RBAC permissions

core/pods (read image, imagePullSecrets, container spec) core/secrets (read imagePullSecrets to identify misconfiguration — redacted in alert) core/namespaces (read byteport.io annotations) core/events (read image pull failure events)

Slack example

⚠️ *image_pull_backoff* | namespace: `checkout` | pod: `api-0` | reason: `401 Unauthorized` — imagePullSecret `reg-creds` may be expired | fix: `kubectl create secret docker-registry ...` | 🔗 /runbooks/image_pull_backoff

Use this runbook →

kubectl

# 1. Check image pull failure events
kubectl get events -n NAMESPACE --field-selector reason=Failed

# 2. Inspect pod image and pull secrets
kubectl get pod POD -n NAMESPACE -o jsonpath='{.spec.imagePullSecrets}' | jq
kubectl get pod POD -n NAMESPACE -o jsonpath='{.spec.containers[0].image}'

# 3. Test registry authentication
kubectl create secret docker-registry reg-secret \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=USER \
  --docker-password=PASS \
  --docker-email=EMAIL -n NAMESPACE

# 4. Patch image to a known-good tag
kubectl set image deployment/DEPLOY CONTAINER=registry.io/image:sha-TAG -n NAMESPACE

# 5. Verify pull succeeds
kubectl describe pod POD -n NAMESPACE | grep -A5 Events

pod_pending_unschedulable

K8sEvents

Kubernetes Warning Beta v0.1.11

Detects when

Detects Kubernetes pods stuck in Pending state due to unschedulable conditions — resource exhaustion, storage constraints, affinity conflicts, or missing dependencies — and applies emergency scheduling workarounds where viable.

Auto-fixes by

Read Events for unschedulable reason: Insufficient memory/CPU, NoVolumeZoneConflict, AffinityConflict, NodeSelectorMismatch, or ResourceQuotaExceeded
Check node allocatable vs. requested — identify if cluster has headroom on any node
If cluster has headroom: attempt to relax PodAntiAffinity constraints to schedule on a less-contested node
If priorityClass is low: suggest bumping to a cluster-appropriate priority to preemption-wins the scheduling slot
Clear taints from target nodes where tolerations are misconfigured
Trigger cluster autoscaling via cloud provider API if node pool has room to scale up

Safety gates

Requires byteport.io/allow-remediation: "true" on the namespace
Never force-schedules a pod on a specific node — always resolves the scheduling constraint generically
Never modifies node resources, node selectors, or node labels
Never removes PodDisruptionBudgets to increase schedulability
Skips when reason is ResourceQuotaExceeded — quota changes require human approval
BETA: requires byteport.io/beta-ack: "true" annotation on the namespace

Escalates with

Postmark alert with scheduling failure reason, requested vs. allocatable resources across nodes, and the specific kubectl command to resolve (e.g., increase memory request, add node toleration). GitHub issue for ResourceQuotaExceeded cases requiring human quota adjustment.

RBAC permissions

core/nodes (read node allocatable and existing taints) core/pods (read pod spec for affinity and resource requests) core/namespaces (read byteport.io annotations) core/events (read FailedScheduling events for reason classification)

Slack example

⚠️ *pod_pending_unschedulable* [beta] | namespace: `queues` | pod: `worker-2` | reason: `Insufficient memory` | cluster headroom: 2 nodes available | action: relaxed anti-affinity, pod now scheduling | 🔗 /runbooks/pod_pending_unschedulable

Use this runbook →

kubectl

# 1. Check why pod is unschedulable
kubectl describe pod POD -n NAMESPACE | grep -A10 Events

# 2. Check node allocatable vs. requests
kubectl describe nodes | grep -E 'Allocatable|Allocated resources|memory|CPU'

# 3. List pods by priority (identify preemption candidates)
kubectl get pods -n NAMESPACE -o 'jsonpath={range .items[*]}{"Pod: "}{.metadata.name}{"\tPriority: "}{.spec.priority}{"\tNode: "}{.spec.nodeName}{"\n"}{end}' | sort -k3

# 4. Remove taint to schedule (when toleration misconfigured)
kubectl taint nodes NODENAME node.kubernetes.io/not-ready:NoSchedule-

# 5. Bump priority to enable preemption
kubectl patch pod POD -n NAMESPACE -p '{"spec":{"priorityClassName":"high-priority"}}'

node_not_ready

K8sEvents CloudWatch

Kubernetes Critical Beta v0.1.11

Detects when

Detects Kubernetes nodes entering NotReady, MemoryPressure, DiskPressure, or PIDPressure state and coordinates a safe node drain — cordoning the node, allowing graceful pod termination, and restoring schedulability once the node recovers.

Auto-fixes by

Check if node is already unschedulable (cordon happened) or still accepting new pods
Identify critical pods with disruption budgets on the node
If critical pods exist: wait up to 120 seconds for PodDisruptionBudget to be satisfied before draining
Execute kubectl cordon NODE, then kubectl drain NODE --ignore-daemonsets --delete-emptydir-data --grace-period=60
After drain completes: mark node as cordoned so new pods don't schedule
Post-recovery: uncordon node when NodeReady status is restored for 60 continuous seconds

Safety gates

Requires byteport.io/allow-remediation: "true" on the namespace
Critical pods (PriorityClass system or above 10000) are never forcibly evicted — drain blocked until PDB satisfied or 5-minute timeout
PDB timeout: 5 minutes max wait — after timeout, drain proceeds and issue is created for operator
Node is never force-deleted from the cluster
BETA: requires byteport.io/beta-ack: "true" annotation on the namespace

Escalates with

Postmark alert if node stays NotReady >10 minutes with: node name, current pressure type, allocatable vs. requested resources, top 5 pods on node and their priority. GitHub issue with node status timeline, recent events log, and recommended remediation steps. 4h dedup window.

RBAC permissions

core/nodes (read status, read/write unschedulable field) core/pods (read pod spec for critical annotation and priority class) core/pods/evict (evict pods during drain) core/namespaces (read byteport.io annotations) core/events (read node events for status transitions)

Slack example

🔴 *node_not_ready* [beta] | node: `ip-10-0-2-14` | condition: `MemoryPressure` | critical pods on node: 2 (hpa-managed) | action: cordon + drain initiated | 🔗 /runbooks/node_not_ready

Use this runbook →

kubectl

# 1. Check node status and events
kubectl get nodes NODE -o wide
kubectl describe nodes NODE | grep -A20 Events

# 2. Identify pressure type
kubectl get events -n default --field-selector involvedObject.name=NODE --sort-by='.lastTimestamp'

# 3. Check critical pods on node
kubectl get pods -A -o 'jsonpath={range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.priority}{"\t"}{.spec.nodeName}{"\n"}{end}' | grep NODE | sort -k3

# 4. Cordon and drain
kubectl cordon NODE
kubectl drain NODE --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=300s

# 5. Monitor drain progress
kubectl get events -n default --field-selector reason=Draining

# 6. Restore node (post-recovery)
kubectl uncordon NODE

deployment_rollback

Webhook Datadog

Kubernetes Warning GA v0.1.2

Detects when

Ingest failed deployment webhooks from CI systems (GitHub Actions, ArgoCD, Flux, Spinnaker) and automatically rolls back to the last healthy revision — with safety gates and commit-author notification.

Auto-fixes by

Read byteport.io/allow-rollback: "true" annotation on the Deployment or Namespace
Inspect kubectl rollout history deployment/NAME -n NAMESPACE to confirm a stable previous revision exists
Execute kubectl rollout undo deployment/NAME -n NAMESPACE
Notify the commit author via Postmark with rollback confirmation, revision diff, and rollback duration
Log the rollback to the signal event audit trail for postmortem correlation

Safety gates

Hard-gate: byteport.io/allow-rollback annotation must be present — no implicit rollback
Requires at least one stable revision in rollout history before executing undo
Never rolls back if current revision is less than 2 minutes old — deploy may still be in progress
Blocked if Deployment is managed by ArgoCD in auto-sync mode (checked via annotation)
Pre-rollback snapshot captures replica count, image tag, and resource limits for audit

Escalates with

Postmark alert with deployment name, namespace, failure reason from webhook payload, commit SHA, author, and what prevented auto-rollback (missing annotation, no stable revision, ArgoCD sync conflict). 24h dedup per deployment.

RBAC permissions

apps/deployments (read rollout history, execute rollout undo) core/namespaces (read byteport.io/allow-rollback and ArgoCD sync-mode annotations) core/events (log rollback action to namespace events)

Slack example

↩️ *deployment_rollback* | deploy: `api-service` | namespace: `production` | failure: `test failure` | SHA: `a3f9c12` | author: `maria@` | rollback: triggered (revision 42 → 41) | 🔗 /runbooks/deployment_rollback

Use this runbook →

kubectl

# 1. Check rollback eligibility annotation
kubectl get deploy NAME -n NAMESPACE -o jsonpath='{.metadata.annotations.byteport.io\/allow-rollback}'

# 2. Inspect rollout history
kubectl rollout history deployment/NAME -n NAMESPACE

# 3. Roll back to last stable revision
kubectl rollout undo deployment/NAME -n NAMESPACE

# 4. Watch rollback progress
kubectl rollout status deployment/NAME -n NAMESPACE

# 5. Verify previous image/tag restored
kubectl get deploy NAME -n NAMESPACE -o jsonpath='{.spec.template.spec.containers[0].image}'

tls_cert_expiry_auto_renewal

Prometheus Alertmanager DatadogMonitors GrafanaAlerting PagerDuty

TLS / Certificates Critical GA v0.1.27

Detects when

Detects TLS certificate expiry across 6 signals (cert_expiring_soon at 30/14/7/1-day thresholds, cert_expired, cert_chain_invalid, ocsp_stapling_failure, acme_challenge_failure, renewal_rate_limit_hit), classifies the renewal approach across 7 classes (acme_http01, acme_dns01, cert_manager, acme_rate_limited, cloud_managed, commercial_ca, internal_ca), and autonomously renews via certbot, cert-manager CRD delete-and-reissue, or cloud API — then hot-reloads nginx/envoy/haproxy/traefik/k8s-ingress and verifies chain validity plus OCSP staple. Escalates immediately for pinned certs, wildcard certs without a configured DNS-01 provider, rate-limited domains, and commercial/internal CAs.

Auto-fixes by

Diagnose: openssl s_client -connect domain:443 -showcerts -status to fetch served cert; parse notAfter, issuer O=/CN=, SAN list, serial, OCSP Must-Staple extension, OCSP staple presence; detect wildcard (*.domain); locate cert files on disk (certbot live dir, /etc/ssl/certs, /etc/nginx/ssl); identify web server via pgrep (nginx/envoy/haproxy/traefik/caddy/k8s ingress); check DNS zone ownership (dig SOA); detect cert-manager Certificate CRD in cluster; estimate ACME rate-limit window remaining; check BYTEPORT_PINNED_DOMAINS for cert pinning.
Classify (first match, confidence-gated ≥0.70): acme_rate_limited (signal=renewal_rate_limit_hit or window exhausted, conf=0.92); cert_pinned (BYTEPORT_PINNED_DOMAINS match, conf=0.95, always escalate); cert_manager (cert-manager CRD present, conf=0.93); cloud_managed (issuer O in {Amazon, Google Trust Services}, conf=0.88); acme_dns01 (wildcard or challenge_failure + dns_provider configured, conf=0.91); acme_http01 (LE/ZeroSSL issuer, non-wildcard, zone owned, conf=0.90); internal_ca (self-signed, conf=0.87); commercial_ca (fallback, conf=0.75).
Remediate: acme_http01 → certbot renew --cert-name --webroot or --standalone --non-interactive [--server staging]; acme_dns01 → certbot renew --dns-{route53|cloudflare|google|azure}; cert_manager → kubectl delete secret -tls + kubectl annotate Certificate cert-manager.io/issue-temporary-certificate=true; cloud_managed → aws acm renew-certificate --certificate-arn $BYTEPORT_ACM_ARN; commercial_ca/internal_ca/cert_pinned/rate_limited → escalate with full cert bundle + retry ETA.
Hot-reload (server-aware, no downtime): nginx → nginx -s reload; caddy → caddy reload --config /etc/caddy/Caddyfile; haproxy → haproxy -sf $(cat /run/haproxy.pid); envoy/traefik → kill -HUP ; k8s ingress → kubectl rollout restart deployment/; unknown server → escalate.
Verify (wait_seconds=10): re-fetch via openssl s_client; confirm notAfter > now+30d; validate chain (openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt); check OCSP staple if Must-Staple set; confirm serial changed from pre-renewal; handshake_ok. On failure: restore cert backup + reload + escalate.
Post-mortem: structured record — domain, signal, classification, issuer, challenge_used, tool_used, days_remaining_before/after, chain_valid_after, ocsp_ok_after, resolved bool, time_to_resolve_seconds, escalated bool.

Safety gates

DNS zone ownership confirmed via SOA lookup before any ACME action — refuses if zone not under our control
Cert backup always created before cert swap (BYTEPORT_CERT_BACKUP_DIR); rollback restores backup + reloads on verify failure
Pinned certs (BYTEPORT_PINNED_DOMAINS) always escalate — never renew autonomously (mobile client breakage risk)
ACME rate-limit awareness: tracks in-process renewal history; refuses within window; parses Retry-After for retry ETA
Wildcard certs require dns-01 provider to be configured — escalates if missing
Minimum confidence 0.70 before acting autonomously — lower confidence escalates with full diagnostic bundle
BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run: all certbot/kubectl/aws commands are no-ops; dry-run also passed to certbot

Escalates with

Postmark + Slack escalation with: domain, signal, classification, issuer, days_remaining, challenge attempted, error details, ACME rate-limit retry ETA (if applicable), cert backup path. GitHub issue with full openssl s_client output + cert chain + classification rationale. 6h dedup per domain.

RBAC permissions

core/secrets (get, delete — for cert-manager TLS secret delete-and-reissue) cert-manager.io/certificates (get, patch, annotate — force re-issuance) apps/deployments (rollout restart — for k8s ingress controller reload) exec: nginx -s reload / caddy reload / kill -HUP (for non-k8s server reload)

Slack example

🔒 *tls_cert_expiry_auto_renewal* | domain: `api.prod.example.com` | signal: `cert_expiring_soon` | days_remaining: 6 | class: `acme_http01` | action: certbot renew --cert-name api.prod.example.com → renewed | reload: nginx -s reload | chain valid ✓ | days_after: 89d | 🔗 /runbooks/tls_cert_expiry_auto_renewal

Use this runbook →

bash

# 1. Inspect served cert
openssl s_client -connect DOMAIN:443 -showcerts -status -servername DOMAIN < /dev/null 2>&1 | openssl x509 -noout -text -dates

# 2. Check cert chain validity
openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt /etc/letsencrypt/live/DOMAIN/fullchain.pem

# 3. Dry-run renewal
certbot renew --cert-name DOMAIN --dry-run --non-interactive

# 4. Live renewal (http-01)
certbot renew --cert-name DOMAIN --webroot -w /var/www/html --non-interactive

# 5. Live renewal (dns-01 / wildcard)
certbot renew --cert-name DOMAIN --dns-route53 --non-interactive

# 6. Reload nginx
nginx -s reload

# 7. Verify new cert
openssl s_client -connect DOMAIN:443 -servername DOMAIN < /dev/null 2>&1 | openssl x509 -noout -enddate

cert_expiry

Datadog CustomMetric

TLS / Certificates Warning GA v0.1.7

Detects when

Detects SSL/TLS certificates expiring within 14 days (warning) or 7 days (critical) and triggers automated renewal via cert-manager, AWS ACM, or certbot — with pre/post snapshots and DNS validation checks.

Auto-fixes by

For cert-manager: patch Certificate resource with cert-manager.io/force-renew="true", wait for issuance and secret propagation
For AWS ACM: call acm.RequestCertificate for the same domain, wait for DNS validation if pending, log new ARN
For certbot/acme.sh: trigger certbot renew --quiet or acme.sh --renew -d domain.com, validate renewed cert serial
For manual certs: escalate with full cert metadata, days-to-expiry, and renewal commands — no automated path

Safety gates

Never runs renewal on certificates with fewer than 3 days remaining — too risky
Pre-renewal snapshot records current cert serial number, issuer, and SAN list
Post-renewal verification reads new cert serial and compares to pre-snapshot to confirm rotation succeeded
Requires byteport.io/auto-remediate: "cert" annotation on the namespace or Certificate resource
ACM DNS validation: runbook blocks if DNS challenge is not resolvable within 60 seconds

Escalates with

Postmark alert with domain, cert CN + SANs, days-to-expiry, issuer, current renewal path, and whether auto-renewal succeeded or was blocked. GitHub issue with cert fingerprint, renewal commands, and DNS admin contact. 7d dedup per domain.

RBAC permissions

core/certificates (read and patch cert-manager Certificate resources) core/secrets (read tls.crt from Secrets for OpenSSL scan) aws/acm (request new certificate, describe certificate for validation status)

Slack example

⚠️ *cert_expiry* | domain: `api.example.com` | cert: `*.example.com` | expires: 2026-06-06 (7 days) | issuer: Let's Encrypt | action: cert-manager force-renew triggered | 🔗 /runbooks/cert_expiry

Use this runbook →

shell

# 1. Check cert expiry (OpenSSL scan)
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null | openssl x509 -noout -dates

# 2. cert-manager: force renewal
kubectl annotate certificate NAME -n NAMESPACE cert-manager.io/force-renew="true"

# 3. Wait for cert to be re-issued
kubectl get certificate NAME -n NAMESPACE -w

# 4. Verify new cert in Secret
kubectl get secret TLS-SECRET-NAME -n NAMESPACE -o jsonpath='{.data.tls.crt}' | base64 -d | openssl x509 -noout -serial -dates

# 5. Reload nginx / ingress controller
kubectl rollout restart deployment/ingress-controller -n ingress-nginx

ssl-cert-auto-rotate

CloudWatch GrafanaAlerting PagerDuty

TLS / Certificates Critical Beta v0.1.15

Detects when

End-to-end autonomous TLS certificate rotation: diagnoses cert management toolchain (cert-manager / Caddy / Traefik / certbot / IaC), validates ACME challenge via Let's Encrypt staging, backs up cert, rotates, and confirms via TLS handshake (3× retry). Escalates with full diagnostic if any step fails.

Auto-fixes by

Detect cert source: cert-manager (CRD), Caddy/Traefik (process detection), certbot/acme.sh (config dir), IaC (metadata flags)
Validate ACME challenge path via Let's Encrypt staging dry-run (DNS-01 or HTTP-01) — abort and escalate if staging fails
Backup current cert + key to /var/lib/byteport/cert-backups/-/ for 7-day rollback
cert-manager: kubectl annotate certificate with cert-manager.io/force-renew, wait Ready=True (120 s)
Caddy / Traefik: systemctl reload to trigger internal ACME renewal
certbot / acme.sh: certbot renew --force-renewal
IaC-managed: open PR via GitHub adapter bumping cert resource — never auto-apply
TLS-handshake verify 3× with 30 s backoff, parse notAfter to confirm ≥30 days remaining
Post-rotation Slack notification: ✅ Rotated cert for . New expiry: . Took .
On any failure: escalate to PagerDuty with full diagnostic, renewal command, and P-level (P1/P2/P3)

Safety gates

Hard rate limit: max 1 rotation per domain per 24 h — prevents duplicate ACME requests
BYTEPORT_RUNBOOK_DRY_RUN=true: all writes become no-ops; logs intended actions only
Pre-rotation backup always runs before any cert write
Staging validation gate: certbot --dry-run --staging must exit 0 before live rotation on raw-ACME paths
IaC path is PR-only — never auto-applies Terraform/Pulumi changes
BETA: requires byteport.io/beta-ack: "true" annotation on the namespace for cert-manager paths

Escalates with

PagerDuty Events API v2 (via PagerDutyEventsNotifier) on any step failure, with domain, P-level (P1/P2/P3 by days remaining), cert source, staging output, and exact renewal commands. Slack block-kit notification on success with new expiry and rotation duration.

RBAC permissions

core/certificates (read + annotate cert-manager Certificate resources) core/namespaces (read byteport.io annotations) sysadmin: certbot (certbot renew — requires certbot installed on host) sysadmin: systemctl reload caddy / traefik sysadmin: /var/lib/byteport/cert-backups/ (write backup dir)

Slack example

✅ *ssl-cert-auto-rotate* | domain: `api.example.com` | cert-manager | new expiry: Aug 30 2026 | took 4.2 s | 🔗 /runbooks/ssl-cert-auto-rotate

Use this runbook →

typescript

import { SslCertAutoRotateRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createCloudWatchAdapter } from "byteport-agent/adapters/cloudwatch";

// Signal adapter — fires ssl_cert_expiring when DaysToExpiry < 14
const adapter = createCloudWatchAdapter({ region: "us-east-1" });

const registry = new RunbookRegistry();
registry.register(new SslCertAutoRotateRunbook());

const agent = createAgent({
  adapters: [adapter],
  runners: [registry.asRunner("runbook")!],
  dryRun: true, // flip false after confirming dry-run output
  thresholds: { ssl_cert_expiring: 14 }, // trigger when ≤14 days remain
});

await agent.run();

tls_cert_expiry_auto_renew

Datadog Prometheus CloudWatch GrafanaAlerting PagerDuty

TLS / Certificates Critical GA v0.1.22

Detects when

End-to-end autonomous TLS certificate expiry recovery: diagnose served cert via openssl s_client (issuer, notAfter, SANs, chain), classify renewal path (acme_renewable / cert_manager_managed / commercial_ca_manual / internal_ca / wildcard_dns01_required), renew via certbot/acme.sh (http-01 or dns-01 via Route53/Cloudflare/GCP) or cert-manager force-renew, hot-reload the serving process (nginx/caddy/envoy/haproxy/traefik/k8s ingress), then verify new notAfter > now + 30d + chain valid + handshake succeeds. Backs up cert+key before every swap; rolls back and escalates on verification failure. Refuses if DNS zone not owned.

Auto-fixes by

Diagnose: openssl s_client -connect DOMAIN:443 -servername DOMAIN -status -showcerts → parse notAfter, issuer O=/CN=, SAN list, OCSP staple, chain validity; locate cert files on disk + identify server process (nginx/caddy/envoy/haproxy/traefik/k8s/unknown)
Classify (first match): cert_manager_managed if k8s env or --cert-name given; wildcard_dns01_required if *.domain with supported DNS provider; acme_renewable if Let's Encrypt / ZeroSSL issuer or cert in certbot live dir; internal_ca or commercial_ca_manual → escalate with full context
Backup: copy cert + privkey to /var/lib/byteport/cert-backups/-/ before any write
Renew acme_renewable: certbot renew --cert-name DOMAIN (http-01 default) or with --dns-PROVIDER for dns-01; acme.sh --renew as fallback
Renew wildcard dns-01: certbot/acme.sh with --dns-route53 / --dns-cloudflare / --dns-gcp flag
Renew cert_manager_managed: kubectl annotate certificate cert-manager.io/force-renew=true, kubectl wait --for=condition=Ready --timeout=120s
Hot-reload: nginx -s reload / caddy reload / kill -HUP envoy / systemctl reload haproxy / systemctl reload traefik / kubectl rollout restart ingress-controller
Verify: re-fetch cert via openssl s_client, confirm notAfter > now + 30d AND chain_valid AND handshake_ok; rollback + hot-reload + escalate on failure

Safety gates

DNS zone ownership check: refuses renewal if zone SOA cannot be resolved (dig/nslookup) — prevents acting on externally-managed domains
Backup-before-swap: cert + key backed up to /var/lib/byteport/cert-backups/ before any write; abort if backup fails
Rollback on verification failure: restores prior cert + key from backup, re-runs hot-reload, escalates with full diagnostic
commercial_ca_manual and internal_ca always escalate — never attempt autonomous renewal for non-ACME issuers
wildcard_dns01_required without a configured DNS provider → escalate (never attempt http-01 for wildcards)
BYTEPORT_RUNBOOK_DRY_RUN=true / --dry-run: all subprocess calls are no-ops; logs intended commands only
Minimum post-renewal floor: notAfter must be >30d from now for verification to pass — rejects a technically-renewed cert with unusually short validity

Escalates with

Postmark + Slack escalation with: domain, issuer (O=/CN=), days remaining, classification reasoning, renewal tool attempted, stdout/stderr of renewal command, backup path, verification outcome (days/chain/handshake), rollback result, and exact manual renewal commands for the operator. GitHub issue with cert fingerprint, SAN list, OCSP staple status, and DNS admin contact. 7d dedup per domain. PagerDuty Events v2 on rollback trigger (verification failed after live renewal).

RBAC permissions

sysadmin: certbot renew / acme.sh --renew (ACME client on host) sysadmin: nginx -s reload / systemctl reload caddy haproxy traefik (web server reload) sysadmin: /var/lib/byteport/cert-backups/ (write backup dir) sysadmin: /etc/letsencrypt/live/ and /etc/ssl/ (read cert files on disk) k8s: certificates.cert-manager.io (get + annotate), deployments (rollout restart ingress-controller) dns: Route53 IAM / Cloudflare API token / GCP service account (dns-01 challenge, optional)

Slack example

✅ *tls_cert_expiry_auto_renew* | domain: `api.example.com` | issuer: `Let's Encrypt` | was: 4d remaining | renewed via: certbot http-01 | new expiry: +90d | reload: nginx -s reload OK | chain: valid | handshake: OK | 🔗 https://app.byteport.polsia.app/runbooks/tls_cert_expiry_auto_renew

Use this runbook →

python

# One-shot dry-run (check what would happen)
python runbooks/tls_cert_expiry_auto_renew.py --domain api.example.com

# Live renewal via certbot (http-01)
python runbooks/tls_cert_expiry_auto_renew.py --domain api.example.com --no-dry-run

# Wildcard via dns-01 + Route53
python runbooks/tls_cert_expiry_auto_renew.py \
  --domain '*.example.com' \
  --challenge dns01 --dns-provider route53 \
  --no-dry-run

# cert-manager managed cert
python runbooks/tls_cert_expiry_auto_renew.py \
  --domain api.example.com \
  --cert-name api-tls --namespace production \
  --no-dry-run

db_connection_pool_exhaustion

Prometheus Datadog Alertmanager GrafanaAlerting PagerDuty GenericWebhook

Database Critical GA v0.1.25

Detects when

Detects database connection pool exhaustion across Postgres, MySQL, and pgbouncer: classify root cause across 6 classes (leak_after_deploy / idle_in_tx_backlog / long_running_query / undersized_pool / missing_pgbouncer / runaway_cron_fanout), terminate idle-in-transaction sessions and long-running queries within hard safety caps, reload pgbouncer with raised pool size, trigger rolling restart of leaking deployments, and verify utilization drops below 70% with a structured postmortem including query fingerprints and permanent fix recommendation.

Auto-fixes by

Diagnose: query pg_stat_activity for full state breakdown (active / idle / idle-in-tx / waiting), capture top wait_events and long-running queries, pull pgbouncer SHOW POOLS + SHOW STATS, correlate connections_trend with last deploy SHA, inspect env vars for Hikari/SQLAlchemy/pgx pool config
Classify (first match, confidence ≥0.70): leak_after_deploy (deploy marker + rising idle-in-tx trend, conf=0.88); idle_in_tx_backlog (≥3 sessions older than threshold, conf=0.91); long_running_query (≥2 queries over budget or 1 over 10min, conf=0.85); missing_pgbouncer (pgbouncer_waiting>0, conf=0.80); runaway_cron_fanout (single app_name holds ≥15 connections, conf=0.82); undersized_pool (utilization ≥85% with no dominant cause, conf=0.75)
idle_in_tx_backlog: terminate eligible idle-in-transaction sessions via pg_terminate_backend, capped at kills_per_minute_cap (default 20/min), skipping pgbouncer/admin-proxy/superuser connections
long_running_query: cancel over-budget queries via pg_cancel_backend (graceful), fall back to pg_terminate_backend if cancel fails, capture query text for EXPLAIN postmortem
missing_pgbouncer / undersized_pool: send RELOAD to pgbouncer admin console to pick up raised default_pool_size from pgbouncer.ini
leak_after_deploy: terminate old idle-in-tx sessions first, then trigger kubectl rollout restart on the offending deployment to drain all leaked connection handles
runaway_cron_fanout: terminate idle-in-tx connections from the dominant application_name within cap; signal app tier to recycle
Verify: re-sample pg_stat_activity at 30s; pass if pool utilization <70%, lock_waiters=0, error_rate normalized; retry once at 60s; escalate on double-fail

Safety gates

Hard cap: kills_per_minute_cap (default 20) backend terminations per minute — rejects excess over cap and escalates
Critical app allowlist: never terminates pgbouncer, admin-proxy, pgadmin, or superuser connections under any circumstance
Superuser check: every connection is checked for is_superuser=true before termination — superusers are always skipped
pgbouncer RELOAD only (never KILL): pool config reload is safe; this runbook never issues SHUTDOWN or kills pgbouncer itself
BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run: all pg_terminate_backend / pg_cancel_backend / kubectl / psql calls are no-ops
Minimum confidence 0.70 before acting autonomously — lower confidence escalates to on-call with full diagnostic bundle
Verify must confirm utilization <70% before marking resolved; double-fail → escalate

Escalates with

Postmark + Slack escalation with: signal, classification, confidence score, before/after pool utilization %, pids_terminated, pids_cancelled, top 5 query fingerprints (literals stripped), pgbouncer pool state, last deploy SHA, kill cap status, and structured postmortem markdown with permanent fix recommendation. GitHub issue with full pg_stat_activity snapshot. 1h dedup per db_url.

RBAC permissions

pg_stat_activity (SELECT — connection diagnostics, state breakdown, wait events) pg_settings (SELECT — max_connections lookup) pg_terminate_backend (EXECUTE — idle-in-tx + runaway app termination, gated by cap and allowlist) pg_cancel_backend (EXECUTE — long-running query cancellation, graceful) pgbouncer admin console (RELOAD — pool config hot-reload, no SHUTDOWN) kubectl rollout restart (apps/deployments — leak-after-deploy rolling restart)

Slack example

🚨 *db_connection_pool_exhaustion* | host: `pg-prod-01/appdb` | connections: `94/100` (94%) | class: `idle_in_tx_backlog` | terminated: 12 sessions (idle >5min) from `worker-pool` | utilization after: 41% | action: pg_terminate_backend(1002,1003,...) | 🔗 /runbooks/db_connection_pool_exhaustion

Use this runbook →

sql

-- 1. Check connection saturation
SELECT count(*) AS total,
       (SELECT setting::int FROM pg_settings WHERE name='max_connections') AS max_conns,
       round(count(*)*100.0 / (SELECT setting::int FROM pg_settings WHERE name='max_connections'), 1) AS pct
FROM pg_stat_activity
WHERE pid <> pg_backend_pid();

-- 2. Find idle-in-transaction sessions
SELECT pid, usename, application_name, state,
       now() - state_change AS idt_duration, left(query, 120) AS query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - interval '5 minutes'
ORDER BY state_change ASC;

-- 3. Find long-running queries
SELECT pid, now() - query_start AS duration, state, left(query, 120) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '5 minutes'
ORDER BY query_start ASC LIMIT 10;

-- 4. Terminate idle-in-tx session
SELECT pg_terminate_backend(1234);

-- 5. Cancel long-running query (graceful)
SELECT pg_cancel_backend(5678);

-- 6. pgbouncer: check waiting clients
-- psql -p 6432 pgbouncer -c 'SHOW POOLS;'
-- psql -p 6432 pgbouncer -c 'RELOAD;'

connection_pool_exhausted

Datadog CustomMetric

Database Critical GA v0.1.8

Detects when

Detects when PostgreSQL or MySQL active connections exceed 80% of max_connections and identifies connection leaks (idle-in-transaction > 30s) vs. load-driven saturation — terminating leak backends and suggesting PgBouncer configuration.

Auto-fixes by

Query pg_stat_activity to identify idle_in_transaction sessions held >30s — treated as leaks if >50% of idle connections
For connection leaks: identify backend PID with longest idle-in-transaction duration, capture query text, then terminate via pg_terminate_backend(PID) only if byteport.io/auto_remediate: "db_terminate" annotation present
For load-driven saturation: identify top 3 longest-running queries via pg_stat_activity.duration, log PIDs and query text, suggest PgBouncer/PgCat via escalation
Pre-termination audit: capture current connection count, top wait events from pg_stat_activity.wait_event, and application_name
Post-fix verification: waits 30s then confirms active connections dropped below 70%

Safety gates

Requires byteport.io/auto_remediate: "db_terminate" annotation for backend termination — never kills without this
Never terminates a connection with an uncommitted transaction that has row locks (checks pg_blocking_pids)
Max 1 connection termination per 2-minute window
Requires byteport.io/auto_remediate: "db_query" for pg_stat_activity reads
Never terminates a connection where usesuper=true in pg_stat_activity

Escalates with

Postmark alert with active vs. max connections, idle-in-transaction count, top wait events, longest-running query text, and whether termination was executed or blocked. GitHub issue with suggested PgBouncer pool settings and leak vs. load classification.

RBAC permissions

postgres: pg_monitor (read pg_stat_activity, pg_stat_replication) postgres: pg_execute (pg_terminate_backend) sysadmin: SHOW PROCESSLIST (MySQL equivalent)

Slack example

🔴 *connection_pool_exhausted* | DB: `prod-postgres` | active: 152 / 200 (76%) | leaks: 8 idle-in-transaction > 30s | PID 18432 terminated | connections now: 98/200 | 🔗 /runbooks/connection_pool_exhausted

Use this runbook →

sql

-- 1. Check connection saturation
SELECT count(*) AS active_conns,
       (SELECT setting FROM pg_settings
        WHERE name = 'max_connections')::int AS max_conns,
       round(count(*) * 100.0 /
         (SELECT setting FROM pg_settings
          WHERE name = 'max_connections')::int, 1) AS pct_used
FROM pg_stat_activity
WHERE state != 'idle';

-- 2. Find idle-in-transaction leaks
SELECT pid, usename, application_name,
       state, now() - state_change AS idle_duration, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - interval '30 seconds'
ORDER BY state_change ASC;

-- 3. Terminate leaking backend
SELECT pg_terminate_backend(PID)
FROM pg_stat_activity
WHERE pid = YOUR_PID;

-- 4. Top queries by duration
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start ASC
LIMIT 5;

replication_lag

CustomMetric Datadog

Database Warning GA v0.1.9

Detects when

Detects PostgreSQL replica lag exceeding 30 seconds for more than 60 seconds — distinguishing between network latency, disk I/O saturation, and long-running queries on the primary — and terminates blocking queries to resume replication.

Auto-fixes by

Query pg_stat_replication to identify lagging replicas: application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
Compare send/replay LSN gap against pg_stat_database.conflict_count to classify root cause: replication_conflict, long_query, or wal_sender_hang
For long-running query on primary blocking WAL application: identify via pg_stat_activity where query_start is far in the past and state != 'idle'
Terminate long-running query on primary via pg_cancel_backend(PID) first (graceful), escalate to pg_terminate_backend after 60s
Log pre-termination snapshot: query text, duration, PID, application_name, and which replica is lagging

Safety gates

Never terminates a query that has been running for less than 60 seconds — slow queries are expected during bulk loads
Never terminates a query that holds an exclusive lock on a relation involved in an active replication slot
Requires byteport.io/auto_remediate: "db_query" annotation for pg_stat_activity reads
Never skips replication or modifies primary configuration (slot settings, wal_level, max_wal_senders)
Escalates to operator (never auto-remediates) when root cause is disk I/O or network — those require infra changes

Escalates with

Postmark alert with lag_seconds per replica, root cause classification, top blocking query text, and suggested read replica scaling actions. GitHub issue for persistent lag (>5min) with full WAL queue diagnostic, pg_stat_database conflict counts, and recommended infra fixes (network, disk, slot config). 6h dedup.

RBAC permissions

postgres: pg_monitor (read pg_stat_replication, pg_stat_activity, pg_stat_database) postgres: pg_signal (pg_cancel_backend, pg_terminate_backend for terminating blockers) postgres: pg_read_all_settings (read max_wal_senders, wal_level, replication slots)

Slack example

⚠️ *replication_lag* | replica: `prod-ro-2` | lag: 47s | cause: long-running query on primary (PID 22109, 3m12s) | action: pg_cancel_backend issued | lag recovering | 🔗 /runbooks/replication_lag

Use this runbook →

sql

-- 1. Check replication lag across all replicas
SELECT application_name, state,
       sent_lsn, write_lsn, flush_lsn, replay_lsn,
       round((pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024), 2) AS lag_mb,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication
ORDER BY pg_wal_lsn_diff(sent_lsn, replay_lsn) DESC;

-- 2. Identify root cause (conflict vs. slow query vs. network)
SELECT count(*) AS replication_conflicts,
       datname,
       numscans
FROM pg_stat_database
WHERE datname IS NOT NULL
GROUP BY datname
ORDER BY replication_conflicts DESC;

-- 3. Find long-running queries on primary blocking replication
SELECT pid, now() - query_start AS duration, state,
       usename, application_name, query
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '60 seconds'
ORDER BY query_start ASC
LIMIT 5;

-- 4. Gracefully cancel blocking query (try cancel first)
SELECT pg_cancel_backend(PID);

-- 5. Force terminate if still running after 60s
SELECT pg_terminate_backend(PID)
FROM pg_stat_activity
WHERE pid = YOUR_PID;

db_connection_pool_exhausted

Prometheus Datadog AlertmanagerWebhook GrafanaAlerting

Database Critical GA v0.1.20

Detects when

Autonomous DB connection pool exhaustion recovery for Postgres, MySQL, and MongoDB: classify idle-in-transaction leak / long-running query / app-side leak / traffic spike / undersized pool, terminate connections within per-minute safety caps, raise pool ceiling within 20% headroom, re-poll utilization at 30s/90s/180s, file structured postmortem.

Auto-fixes by

Parallel fetch: active/idle/idle-in-tx counts, pool ceiling, top 10 long-running queries, client IPs holding connections
Classify exhaustion into 5 classes (priority order): idle_in_transaction_leak → long_running_query → app_side_leak → traffic_spike → pool_undersized
idle_in_transaction_leak: pg_terminate_backend for all iit sessions >5min old (capped at killsPerMinuteCap/min, skips critical allowlist)
long_running_query: terminate queries exceeding runtimeBudgetSeconds (default 5min), capture query text for EXPLAIN postmortem
app_side_leak: flag offending app_name/IP, signal app tier to recycle — never kill connections blindly
traffic_spike: raise pool ceiling (max: dbMaxConnections × 0.8), flag non-critical routes for 503 shedding
pool_undersized: apply +50% ceiling increase if within safety bound, else escalate with config recommendation
Re-poll utilization at 30s / 90s / 180s — pass if <70% and no new pool_timeout signals

Safety gates

Hard cap: killsPerMinuteCap (default 20) connections terminated per minute — rejects all terminations over cap
Critical app allowlist: never terminates connections from listed application_names (e.g. pgbouncer, admin-proxy)
Pool ceiling bound: adjustments capped at dbMaxConnections × 0.8 (20% headroom) — never approaches server hard limit
Replica lag abort: escalates immediately if pg_stat_replication lag >10MB during remediation — avoids split-brain
BYTEPORT_RUNBOOK_DRY_RUN=true logs all intended actions without executing
Always escalates for root-cause awareness even after successful termination (stopgap, not a fix)

Escalates with

Postmark + Slack escalation with full diagnostic bundle: connection class, total/active/iit/long-running counts, client IP distribution, terminated PIDs, EXPLAIN capture of top 5 queries (long_running_query class), before/after utilization %, and postmortem markdown. Escalates immediately on: kill cap hit, critical app protected, replica lag spike, or post-verify utilization still elevated.

RBAC permissions

pg_stat_activity (read — connection diagnostics) pg_stat_replication (read — replica lag check) pg_terminate_backend (execute — idle-in-tx + long-running termination, gated) pg_settings (read — max_connections lookup) pg_reload_conf (execute — pool ceiling apply, gated within safety bound)

Slack example

🚨 *db_connection_pool_exhausted* | host: `pg-prod-01/appdb` | connections: `95/100` | class: `idle_in_transaction_leak` | terminated: 47 iit sessions (>5min) from `worker-pool` | utilization after: 38% | 🔗 https://app.byteport.polsia.app/runbooks/db_connection_pool_exhausted

Use this runbook →

typescript

import { DbConnectionPoolExhaustedRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createPrometheusAdapter } from "byteport-agent/adapters/prometheus";

const adapter = createPrometheusAdapter({
  url: process.env.PROMETHEUS_URL!,
  // Fires db_pool_exhausted when pg connections > 90% of max_connections
});

const registry = new RunbookRegistry();
registry.register(new DbConnectionPoolExhaustedRunbook({
  runtimeBudgetSeconds: 300,      // terminate queries >5min
  killsPerMinuteCap: 20,           // hard safety cap
  criticalAppAllowlist: ["pgbouncer", "admin-proxy"],
  verifyUtilizationThreshold: 70,  // pass verify at <70%
  poolCeilingSafetyFraction: 0.8,  // never exceed 80% of db max
  databaseUrl: process.env.DATABASE_URL,
}));

const agent = createAgent({
  adapters: [adapter],
  runners: [registry.asRunner("runbook")!],
  dryRun: false,
});

await agent.run();

failed_deploy_auto_rollback

Prometheus Datadog K8sEvents Alertmanager GrafanaAlerting

Deploys & Rollbacks Critical GA v0.1.26

Detects when

Detects failed Kubernetes deployments across 5 signals (deploy_health_regression / readiness_probe_failure_post_deploy / deploy_5xx_spike / canary_slo_burn / rollout_stuck), classifies root cause with a false-attribution guard, and autonomously rolls back to the last-known-good revision with a hard guardrail of max 1 rollback/deployment/hour. If the rollback itself fails, escalates immediately with a full diagnostic bundle including the suspect commit SHA.

Auto-fixes by

Diagnose: kubectl rollout history to enumerate revisions, identify current and last-good revision, extract commit SHA from CHANGE-CAUSE annotation, count ready vs. total pods (kubectl get deployment -o json), discover downstream Services via label-selector overlap, detect false attribution (regression predating rollout by >5 min in Warning events)
Classify (first match, confidence ≥0.70): do_not_rollback (regression predates deploy — false attribution, conf=0.85); rollout_stuck (progressDeadlineExceeded + 0 ready pods, conf=0.91); readiness_never_ready (readiness probe failures on new ReplicaSet, conf=0.88); deploy_5xx_spike (HTTP 5xx on new revision, conf=0.90); canary_slo_burn (canary burns SLO budget faster than baseline, conf=0.87); deploy_health_regression (error rate or p95 latency spike post-rollout, conf=0.86). Falls back to escalate when no rollback target can be safely determined.
Plan: emit rollback plan (kubectl rollout undo --to-revision=, expected pod churn, estimated duration, rollback-fails-then-escalate tripwire); post plan to ops channel in dry_run mode
Execute: kubectl rollout undo deployment/ -n --to-revision=; watch ReplicaSet ready count; verify failure signal clears within 5-minute window; if rollback kubectl exits non-zero → escalate immediately (no retry)
Verify: deployment Available=True + Progressing=True, readyReplicas == spec.replicas; retry once at 60s if first check fails; escalate on double-fail with full diagnostic bundle
Post-mortem: structured record — signal, classification, suspect_revision, last_good_revision, suspect_image, last_good_image, commit SHA (from CHANGE-CAUSE annotation), rollback_duration_seconds, time_to_resolve_seconds, resolved bool

Safety gates

Max 1 kubectl rollout undo per deployment per hour (in-process rolling counter)
False attribution guard: if regression predates deploy by >5 min → do_not_rollback, escalate — never roll back a healthy deploy
Rollback-itself-fails tripwire: if kubectl rollout undo returns non-zero → escalate immediately, do not retry
do_not_rollback classification always escalates — no autonomous action taken
Minimum confidence 0.70 before acting autonomously — lower confidence escalates with full diagnostic bundle
Verify double-fail → escalate with full diagnostic bundle (revision history, pod status, condition timeline)
BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run: all kubectl rollout undo calls are no-ops; plan is posted to ops channel only

Escalates with

Postmark + Slack escalation with: deployment name, namespace, signal, classification, confidence, suspect revision + commit SHA, last-good revision, pod readiness before/after, actions taken or blocked, rollback duration, guardrail hit (if any). GitHub issue with full kubectl rollout history + pod status snapshot + classification rationale. 1h dedup per deployment.

RBAC permissions

apps/deployments (get, rollout history, rollout undo, status conditions) apps/replicasets (get — replica count, readiness state) core/pods (list — pod counts by label selector) core/services (list — label selector overlap for blast radius) core/events (list — false attribution guard, Warning events timeline)

Slack example

🚨 *failed_deploy_auto_rollback* | deployment: `checkout-service` | namespace: `production` | signal: `deploy_5xx_spike` | class: `deploy_5xx_spike` | suspect_rev: 14 (commit: a1b2c3d) | action: kubectl rollout undo → rev 13 | all pods ready in 87s | 🔗 /runbooks/failed_deploy_auto_rollback

Use this runbook →

kubectl

# 1. Check rollout status
kubectl rollout status deployment/DEPLOY -n NAMESPACE

# 2. View revision history
kubectl rollout history deployment/DEPLOY -n NAMESPACE

# 3. Inspect specific revision
kubectl rollout history deployment/DEPLOY -n NAMESPACE --revision=N

# 4. Check pod readiness
kubectl get deployment DEPLOY -n NAMESPACE -o wide
kubectl get pods -n NAMESPACE -l app=DEPLOY

# 5. Roll back to last-good revision
kubectl rollout undo deployment/DEPLOY -n NAMESPACE --to-revision=N

# 6. Watch rollback progress
kubectl rollout status deployment/DEPLOY -n NAMESPACE --timeout=5m

# 7. Verify pods are ready
kubectl get pods -n NAMESPACE -l app=DEPLOY -w

deploy_failure

GitHubActions GitHubActions Webhook

Deploys & Rollbacks Warning GA v0.1.10

Detects when

Detects failed deploy workflows via the GitHub Actions API — classifies the failure (test_failure, build_failure, deploy_step_failure, infra_timeout), fetches failed job logs, triggers rollback workflow_dispatch when a previous successful run exists and rollback is enabled, or escalates with a structured summary.

Auto-fixes by

Fetch failed job logs via GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs
Classify failure: test_failure (jest/vitest/pytest patterns) | build_failure (tsc/eslint/webpack) | deploy_step_failure (helm/kubectl/docker push) | infra_timeout (ETIMEDOUT/rate-limit)
Check if a previous successful run exists — if yes and rollbackEnabled: true, dispatch rollbackWorkflowFile via workflow_dispatch with inputs triggered_by=byteport-agent and failed_run_id
Dedup by run_id — each GitHub Actions run ID is emitted at most once per agent session
If rollback not enabled or no previous success: escalate with structured summary including failure class, branch, actor, run URL

Safety gates

Rollback is opt-in: rollbackEnabled must be explicitly set to true — default is false (detect + escalate only)
Requires Workflows: write permission on the GitHub token for rollback dispatch; Actions: read + Contents: read for detection only
Dry-run mode: all steps return dry_run status with descriptions of what would execute
De-duplicates by run_id — no signal is re-raised across fetch() cycles for the same failed run

Escalates with

Postmark + GitHub Issue escalation with: workflow name, branch, actor, run URL, failure class, classification reasoning, whether a previous successful run exists, and whether rollback was triggered. 24h dedup per run_id.

RBAC permissions

Actions: read — list workflow runs, fetch job logs Contents: read — repository metadata Workflows: write — dispatch workflow_dispatch for rollback (optional, only needed when rollbackEnabled: true)

Slack example

↩️ *deploy_failure* | workflow: `Deploy to Production` | run: `#1847` | branch: `feat/new-checkout` | failure: `build_failure` | author: `alex@` | 🔗 https://github.com/Polsia-Inc/byteport/actions/runs/1847

Use this runbook →

typescript

import { createGitHubActionsAdapter, DeployFailureRunbook, RunbookRegistry, createAgent } from "byteport-agent";

const adapter = createGitHubActionsAdapter({
  owner: "myorg",
  repo: "infra",
  // default: matches /deploy|release|production/i
  // deployWorkflowAllowlist: ["Deploy to Production"],
});

const registry = new RunbookRegistry();
registry.register(new DeployFailureRunbook({
  repo: "myorg/infra",
  rollbackEnabled: false,  // set true + rollbackWorkflowFile to auto-rollback
  // rollbackWorkflowFile: "rollback.yml",
}));

const agent = createAgent({
  adapters: [adapter],
  runners: [registry.asRunner("runbook")!],
  dryRun: true, // flip false after reviewing the promote checklist
});

await agent.run();

failed-deploy-auto-rollback

GitHubActions Prometheus Datadog CloudWatch

Deploys & Rollbacks Critical GA v0.1.17

Detects when

Detects a bad deploy via health-check failure, error-rate spike, or crash-loop signal; automatically rolls back to the last green SHA on Render; verifies health green within 90 s; writes a structured postmortem stub. No human required for the majority of 2am deploy incidents.

Auto-fixes by

Diagnose: fetch last 5 deploy SHAs from Render API, confirm failing one is newest
Decide: apply rollback if (a) error rate >5× baseline in 5-min post-deploy window, OR (b) health endpoint non-2xx for >2 consecutive minutes, OR (c) crash loop detected — otherwise escalate with diagnostics
Execute rollback: call Render API to re-deploy the previous (last-green) SHA
Verify: poll health endpoint every 15 s for up to 90 s post-rollback
Postmortem stub: write structured log entry with failing SHA, rolled-back-to SHA, gate reason, TTR, diagnostic snapshot

Safety gates

rollbackEnabled defaults to true — set false to always escalate instead
Rollback only fires when a gate is triggered AND a previous SHA is resolvable
BYTEPORT_RUNBOOK_DRY_RUN=true mode logs all intended actions without executing
Health verification timeout (default 90 s) prevents false positives from slow cold starts
If rollback API call fails, escalates immediately with full diagnostic bundle
If health still failing after 90 s post-rollback, escalates rather than looping

Escalates with

Postmark + Slack escalation with failing SHA, last-green SHA, gate reason, health endpoint status, and postmortem stub. PagerDuty Events v2 if rollback fails or health still degraded post-rollback. 24h dedup by service ID.

RBAC permissions

Render API key with Deploy + Read access (RENDER_API_KEY env var) RENDER_SERVICE_ID env var identifies the target service BYTEPORT_HEALTH_URL env var specifies the health endpoint to poll

Slack example

🔴 *failed_deploy_auto_rollback* | service: `byteport-prod` | failing SHA: `d3adb33f` → rolled back to `abc1234` | gate: `error_rate_12.4pct_vs_baseline_1.2pct` | health: ✅ green after 45s | TTR: 2m 14s | 🔗 /runbooks/failed-deploy-auto-rollback

Use this runbook →

typescript

import { FailedDeployAutoRollbackRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createGitHubActionsAdapter } from "byteport-agent/adapters/github-actions";

// Signal adapter fires deploy_failed when workflow conclusion=failure
const adapter = createGitHubActionsAdapter({
  repo: "my-org/my-app",
  token: process.env.GITHUB_TOKEN,
});

const registry = new RunbookRegistry();
registry.register(new FailedDeployAutoRollbackRunbook({
  serviceId: process.env.RENDER_SERVICE_ID,
  apiKey: process.env.RENDER_API_KEY,
  healthUrl: "https://my-app.onrender.com/health",
  healthWaitSec: 90,
}));

const agent = createAgent({
  adapters: [adapter],
  runners: [registry.asRunner("runbook")!],
  dryRun: false,
});

await agent.run();

application_error

Sentry Sentry Sentry

Application Warning GA v0.1.1

Detects when

Classifies application errors from Sentry by stack-trace heuristics (db_connection, oom, deploy_correlation, upstream_5xx, unknown), auto-rolls back regressions correlated with a recent deploy SHA, scales up for OOM patterns, and escalates novel errors with a structured summary.

Auto-fixes by

Classify error by stack trace: db_connection → correlate with db_connections_saturated runbook
OOM pattern: trigger configurable scale-up command (kubectl scale or equivalent) — opt-in
Regression + deploy SHA within 30-min window: dispatch rollback command — opt-in, default off
Upstream 5xx: log circuit-breaker status and escalate with latency context
Novel/unknown: open incident channel and escalate with full Sentry permalink, event count, and affected users

Safety gates

Rollback requires explicit rollbackEnabled: true + rollbackCommand — default is escalate-only
Scale-up requires explicit scaleUpEnabled: true + scaleUpCommand — default is escalate-only
Deploy correlation window is configurable (default 30 min) — prevents stale SHA false positives
Dry-run mode: logs classification and proposed action without executing any mutation
All actions include structured escalation reason with Sentry shortId, permalink, event count, and affected users

Escalates with

Postmark + GitHub Issue escalation with: Sentry shortId, issue title, permalink, event count, affected users, classification reasoning, and specific remediation guidance per class. Dedup by issue ID + first-seen date.

RBAC permissions

Sentry: auth-token with project:read + event:read + org:read (no write scope required) Rollback: depends on rollbackCommand (e.g. kubectl or helm — caller-supplied) Scale-up: depends on scaleUpCommand (e.g. kubectl scale — caller-supplied)

Slack example

⚠️ *application_error* | project: `api-service` | class: `db_connection` | events: 142 in 5m | correlated: deploy SHA `a3f9c12` 18 min ago | action: correlated with db_connections_saturated runbook, checking connection pool | 🔗 https://sentry.io/organizations/example/issues/?project=1

Use this runbook →

typescript

import { createSentryAdapter, ApplicationErrorRunbook, RunbookRegistry, createAgent } from "byteport-agent";

const adapter = createSentryAdapter({
  org: "myorg",
  authToken: process.env.SENTRY_AUTH_TOKEN,
  project: "backend-api",         // optional: filter to one project
  windowSeconds: 300,              // look-back window (default: 5 min)
  spikeMultiplier: 2.0,            // 2× baseline events/min fires spike signal
});

const registry = new RunbookRegistry();
registry.register(new ApplicationErrorRunbook({
  rollbackEnabled: false,          // set true + rollbackCommand to auto-rollback regressions
  // rollbackCommand: "helm rollback myapp 0 -n production",
  scaleUpEnabled: false,           // set true + scaleUpCommand to auto-scale on OOM
  // scaleUpCommand: "kubectl scale deployment api -n production --replicas=6",
}));

const agent = createAgent({
  adapters: [adapter],
  runners: [registry.asRunner("runbook")!],
  dryRun: true,
});

await agent.run();

7 pager classes that no longer wake an engineer.

Linux host

Containers

Linux host

Kubernetes

TLS / Certificates

Database

Deploys & Rollbacks

Deploys & Rollbacks

Application

Coming in the next two weeks

Don't see your pager class?