Linux host
Detects OOM-kill events and sustained memory pressure (≥90% RSS), classifies root cause across 5 classes (memory_leak / traffic_spike / undersized_limit / noisy_neighbor / runtime_misconfig), and autonomously remediates within hard guardrails — max 3 restarts/hour, max 2× limit bump, never touch stateful workloads without allowlist.
- Diagnose in parallel: cgroup memory.stat, /proc/meminfo, top-N processes by RSS, container memory.limit vs. usage, recent OOM-kill dmesg entries, k8s events, JVM jcmd GC.heap_info, Node.js heap snapshot endpoint
- Classify root cause (first match): memory_leak (RSS climb >2h, no traffic correlation), traffic_spike (RSS climb correlates with rps), undersized_limit (p99 RSS >80% of limit, steady-state), noisy_neighbor (node-level pressure, victim well-behaved), runtime_misconfig (JVM/Node heap > container limit)
- memory_leak: capture heap snapshot via jcmd GC.heap_dump or SIGUSR1 for artifact; restart with exponential backoff (30s × 2^N, cap 300s); kubectl rollout restart for k8s, systemctl restart for bare-metal
- traffic_spike: if HPA available, annotate deployment to trigger scale-out; else patch memory.limit upward (max 2× current, max policy-allowed); dry-run=server validation before applying
- undersized_limit: propose limit increase via PR to manifest repo with suggested value (p99 usage × 1.5); never auto-apply
- noisy_neighbor: cordon node, drain non-critical pods (--ignore-daemonsets --delete-emptydir-data --grace-period=60); page if drain fails
- runtime_misconfig: propose corrected -Xmx / --max-old-space-size with 25% cgroup headroom; never auto-apply
- Max 3 auto-restarts per service per hour (in-process rolling counter)
- Memory limit bump capped at 2× current value — never exceed MAX_LIMIT_BUMP_MULTIPLIER
- Never touch StatefulSet workloads without explicit BYTEPORT_STATEFUL_ALLOWLIST entry
- undersized_limit and runtime_misconfig classifications always result in PR proposal, never auto-apply
- Classify confidence must be ≥0.70 before acting autonomously — lower confidence escalates
- noisy_neighbor: drain fails → page immediately, never force-delete
- Verify: RSS must drop below 70% of limit within 10 min, no new OOM-kills, error rate normalized
Postmark alert with service, classification, confidence score, heap snapshot path (if captured), cgroup limit vs. usage, top-5 RSS consumers, k8s node state, and whether action was taken or blocked by guardrail. GitHub issue includes dmesg OOM log, full diagnostic bundle, and classification rationale. 1h dedup per service.
core/pods (read pod spec, memory limits, kubectl top pod)
core/pods/exec (kubectl top requires exec for Metrics Server)
core/namespaces (read byteport.io annotations)
core/nodes (read MemoryPressure condition, cordon/drain)
apps/deployments (rollout restart, annotate HPA trigger)
autoscaling/horizontalpodautoscalers (check HPA existence)
🔴 *memory_pressure_oom* | service: `payments-api` | class: `memory_leak` | confidence: 0.82 | action: heap snapshot captured → pod restarted (backoff 30s) | RSS: 93% → 48% | resolved in 4m12s | 🔗 /runbooks/memory_pressure_oom
# 1. Check cgroup memory usage
cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/memory.max
# 2. Top RSS consumers
ps aux --sort=-%mem | head -10
# 3. OOM-kill log
dmesg --level=err,crit | grep -i 'oom\|out of memory' | tail -5
# 4. K8s pod memory
kubectl top pod POD -n NAMESPACE
kubectl get events -n NAMESPACE --field-selector reason=OOMKilling
# 5. JVM heap info
jcmd PID GC.heap_info
# 6. Restart with backoff (systemctl)
systemctl restart SERVICE
# 7. K8s rolling restart
kubectl rollout restart deployment/DEPLOY -n NAMESPACE
Autonomous full disk-space reclaim: parallel diagnose (df/inode/du/docker/journald), rotate+compress logs (logrotate+gzip>100MB+journal vacuum 7d), evict container layers (docker system prune or crictl rmi --prune on K8s nodes), drop expired tmp+apt/yum/pip/npm caches, and optionally grow cloud-managed volumes (EBS ModifyVolume / GCP disks.resize +25%, capped at policy.maxVolumeGb). Stops as soon as disk usage drops below 80%. Escalates with +N GB recommendation based on current excess and top-5 largest dirs if usage still >85%.
- Diagnose in parallel: df -h + df -i (inodes), top-10 space consumers via du -sh, docker system df, journald --disk-usage
- Rotate+compress logs: force logrotate, gzip files >100MB older than 24h in /var/log, journalctl --vacuum-time=7d
- Evict container layers: docker system prune -af --filter until=72h + unused volumes (bare-metal/VMs); crictl rmi --prune on K8s nodes
- Drop expired caches: /tmp + /var/tmp files older than 7 days, apt/yum/dnf package cache, pip cache purge, npm cache clean --force
- Grow cloud volume (opt-in, policy.allowVolumeGrow=true): EBS ModifyVolume / GCP disks.resize +25% capped at policy.maxVolumeGb, then growpart + resize2fs/xfs_growfs
- Stop-on-floor: re-reads disk % after each step; exits early when usage drops below freeSpaceFloorPct (default 80%)
- Volume grow cap: policy.maxVolumeGb (default 500 GB) — never grows beyond the configured ceiling
- Volume grow opt-in: policy.allowVolumeGrow defaults to false; cloud grow only fires when explicitly enabled
- BYTEPORT_RUNBOOK_DRY_RUN=true logs all intended actions without executing any shell commands
- Escalates with sized recommendation (+N GB) when usage remains above escalateAbovePct (default 85%) after all steps
Postmark + Slack escalation when usage exceeds escalateAbovePct (default 85%) after all reclaim steps. Escalation includes: current disk % per mount, total bytes reclaimed per step, top-5 largest directories, estimated GB needed based on current excess (+N GB recommendation), and whether volume grow was skipped (and why).
filesystem read (df, df -i, du — read-only diagnostics)
docker socket (docker system prune — container runtime, opt-in)
crictl (crictl rmi --prune — K8s containerd, opt-in)
package manager (apt-get clean / yum clean all / pip cache purge / npm cache clean — write, opt-in)
AWS EC2 ModifyVolume API (EBS grow, requires ec2:ModifyVolume IAM permission, opt-in)
GCP Compute Engine disks.resize API (GCP grow, requires compute.disks.resize permission, opt-in)
🚨 *disk_space_auto_reclaim* | host: `prod-api-03` | disk: `92% → 71%` on `/` | reclaimed: 18.3 GB (logs: 4.2 GB, containers: 11.1 GB, caches: 3.0 GB) | 🔗 https://app.byteport.polsia.app/runbooks/disk_space_auto_reclaim
import { DiskSpaceAutoReclaimRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createPrometheusAdapter } from "byteport-agent/adapters/prometheus";
const adapter = createPrometheusAdapter({
url: process.env.PROMETHEUS_URL!,
// Fires disk_pressure, inode_exhaustion, volume_full, disk_almost_full
});
const registry = new RunbookRegistry();
registry.register(new DiskSpaceAutoReclaimRunbook({
freeSpaceFloorPct: 80,
escalateAbovePct: 85,
isK8sNode: false,
policy: {
allowVolumeGrow: true, // set false to disable cloud volume expansion
maxVolumeGb: 500, // safety cap
},
}));
const agent = createAgent({
adapters: [adapter],
runners: [registry.asRunner("runbook")!],
dryRun: false,
});
await agent.run();
Containers
Breaks container restart loops by diagnosing root cause from logs, exit codes, and image SHA — then applying the correct class-specific fix: escalate on config/secret errors, chain on dependency failures, bump memory on OOM, rollback on port-bind or image-pull failures.
- Fetch last 200 log lines + last 5 exit codes + restart count + image SHA for diagnosis
- Classify failure: config_error | dependency_down | oom_at_startup | port_conflict | image_pull_failure | unknown
- Config error: check secret existence in orchestrator, surface exact diff — escalate, never auto-restore
- Dependency down: TCP health-check named upstreams, defer and chain incident if dependency is itself down
- OOM at startup: bump memory +25% (capped 2× original) via kubectl patch, redeploy, verify rollout
- Port bind / image pull: kubectl rollout undo (K8s) or Render Deploys API rollback to last-green SHA
- Unknown: escalate with structured incident report (logs, exit codes, image SHA, restart count)
- Hard cap: 1 auto-remediation per container per hour — rejects all subsequent triggers within the window
- OOM memory cap: never bumps beyond 2× original limit (configurable oomMemoryCapMultiplier)
- Config error: never auto-restores secrets — always surfaces diff and escalates
- Dependency chain: defers remediation when the upstream dependency is itself down (avoids masking)
- BYTEPORT_RUNBOOK_DRY_RUN=true logs all intended actions without executing
- Every action emits a structured audit event with failure class, action taken, and outcome
Postmark + Slack escalation with full diagnostic bundle: logs excerpt, exit codes, image SHA, failure class, secret diff (config_error), upstream health results (dependency_down), or structured incident report (unknown). PagerDuty Events v2 if remediation fails or hourly cap is hit. 24h dedup by container key (name@host).
core/pods/log (read previous container logs)
core/pods (describe for exit code and restart count)
apps/deployments (patch memory limits, rollout undo, rollout status)
core/secrets (read to verify secret existence — never write)
Render: API key with Deploy + Read access (RENDER_API_KEY + RENDER_SERVICE_ID)
🚨 *container_restart_loop* | container: `api-server` | namespace: `production` | failure_class: `oom_at_startup` | exit_code: `137 (OOMKilled)` | restarts: 6 | action: memory bumped 256Mi → 320Mi, rollout verified | 🔗 https://byteport.polsia.app/runbooks/container_restart_loop_breaker
import { ContainerRestartLoopBreakerRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createK8sEventsAdapter } from "byteport-agent/adapters/k8s-events";
const adapter = createK8sEventsAdapter({
namespace: "production",
// Fires container_restart_loop when restartCount > 5 in 10 min
});
const registry = new RunbookRegistry();
registry.register(new ContainerRestartLoopBreakerRunbook({
// Post-remediation health check
healthUrl: process.env.BYTEPORT_HEALTH_URL,
healthWaitSec: 300, // hold for 5 min
// OOM memory bump config
oomMemoryBumpPct: 25, // +25% on OOM
oomMemoryCapMultiplier: 2.0, // never exceed 2× original
// Render rollback credentials (used for port/image-pull class)
renderServiceId: process.env.RENDER_SERVICE_ID,
renderApiKey: process.env.RENDER_API_KEY,
}));
const agent = createAgent({
adapters: [adapter],
runners: [registry.asRunner("runbook")!],
dryRun: false,
});
await agent.run();
Linux host
Detects sustained CPU saturation on Linux hosts (>85% for 90 continuous seconds) and safely restarts non-critical high-CPU processes — with oom_score_adj tuning as an intermediate step before full restart.
- Read /proc/stat to confirm sustained high CPU (distinguish spike from sustained load)
- Identify top CPU consumers via ps aux --sort=-%cpu | head -5
- Check for byteport.restart=safe label on identified processes — only restarts opted-in services
- Apply oom_score_adj reduction of -200 to -600 on non-critical high-CPU processes to reduce scheduling priority
- For memory-leak-driven CPU: identify via vmstat and suggest memory limit reduction
- Restart via systemd/service only if CPU does not drop below 80% within 2 minutes of oom_score_adj change
- Requires byteport.io/auto-remediate: "cpu" annotation on the host
- Never restarts a process without the byteport.restart=safe label
- Requires sustained 90s threshold — one-time CPU spike does not trigger
- oom_score_adj reduction is tried first (non-disruptive) before restart
- Skipped if priorityclass.kubernetes.io/no-autoremediation: "true" on the process namespace
- 5-minute global cooldown per host between restart attempts
Postmark alert with: top CPU consumers, vmstat summary, oom_score_adj changes made, restart outcome, and suggested long-term fix (resource limits, hpa configuration, or process replacement). GitHub issue for repeated CPU saturation events on the same host.
core/pods (read byteport.restart=safe label and priority class on container)
core/namespaces (read byteport.io/auto-remediate annotation)
sysadmin: /proc/stat (read CPU metrics)
sysadmin: systemctl (restart services with safe label)
⚠️ *cpu_saturation* | host: `prod-api-03` | avg CPU: 91% for 3 min | top process: `node /app/server.js` (PID 18841) | action: oom_score_adj -300 applied | 🔗 /runbooks/cpu_saturation
# 1. Confirm sustained CPU saturation
vmstat 1 5
# Check %cpu column — must stay >85 across all 5 samples
# 2. Identify top CPU consumers
ps aux --sort=-%cpu | head -10
# 3. Check safe-restart label on processes
kubectl get pods -A -l 'byteport.restart=safe' -o wide
# 4. Reduce scheduling priority with oom_score_adj (non-disruptive)
echo -400 | sudo tee /proc/PID/oom_score_adj
# 5. Restart safe process if CPU stays critical
systemctl restart critical-app.service
# 6. Verify CPU recovery
sleep 120 && vmstat 1 3
Auto-remediates disk pressure on Linux hosts: log rotation, Docker prune, /tmp cleanup. Threshold: disk_usage > 85% warning, > 95% critical.
- Rotate journal logs: journalctl --vacuum-size=50MB
- Delete rotated logs /var/log/*.gz older than 7 days
- Run docker system prune -af to remove dangling images and stopped containers
- Clear /tmp entries unused for over 48 hours
- Verify disk drops below 80% threshold within 2 minutes — otherwise escalates
- Requires byteport.io/auto-remediate: "disk" annotation on the host/node resource
- Dry-run mode: --dry-run logs what would be pruned without executing
- Global cooldown: 5-minute minimum between remediation cycles on the same host
- Never deletes /var/log/lastlog, /var/log/wtmp, or active audit logs
GitHub issue (labeled byteport + severity:warning) + Postmark alert with host, mount point, current %, trend over last 5 min, full action audit log, and 3 recommended next steps.
sysadmin: journalctl --vacuum-size (root or journal group)
sysadmin: rm (delete rotated logs in /var/log)
sysadmin: docker (docker system prune)
⚠️ *disk_pressure* | host: `prod-api-03` | `/var/log` at 91% | action: journal vacuum + docker prune | freed: 4.2 GB | now at 67% | 🔗 /runbooks/disk_pressure
# 1. Inspect disk pressure
df -h | grep -v tmpfs | awk '$5+0 > 85 {print}'
# 2. Rotate journal logs
journalctl --vacuum-size=50MB
# 3. Prune Docker artifacts
docker system prune -af
# 4. Clear stale /tmp
find /tmp -type f -atime +2 -delete
# 5. Verify resolution
sleep 120 && df -h | grep -v tmpfs
End-to-end autonomous disk cleanup: diagnoses current disk usage and top space consumers, then runs ordered remediation steps — log rotation, journald vacuum, /tmp sweep, optional Docker prune, optional package cache prune — stopping as soon as usage drops below the configurable free-space floor (default 80%). Escalates with a per-step bytes-reclaimed summary and residual top-consumer paths if the floor is not restored after all safe steps.
- Diagnose: read mount usage % and top 10 space consumers via du on known-safe roots
- Rotate + gzip logs older than N days (default: 7d) in /var/log — configurable allowlist; hard safe-list protects /var/log/lastlog, /var/log/wtmp, active audit logs
- journalctl --vacuum-size=200M (configurable) to trim systemd journal
- Clear /tmp and /var/tmp files older than N hours (default: 24h)
- Docker system prune -af (opt-in: dockerPruneEnabled=true) — prunes dangling images and build cache if Docker socket present
- apt-get clean / yum clean all / dnf clean all (opt-in: packageCachePruneEnabled=true) — clears OS package manager caches
- Stop-on-floor: re-reads disk usage after each step; stops as soon as usage drops below freeSpaceFloorPct (default: 80%)
- Escalate with per-step bytes-reclaimed, total freed, and residual top-consumer paths if floor not restored
- Hard safe-list: /var/log/lastlog, /var/log/wtmp, /var/log/btmp, /var/log/audit/audit.log, active access.log/error.log/auth.log — never touched
- Docker prune disabled by default (dockerPruneEnabled=false) — operator must explicitly opt in
- Package cache prune disabled by default (packageCachePruneEnabled=false) — operator must explicitly opt in
- Dry-run mode: BYTEPORT_RUNBOOK_DRY_RUN=true logs all intended actions without executing any writes
- Stop-on-floor: never runs later steps if earlier steps already resolved the pressure
- 24h per-host cooldown enforced by RunbookCooldown layer — prevents runbook storm on repeated signals
Postmark + GitHub Issue (labeled byteport + severity:critical + disk_pressure) with: host, mount point, pre/post disk %, per-step bytes reclaimed, total freed, residual top consumers (du output), and suggested next steps. 24h dedup per host. Escalates only after all configured safe steps are exhausted.
sysadmin: journalctl --vacuum-size (root or journal group)
sysadmin: gzip + rm (rotate logs in /var/log — root or log group)
sysadmin: find/rm on /tmp and /var/tmp
sysadmin: docker system prune (optional, requires Docker socket access)
sysadmin: apt-get/yum/dnf clean (optional, requires package manager access)
✅ *disk-pressure-auto-cleanup* | host: `prod-api-03` | `/` at 92% → 71% | freed: 6.8 GB (logs: 4.1 GB, journal: 2.1 GB, /tmp: 600 MB) | steps: rotate_logs + journal_vacuum | floor restored | 🔗 /runbooks/disk-pressure-auto-cleanup
import { DiskPressureAutoCleanupRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createDatadogMetricsAdapter } from "byteport-agent/adapters/datadog-metrics";
const adapter = createDatadogMetricsAdapter();
const registry = new RunbookRegistry();
registry.register(new DiskPressureAutoCleanupRunbook({
freeSpaceFloorPct: 80, // stop when usage drops below 80%
logRotationAgeDays: 7, // rotate logs older than 7 days
journalVacuumSize: "200M", // trim journal to 200MB
tmpMaxAgeHours: 24, // sweep /tmp files older than 24h
dockerPruneEnabled: false, // set true to opt in
packageCachePruneEnabled: false, // set true to opt in
}));
const agent = createAgent({
adapters: [adapter],
runners: [registry.asRunner("runbook")!],
dryRun: true,
thresholds: { disk_pressure: 85 },
});
await agent.run();
Detects host memory saturation (>90% for 90 seconds) and distinguishes page-cache pressure from actual memory exhaustion before safely restarting leaking processes.
- Inspect /proc/meminfo and cgroup stats to identify top resident-memory processes
- Check for byteport.restart=safe label on processes — only restarts opted-in services
- Attempt oom_score_adj reduction on non-critical heavy consumers to buy headroom
- Restart identified leaking process via its configured init system (systemd/service)
- Log post-restart memory baseline for comparison in auto-generated postmortem
- Requires byteport.io/auto-remediate: "memory" annotation on the host
- Refuses to restart any process without the byteport.restart=safe label
- Pre-restart audit captures process tree, open FD count, and network connections
- Post-restart verification waits 30s then checks memory returned below 85%
- Escalates if restart fails or memory does not recover within 2 minutes
Postmark alert with host, top memory consumers, swap pressure, restart attempt result, and pre/post audit snapshot. GitHub issue includes cgroup path, RSS, and suggested memory.request / memory.limit adjustments.
sysadmin: /proc/meminfo (read memory stats)
sysadmin: /proc/PID/oom_score_adj (adjust scheduling priority)
sysadmin: systemctl (restart services with safe label)
🔴 *memory_pressure* | host: `prod-api-01` | memory: 94% for 90s | leak: `python3 worker.py` (RSS 2.1 GB) | action: oom_score_adj -600 → memory recovered to 71% | 🔗 /runbooks/memory_pressure
# 1. Check memory pressure
cat /proc/meminfo | grep -E 'MemAvailable|MemFree|SwapFree'
# 2. Find top RSS processes
ps aux --sort=-%mem | head -10
# 3. Read cgroup memory stats
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
# 4. Reduce oom_score_adj (value between -1000 and 1000)
echo -600 | sudo tee /proc/PID/oom_score_adj
# 5. Restart via systemd (requires label)
systemctl restart target.service
Detects runaway memory consumers and autonomously restarts the offending service with deploy-correlation hand-off and crash-loop guard. Works across Render, Kubernetes, and bare-metal/systemd.
- Identify top 5 memory consumers via ps aux --sort=-%mem + cgroup memory.current
- Classify consumer: managed service (known to BytePort) vs unmanaged/system process
- Deploy correlation: if release shipped <60 min ago on the top consumer → hand off to failed-deploy-auto-rollback instead
- Crash-loop guard: abort and escalate if service restarted ≥2× in last 30 min
- Graceful restart via best-available method: Render Deploys API / kubectl rollout restart / systemctl restart
- Post-restart memory verify: re-sample after 3 s stabilization, confirm below recoveryThresholdPct (default 80%)
- Crash-loop guard: if same service has been auto-restarted ≥2× in 30 min → STOP and escalate with restart history
- Deploy hand-off: if a release fired within 60 min → escalate to FailedDeployAutoRollbackRunbook, not restart
- BYTEPORT_RUNBOOK_DRY_RUN=true logs all intended actions without executing any restart
- Render restart only executes when renderServiceId + renderApiKey are configured
- Falls back to structured escalation when restart method cannot be determined (unknown env)
- recoveryThresholdPct configurable (default 80%) — escalates if post-restart memory remains elevated
Postmark + Slack escalation with full top-5 memory attribution table, restart history and crash-loop count, recent deploy correlation evidence, suspected leak signature, and runtime env classification. PagerDuty Events v2 if crash-loop guard triggered or post-restart memory still elevated. 24h dedup by service key (name@host).
systemd: systemctl restart on the target service unit
Kubernetes: apps/deployments/rollout (kubectl rollout restart)
Render: API key with Deploy + Read access (RENDER_API_KEY + RENDER_SERVICE_ID)
🔴 *memory_pressure_auto_recover* | host: `prod-api-01` | usage: 94% → 61% | top consumer: `api-server` (1.2 GB RSS) | action: systemctl restart api-server | TTR: 28 s | 🔗 https://byteport.polsia.app/runbooks/memory_pressure_auto_recover
import { MemoryPressureAutoRecoverRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createPrometheusAdapter } from "byteport-agent/adapters/prometheus";
const adapter = createPrometheusAdapter({
endpoint: "http://prometheus:9090",
// memory_pressure signal fires when MemAvailable < 10%
});
const registry = new RunbookRegistry();
registry.register(new MemoryPressureAutoRecoverRunbook({
// Render-native restart
renderServiceId: process.env.RENDER_SERVICE_ID,
renderApiKey: process.env.RENDER_API_KEY,
// Safety: stop if restarted >=2x in 30 min
crashLoopMaxRestarts: 2,
crashLoopWindowMs: 30 * 60 * 1000,
// Hand off to rollback if deploy in last 60 min
deployLookbackMinutes: 60,
// Confirm memory below 80% before declaring resolved
recoveryThresholdPct: 80,
}));
const agent = createAgent({
adapters: [adapter],
runners: [registry.asRunner("runbook")!],
dryRun: false,
});
await agent.run();
Kubernetes
Detects CrashLoopBackOff and pod restart-loop incidents (10th pager class). 5-branch classifier: config_error_missing_env, image_pull_failure, liveness_probe_misconfigured, oom_at_startup, dependency_not_ready. Autonomous actions: rollback bad image tag, bump memory limit 1.5×, loosen probe timing, hold pod and wait for dependency healthcheck. Escalates on unknown classification or when guardrail ceilings are hit.
- Detect: pod status.containerStatuses[].state.waiting.reason == CrashLoopBackOff; restart count ≥3 (warn), ≥10 (high), ≥30 (critical); exit code 137 for pod_not_ready_oom; Unhealthy events for probe failures
- Diagnose: kubectl logs --previous (last 2000 chars), kubectl describe pod events (ImagePullBackOff, configmap/secret not found, Unhealthy), liveness/readiness probe spec (initialDelaySeconds, failureThreshold), ConfigMap/Secret existence, node MemoryPressure condition
- Classify (first-match, confidence ≥0.70): oom_at_startup (exit 137, conf=0.93 → bump memory 1.5×); image_pull_failure (ImagePullBackOff in events, conf=0.94 → rollback); config_error_missing_env (ConfigMap/Secret not found, conf=0.91 → audit + surface); liveness_probe_misconfigured (probe too tight + healthy startup logs, conf=0.86 → loosen probe); dependency_not_ready (exit 0/1 + connection-refused patterns, conf=0.88 → hold + healthcheck)
- Remediate: oom_at_startup → kubectl patch Deployment memory limit 1.5× (ceiling 2×, max 1/pod/24h); image_pull_failure → kubectl rollout undo Deployment (max 1/service/hour); config_error_missing_env → kubectl get configmaps/secrets audit, surface missing names (never auto-creates Secrets); liveness_probe_misconfigured → kubectl patch Deployment livenessProbe.initialDelaySeconds=max(current×2,30) failureThreshold=5; dependency_not_ready → kubectl annotate pod with dependency-hold-since, probe dependency endpoints
- Verify: pod reaches Ready within 5-min stabilisation window; restart count must not increment; no Unhealthy probe events
- Escalate: unknown classification after 2 passes; memory ceiling hit; dependency healthcheck timeout; rollback rate limit reached
- Max 1 rollback per service per 1-hour window (_rollback_history in-process dict)
- Max 1 memory limit bump per pod per 24h (_memory_bump_history in-process dict)
- Memory bump ceiling: never exceed 2× original limit (OOM_LIMIT_BUMP_CEILING_MULTIPLIER=2.0)
- Never auto-create Secrets — surfaces missing key names to operator, requires manual action
- byteport.io/skip-crashloop-remediation=true annotation skips all actions
- Minimum confidence 0.70 before acting autonomously
- Never delete PersistentVolumeClaims or PersistentVolumes
- BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run: all kubectl patch/rollout/annotate calls are no-ops
Postmark + Slack escalation bundle: pod/namespace/node, signal, classification, confidence, restart count, exit code, previous logs snippet. GitHub issue with full kubectl describe pod + event timeline + classification rationale. 24h dedup per pod + classification.
core/pods (get, list — read container status, waiting reason, exit code, restart count)
core/pods (annotate — dependency-hold annotation)
apps/deployments (get, patch — memory limit bump, probe patch, rollout undo)
core/events (list — image pull errors, configmap/secret not found, Unhealthy events)
core/nodes (get — MemoryPressure condition check)
core/configmaps (list — audit existing ConfigMaps for config_error class)
core/secrets (list — audit secret names only, never values, for config_error class)
🔁 *crashloopbackoff_v2* | pod: `production/api-7d9f-xk2p9` | signal: `crashloopbackoff` | restarts: 4 | class: `dependency_not_ready` | conf: 0.88 | action: dependency-hold annotated, waiting for postgres healthcheck | 🔗 /runbooks/crashloopbackoff_v2
# 1. Check CrashLoopBackOff state
kubectl get pod POD -n NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
kubectl get pod POD -n NAMESPACE -o jsonpath='{.status.containerStatuses[0].restartCount}'
# 2. Get previous container logs
kubectl logs POD -n NAMESPACE --previous --tail=100
# 3. Check events for image pull errors and missing config
kubectl get events -n NAMESPACE --field-selector involvedObject.name=POD
# 4. Check liveness probe settings
kubectl get pod POD -n NAMESPACE -o jsonpath='{.spec.containers[0].livenessProbe}'
# 5. OOM at startup: bump memory limit
kubectl patch deployment DEPLOYMENT -n NAMESPACE \
--type=strategic \
--patch='{"spec":{"template":{"spec":{"containers":[{"name":"CONTAINER","resources":{"limits":{"memory":"384Mi"}}}]}}}}'
# 6. Image pull failure: rollback
kubectl rollout undo deployment/DEPLOYMENT -n NAMESPACE
# 7. Probe misconfigured: loosen initialDelay
kubectl patch deployment DEPLOYMENT -n NAMESPACE \
--type=strategic \
--patch='{"spec":{"template":{"spec":{"containers":[{"name":"CONTAINER","livenessProbe":{"initialDelaySeconds":30,"failureThreshold":5}}]}}}}'
# 8. Dependency hold: annotate pod
kubectl annotate pod POD -n NAMESPACE byteport.io/dependency-hold-since=$(date -u +%Y-%m-%dT%H:%M:%SZ)
Detects OOMKilled pods and pod memory pressure across 6 signals, classifies root cause across 5 classes (genuine_memory_leak, undersized_limit, traffic_spike, node_pressure, runtime_misconfig), and autonomously bumps memory limits within safe bounds, triggers HPA scale-out, cordons/drains memory-pressured nodes, or surfaces JVM/Go/Node heap misconfig as a PR proposal. Max 1 limit bump per pod per 24h. Never auto-bumps on genuine memory leaks — escalates with heap profile instead.
- Detect: kubectl exit reason OOMKilled; NodeCondition MemoryPressure; JVM/Go OOM markers in previous logs; metrics-server working-set vs limit
- Diagnose: kubectl describe pod (restart count, memory limit/request, exit reason), HPA discovery, recent deploy timestamp (<30min = deploy regression candidate), runtime hints (JVM -Xmx from JAVA_TOOL_OPTIONS, Go GOMEMLIMIT, Node --max-old-space-size from env), Prometheus series for trend and RPS correlation (BYTEPORT_PROMETHEUS_URL)
- Classify (first match, confidence ≥0.70): runtime_misconfig (heap config > container limit, conf=0.93 → PR proposal); node_pressure (NodeCondition MemoryPressure + pod healthy <70%, conf=0.88 → cordon+drain); genuine_memory_leak (rising RSS across 3+ restarts, no traffic correlation, conf=0.85 → escalate, never auto-bump); traffic_spike (RSS correlates with RPS, conf=0.86 → HPA then limit bump); undersized_limit (RSS plateau ≥85% of limit, no trend, conf=0.84 → bump limit)
- Remediate: undersized_limit → kubectl patch Deployment/StatefulSet memory limit (≤2× current, ≤80% of node allocatable, 1 bump/pod/24h); traffic_spike → annotate HPA to trigger scale-out, fallback to limit bump; node_pressure → kubectl cordon then kubectl drain (--ignore-daemonsets, PDB-aware); runtime_misconfig → generate -Xmx/GOMEMLIMIT/NODE_OPTIONS fix as PR description (never auto-applied); genuine_memory_leak → emit escalation bundle, open ticket
- Verify: pod reaches Ready; working-set < 80% of new limit; 0 OOMKill events in last 5 min; no cascading evictions on same node; retry once at 15s; escalate on double-fail
- Post-mortem: structured record — pod, namespace, node, signal, classification, old/new limit, actions_taken, limit_bumped, hpa_scaled, node_cordoned, escalated bool, time_to_resolve
- Max 1 autonomous memory limit bump per pod per 24h (_limit_bump_history in-process dict) — refuses second bump with escalation
- Never exceed 2× the original memory limit
- Never set limit above 80% of node allocatable memory — prevents cascading OOMs on the node
- byteport.io/skip-memory-remediation=true annotation on pod skips all autonomous actions
- genuine_memory_leak classification always escalates — never auto-bumps to avoid masking real bugs
- runtime_misconfig always surfaces as PR proposal — never auto-applies env var changes
- Max 1 concurrent node drain for node_pressure class (_active_node_drains lock)
- Minimum confidence 0.70 before acting autonomously
- BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run: all kubectl patch/cordon/drain calls are no-ops
Postmark + Slack escalation bundle: pod/namespace/node, signal, classification, confidence, restart count, memory limit vs usage, actions taken or blocked, heap profile path (if genuine_memory_leak), Prometheus series summary. GitHub issue with full kubectl describe pod output + exit reason timeline + classification rationale. 24h dedup per pod + classification.
core/pods (get, list — read pod spec, restart count, exit reason, metrics)
apps/deployments (get, patch — memory limit bump on Deployment workloads)
apps/statefulsets (get, patch — memory limit bump on StatefulSet workloads)
autoscaling/horizontalpodautoscalers (get, list, annotate — HPA scale-out trigger)
core/nodes (get — read NodeCondition MemoryPressure, allocatable memory; cordon)
core/nodes (drain — eviction API; respects PodDisruptionBudgets)
core/events (list — OOMKill events, eviction events for verify step)
💥 *oomkilled_memory_pressure* | pod: `production/api-7d9f-xk2p9` | signal: `pod_oom_killed` | class: `undersized_limit` | conf: 0.84 | action: memory limit 256Mi → 512Mi on Deployment/api | pod Ready in 1m43s | working-set 42% of new limit ✓ | 🔗 /runbooks/oomkilled_memory_pressure
# 1. Check pod OOMKill history
kubectl describe pod POD -n NAMESPACE | grep -A5 'Last State'
kubectl get events -n NAMESPACE --field-selector involvedObject.name=POD
# 2. Check memory limits
kubectl get pod POD -n NAMESPACE -o jsonpath='{.spec.containers[0].resources}'
# 3. Check working-set vs limit (requires metrics-server)
kubectl top pod POD -n NAMESPACE --containers
# 4. Check HPA
kubectl get hpa -n NAMESPACE
# 5. Bump memory limit (Deployment)
kubectl patch deployment DEPLOYMENT -n NAMESPACE \
--type=strategic \
--patch='{"spec":{"template":{"spec":{"containers":[{"name":"CONTAINER","resources":{"limits":{"memory":"512Mi"}}}]}}}}'
# 6. Trigger HPA scale-out
kubectl annotate hpa HPA_NAME -n NAMESPACE \
byteport.io/scale-triggered=$(date -u +%Y-%m-%dT%H:%M:%SZ) --overwrite
# 7. Node pressure: cordon + drain
kubectl cordon NODE
kubectl drain NODE --ignore-daemonsets --delete-emptydir-data --grace-period=60
# 8. Verify pod stabilises
kubectl get pod POD -n NAMESPACE --watch
kubectl top pod POD -n NAMESPACE --containers
Detects Kubernetes Node NotReady and kubelet failure across 6 signals (node_not_ready / node_disk_pressure / node_memory_pressure / node_pid_pressure / kubelet_unhealthy / node_network_unavailable), classifies root cause across 6 classes (transient_kubelet_hang / runtime_crash / resource_pressure_disk / resource_pressure_mem_pid / network_plugin_failure / hardware_cloud_failure), and autonomously cordons, drains (respecting PDBs), restarts kubelet or container runtime, prunes images for disk pressure, or restarts CNI daemonset pods. Cloud provider node replacement (AWS ASG, GCP MIG, Azure VMSS) is gated behind operator confirmation or BYTEPORT_AUTO_REPLACE_NODES=true. Hard guardrail: never drain more than 1 node concurrently, never touch control-plane nodes.
- Detect: poll NodeCondition transitions via kubectl; pull kubelet systemd status (systemctl is-active/is-failed); read recent kernel messages (journalctl -k) for OOM/hardware errors
- Diagnose: kubectl describe node (conditions, allocatable vs. capacity, taints, events), journalctl -u kubelet --since '30 min ago' (last 200 lines), container runtime status (systemctl status containerd/crio), inode exhaustion (df -i), fd exhaustion (/proc/sys/fs/file-nr), image disk usage (crictl/docker images)
- Classify (first match, confidence ≥0.70): hardware_cloud_failure (kernel OOM/MCE/panic, conf=0.94, always escalate); network_plugin_failure (NetworkUnavailable + CNI errors, conf=0.88); runtime_crash (containerd/crio unit failed, conf=0.91); resource_pressure_disk (DiskPressure condition or disk ≥85% or inode_free ≤5%, conf=0.90); resource_pressure_mem_pid (MemoryPressure/PIDPressure condition, conf=0.88); transient_kubelet_hang (inactive but not failed, restart_count ≤2, conf=0.82); unknown (<0.70 confidence, always escalate)
- Remediate — cordon node first, then per class: transient_kubelet_hang → systemctl restart kubelet; runtime_crash → drain → systemctl restart containerd/crio then kubelet; resource_pressure_disk → drain → crictl/docker image prune + delete evicted pods + restart kubelet; resource_pressure_mem_pid → drain → restart kubelet (node replacement if memory ≥95%); network_plugin_failure → delete CNI daemonset pod on node (flannel/calico/cilium/weave); hardware_cloud_failure → escalate with full bundle, cloud-provider node replacement if BYTEPORT_AUTO_REPLACE_NODES=true
- Verify: node returns to Ready within 5 min window; no new eviction events for 5 min; no PDB violations; pods rescheduled and Running/Ready; retry once at 15s; escalate on double-fail
- Post-mortem: structured record — node_name, signal, classification, confidence, root_cause, actions_taken, cordon/drain/restart timeline, pods_evicted, pods_rescheduled, time_to_recovery_seconds, escalated bool
- Never drain more than 1 non-control-plane node concurrently (in-process _active_drains lock) — refuses with escalation if limit reached
- Never touch nodes with label node-role.kubernetes.io/control-plane or node-role.kubernetes.io/master — always escalate
- Respect PodDisruptionBudgets: kubectl drain with --disable-eviction=false; if PDB violation returned → surface error verbatim and escalate
- Hardware/cloud failures always escalate — no autonomous node replacement unless BYTEPORT_AUTO_REPLACE_NODES=true explicitly set
- Minimum confidence 0.70 before acting autonomously — lower confidence escalates with full diagnostic bundle
- Image prune only (crictl/docker rmi --prune) — never deletes PersistentVolumes or PersistentVolumeClaims
- BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run: all kubectl/systemctl/crictl/docker calls are no-ops
Postmark + Slack escalation with: node name, signal, classification, confidence, kernel error excerpt (if hardware), kubelet journal tail (last 10 lines), runtime status, disk usage, actions taken or blocked, PDB status, drain outcome. GitHub issue with full kubectl describe node output + NodeCondition timeline + classification rationale. 1h dedup per node.
core/nodes (get, describe — conditions, allocatable, taints; cordon via kubectl cordon)
core/nodes (drain — eviction API; respects PodDisruptionBudgets)
core/pods (list — find evicted pods for cleanup; get — rescheduling verification)
core/pods (delete — remove evicted/Failed phase pods)
policy/poddisruptionbudgets (get — PDB violation guard during drain)
apps/daemonsets (get — locate CNI daemonset for pod restart)
core/events (list — recent node events, eviction event detection)
🔴 *node_not_ready* | node: `worker-3` | signal: `kubelet_unhealthy` | class: `runtime_crash` | conf: 0.91 | action: cordon → drain (4 pods evicted) → systemctl restart containerd → systemctl restart kubelet | node Ready in 2m18s | 🔗 /runbooks/node_not_ready
# 1. Check node conditions
kubectl get node NODENAME -o wide
kubectl describe node NODENAME | grep -A 10 Conditions:
# 2. Check kubelet systemd status (on node)
systemctl status kubelet
journalctl -u kubelet --since '30 min ago' --no-pager | tail -50
# 3. Check container runtime (containerd)
systemctl status containerd
journalctl -u containerd --since '10 min ago' --no-pager | tail -30
# 4. Disk and inode usage (on node)
df -h /
df -i /
# 5. Cordon node
kubectl cordon NODENAME
# 6. Drain node (respecting PDBs)
kubectl drain NODENAME --ignore-daemonsets --delete-emptydir-data --grace-period=60
# 7. Restart kubelet
systemctl restart kubelet
# 8. Verify node returns Ready
kubectl get node NODENAME --watch
# 9. Uncordon when healthy
kubectl uncordon NODENAME
Detects Kubernetes pods in CrashLoopBackOff and container restart loops, classifies root cause across 7 failure classes (oom_at_startup / bad_image_or_tag / failed_init_container / probe_misconfigured / missing_configmap_or_secret / app_panic_on_boot / node_pressure_eviction), and autonomously remediates within hard guardrails — max 1 rollback/service/hour, max 1 resource bump/pod/day, never delete PersistentVolumes.
- Diagnose: kubectl describe pod, previous + current container logs (last 2000 chars), exit code analysis (137 OOM / 139 segfault / 127 missing binary / 1 app error), event timeline, image pull status, probe definitions, resource requests vs node capacity, recent ConfigMap/Secret changes
- Classify (first match): node_pressure_eviction (node MemoryPressure/DiskPressure condition); bad_image_or_tag (ImagePullBackOff/ErrImagePull in events); missing_configmap_or_secret (missing K8s resource in events); failed_init_container (init container exit_code ≠ 0); oom_at_startup (exit code 137); probe_misconfigured (liveness/readiness probe fires before app ready, app logs show healthy startup); app_panic_on_boot (exit 139 or panic/fatal/exception in logs + restart_count ≥3)
- oom_at_startup: bump memory request/limit 1.5× (max 1 bump/pod/day), patch deployment, rolling restart
- bad_image_or_tag: kubectl rollout undo deployment (max 1 rollback/service/hour)
- failed_init_container: delete pod to re-trigger init if non-external dep; escalate if external dep connectivity failure
- probe_misconfigured: patch deployment with doubled initialDelaySeconds and failureThreshold=5
- app_panic_on_boot: extract stack trace from previous logs; rollback if restart_count ≥3
- node_pressure_eviction: kubectl cordon node, delete pod to reschedule onto healthy node
- Verify: pod reaches Ready, restart delta=0 over 5 min, probe Unhealthy events gone; retry once after 60s
- Post-mortem: emit structured event (classification, action, time-to-resolve, rollback-needed) to /demo timeline
- Max 1 kubectl rollout undo per service per hour (in-process rolling counter)
- Max 1 memory resource bump per pod per day (in-process rolling counter)
- Never delete PersistentVolumeClaims or PersistentVolumes under any circumstance
- missing_configmap_or_secret: audit only — never auto-create Secrets (operator action required)
- failed_init_container: escalate on external dependency failure, do not loop-retry
- Minimum confidence 0.70 before acting autonomously — lower confidence escalates
- Verify must confirm pod Ready + zero new restarts before marking resolved; escalate on double-fail
Postmark alert with pod name, namespace, exit code, classification, confidence score, previous log excerpt (last 500 chars), actions taken or blocked, and guardrail hit (if any). GitHub issue with byteport + severity:critical + crash_loop labels, full diagnostic bundle. 24h dedup: same pod + same exit code → adds comment, not new issue.
core/pods (describe pod, read spec, resource limits, restart counts)
core/pods/log (read previous and current container logs)
core/pods/exec (kubectl top pod for live memory usage)
core/namespaces (read byteport.io/allow-remediation annotation)
core/events (read BackOff/CrashLoopBackOff/ImagePullBackOff events)
core/nodes (read MemoryPressure/DiskPressure conditions, cordon)
apps/deployments (rollout undo, patch memory limits, patch probe settings)
🔴 *crash_loop_backoff* | namespace: `backend` | pod: `api-7d9f8-xk2p9` | class: `oom_at_startup` | exit: 137 | restarts: 8 | action: memory limit 256Mi → 384Mi, rolling restart | pod ready in 2m14s | 🔗 /runbooks/crash_loop_backoff
# 1. Check pod status and restart count
kubectl get pod -n NAMESPACE POD -o wide
# 2. Fetch previous container logs
kubectl logs POD -n NAMESPACE --previous --tail=100
# 3. Describe for events and exit code
kubectl describe pod POD -n NAMESPACE
# 4. Exit code classification
# 137 → OOMKilled (bump memory limit)
# 139 → Segfault (rollback or investigate binary)
# 127 → Missing binary/library (image or entrypoint issue)
# 1 → App error (check logs for panic/fatal)
# 5. Rollback to last-good ReplicaSet
kubectl rollout undo deployment/DEPLOY -n NAMESPACE
# 6. Bump memory limit (OOM at startup)
kubectl patch deployment DEPLOY -n NAMESPACE -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"384Mi"}}}]}}}}
# 7. Loosen liveness probe (probe misconfigured)
kubectl patch deployment DEPLOY -n NAMESPACE -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","livenessProbe":{"initialDelaySeconds":30,"failureThreshold":5}}]}}}}'
Detects Kubernetes pods in crashloop and classifies the exit code (OOMKilled, missing binary, image auth, config error) before safely restarting with back-off guard.
- Fetch kubectl logs --previous to capture crash payload
- Classify exit code: 137=OOMKilled, 127=missing binary, 1=app error, 0=image auth failure
- For OOMKilled: inspect memory.limit vs usage, suggest adjustment if limit is within 20% of actual
- For transient errors: apply exponential back-off restart policy via kubectl rollout restart
- Safe-restart only if byteport.io/allow-remediation: "true" annotation present on namespace
- Requires byteport.io/allow-remediation: "true" on the namespace — pod-level annotation insufficient
- Never force-deletes a pod in Terminating state
- Requires crash classification before acting — raw restart count alone does not trigger
- Max 1 restart attempt per 3-minute window
- Skipped if priorityclass.kubernetes.io/no-autoremediation: "true" present
Postmark alert with pod name, namespace, exit code, previous log excerpt (last 500 chars), cluster, and 3 suggested steps. GitHub issue with byteport + severity:{level} + pod_crashloop labels. 24h dedup: same pod + same exit code → adds comment, not new issue.
core/pods/log (read previous container logs)
core/pods (describe pods for exit code and events)
core/namespaces (read byteport.io annotations)
core/events (read namespace events for BackOff reason)
🚨 *pod_crashloop* | namespace: `payment-api` | pod: `api-7d9f8-xk2p9` | exit code: `137 (OOMKilled)` | restarts: 5 in last 10 min | action: memory limit patched + pod restarted | 🔗 https://app.byteport.polsia.app/runbooks/pod_crashloop
# 1. Check pod status and restart count
kubectl get pod -n NAMESPACE POD -o wide
# 2. Fetch previous container logs
kubectl logs POD -n NAMESPACE --previous
# 3. Describe for events and exit code
kubectl describe pod POD -n NAMESPACE
# 4. Classify via exit code
# 137 → OOMKilled (memory limit exceeded)
# 127 → Missing binary or library
# 1 → Application error (check logs)
# 0 → ImagePullBackOff or ConfigError
# 5. Safe-restart with back-off
kubectl rollout restart deployment/DEPLOY -n NAMESPACE
Detects pods killed by the Kubernetes OOM killer (exit code 137) and safely patches memory limits upward, restarts the pod, and enables debug logging to capture future leak patterns.
- Read current memory.request and memory.limit from pod spec
- Compare against actual usage from kubectl top pod (Metrics Server required)
- If usage >80% of limit: suggest increasing memory.limit to usage × 1.5
- If leak signature detected: patch memory.limit upward and file postmortem
- If limits are appropriately sized: flag for app-level investigation and escalate
- Restart pod with DEBUG=true env var injection to capture future leak patterns
- Requires byteport.io/auto-remediate: "oom" annotation on the namespace
- Limit patch capped at 2× current value to avoid unbounded resource consumption
- kubectl patch --dry-run=server first — rejected if it would exceed node allocatable memory
- Only the OOM'd container is restarted; sidecar containers are preserved
- Post-restart verification: pod must reach Running state within 60 seconds
Postmark alert with pod name, namespace, OOM count in last 24h, memory limit vs. actual usage comparison, suggested limit adjustment, and whether fix was applied or escalated. GitHub issue includes node allocatable headroom and top consumers on the node.
core/pods (read pod spec for memory limits)
core/pods/exec (kubectl top pod requires exec for Metrics Server)
core/namespaces (read byteport.io/auto-remediate annotation)
core/nodes (read node allocatable for dry-run validation)
🔴 *oom_killed_pod* | namespace: `backend` | pod: `worker-5f9b-x1p3q` | limit: 256Mi → patched to 512Mi | restart: initiated | 3 prior OOMs on this pod — consider increasing HPA memory floor | 🔗 /runbooks/oom_killed_pod
# 1. Check OOM event in pod events
kubectl get events -n NAMESPACE --field-selector reason=OOMKilled
# 2. Get current memory limits
kubectl get pod POD -n NAMESPACE -o jsonpath='{.spec.containers[0].resources}'
# 3. Compare with actual usage
kubectl top pod POD -n NAMESPACE
# 4. Patch memory limit upward (dry-run)
kubectl patch pod POD -n NAMESPACE --dry-run=server -p '{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"512Mi"}}}]}}'
# 5. Apply patch and restart
kubectl patch pod POD -n NAMESPACE -p '{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"512Mi"}}}]}}'
kubectl delete pod POD -n NAMESPACE
Detects Kubernetes pods failing to pull their container image (ErrImagePull, ImagePullBackOff) and surfaces the root cause — auth failure, bad tag, or network misconfiguration — with specific remediation steps.
- Classify failure type: auth (401/403), missing tag (404), network unreachable, or registry timeout
- For auth failures: list imagePullSecrets on the pod and identify the missing or misconfigured secret
- For missing tag: suggest the correct image tag from the previous deployment's image
- For network unreachable: check node DNS resolution and registry reachability
- For registry timeout: check if registry rate limit has been hit (Docker Hub unauthenticated)
- Suggest kubectl set image or kubectl patch secret commands for the specific failure type
- Auto-remediation limited to diagnostics and suggestions — no destructive actions taken
- Requires byteport.io/allow-remediation: "true" annotation to include secret patch suggestion
- Never modifies imagePullSecrets directly — only surfaces the correct patch command
- Skipped if pod is already terminating or in a Completed state
- Max 1 alert per 10-minute window per pod to avoid alert storms during image updates
Postmark alert with failure classification (auth/tag/network/ratelimit), the specific image URI, registry configuration state, and the exact kubectl commands to fix it. GitHub issue with image_pull_backoff label and registry diagnostic output.
core/pods (read image, imagePullSecrets, container spec)
core/secrets (read imagePullSecrets to identify misconfiguration — redacted in alert)
core/namespaces (read byteport.io annotations)
core/events (read image pull failure events)
⚠️ *image_pull_backoff* | namespace: `checkout` | pod: `api-0` | reason: `401 Unauthorized` — imagePullSecret `reg-creds` may be expired | fix: `kubectl create secret docker-registry ...` | 🔗 /runbooks/image_pull_backoff
# 1. Check image pull failure events
kubectl get events -n NAMESPACE --field-selector reason=Failed
# 2. Inspect pod image and pull secrets
kubectl get pod POD -n NAMESPACE -o jsonpath='{.spec.imagePullSecrets}' | jq
kubectl get pod POD -n NAMESPACE -o jsonpath='{.spec.containers[0].image}'
# 3. Test registry authentication
kubectl create secret docker-registry reg-secret \
--docker-server=https://index.docker.io/v1/ \
--docker-username=USER \
--docker-password=PASS \
--docker-email=EMAIL -n NAMESPACE
# 4. Patch image to a known-good tag
kubectl set image deployment/DEPLOY CONTAINER=registry.io/image:sha-TAG -n NAMESPACE
# 5. Verify pull succeeds
kubectl describe pod POD -n NAMESPACE | grep -A5 Events
Detects Kubernetes pods stuck in Pending state due to unschedulable conditions — resource exhaustion, storage constraints, affinity conflicts, or missing dependencies — and applies emergency scheduling workarounds where viable.
- Read Events for unschedulable reason: Insufficient memory/CPU, NoVolumeZoneConflict, AffinityConflict, NodeSelectorMismatch, or ResourceQuotaExceeded
- Check node allocatable vs. requested — identify if cluster has headroom on any node
- If cluster has headroom: attempt to relax PodAntiAffinity constraints to schedule on a less-contested node
- If priorityClass is low: suggest bumping to a cluster-appropriate priority to preemption-wins the scheduling slot
- Clear taints from target nodes where tolerations are misconfigured
- Trigger cluster autoscaling via cloud provider API if node pool has room to scale up
- Requires byteport.io/allow-remediation: "true" on the namespace
- Never force-schedules a pod on a specific node — always resolves the scheduling constraint generically
- Never modifies node resources, node selectors, or node labels
- Never removes PodDisruptionBudgets to increase schedulability
- Skips when reason is ResourceQuotaExceeded — quota changes require human approval
- BETA: requires byteport.io/beta-ack: "true" annotation on the namespace
Postmark alert with scheduling failure reason, requested vs. allocatable resources across nodes, and the specific kubectl command to resolve (e.g., increase memory request, add node toleration). GitHub issue for ResourceQuotaExceeded cases requiring human quota adjustment.
core/nodes (read node allocatable and existing taints)
core/pods (read pod spec for affinity and resource requests)
core/namespaces (read byteport.io annotations)
core/events (read FailedScheduling events for reason classification)
⚠️ *pod_pending_unschedulable* [beta] | namespace: `queues` | pod: `worker-2` | reason: `Insufficient memory` | cluster headroom: 2 nodes available | action: relaxed anti-affinity, pod now scheduling | 🔗 /runbooks/pod_pending_unschedulable
# 1. Check why pod is unschedulable
kubectl describe pod POD -n NAMESPACE | grep -A10 Events
# 2. Check node allocatable vs. requests
kubectl describe nodes | grep -E 'Allocatable|Allocated resources|memory|CPU'
# 3. List pods by priority (identify preemption candidates)
kubectl get pods -n NAMESPACE -o 'jsonpath={range .items[*]}{"Pod: "}{.metadata.name}{"\tPriority: "}{.spec.priority}{"\tNode: "}{.spec.nodeName}{"\n"}{end}' | sort -k3
# 4. Remove taint to schedule (when toleration misconfigured)
kubectl taint nodes NODENAME node.kubernetes.io/not-ready:NoSchedule-
# 5. Bump priority to enable preemption
kubectl patch pod POD -n NAMESPACE -p '{"spec":{"priorityClassName":"high-priority"}}'
Detects Kubernetes nodes entering NotReady, MemoryPressure, DiskPressure, or PIDPressure state and coordinates a safe node drain — cordoning the node, allowing graceful pod termination, and restoring schedulability once the node recovers.
- Check if node is already unschedulable (cordon happened) or still accepting new pods
- Identify critical pods with disruption budgets on the node
- If critical pods exist: wait up to 120 seconds for PodDisruptionBudget to be satisfied before draining
- Execute kubectl cordon NODE, then kubectl drain NODE --ignore-daemonsets --delete-emptydir-data --grace-period=60
- After drain completes: mark node as cordoned so new pods don't schedule
- Post-recovery: uncordon node when NodeReady status is restored for 60 continuous seconds
- Requires byteport.io/allow-remediation: "true" on the namespace
- Critical pods (PriorityClass system or above 10000) are never forcibly evicted — drain blocked until PDB satisfied or 5-minute timeout
- PDB timeout: 5 minutes max wait — after timeout, drain proceeds and issue is created for operator
- Node is never force-deleted from the cluster
- BETA: requires byteport.io/beta-ack: "true" annotation on the namespace
Postmark alert if node stays NotReady >10 minutes with: node name, current pressure type, allocatable vs. requested resources, top 5 pods on node and their priority. GitHub issue with node status timeline, recent events log, and recommended remediation steps. 4h dedup window.
core/nodes (read status, read/write unschedulable field)
core/pods (read pod spec for critical annotation and priority class)
core/pods/evict (evict pods during drain)
core/namespaces (read byteport.io annotations)
core/events (read node events for status transitions)
🔴 *node_not_ready* [beta] | node: `ip-10-0-2-14` | condition: `MemoryPressure` | critical pods on node: 2 (hpa-managed) | action: cordon + drain initiated | 🔗 /runbooks/node_not_ready
# 1. Check node status and events
kubectl get nodes NODE -o wide
kubectl describe nodes NODE | grep -A20 Events
# 2. Identify pressure type
kubectl get events -n default --field-selector involvedObject.name=NODE --sort-by='.lastTimestamp'
# 3. Check critical pods on node
kubectl get pods -A -o 'jsonpath={range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.priority}{"\t"}{.spec.nodeName}{"\n"}{end}' | grep NODE | sort -k3
# 4. Cordon and drain
kubectl cordon NODE
kubectl drain NODE --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=300s
# 5. Monitor drain progress
kubectl get events -n default --field-selector reason=Draining
# 6. Restore node (post-recovery)
kubectl uncordon NODE
Ingest failed deployment webhooks from CI systems (GitHub Actions, ArgoCD, Flux, Spinnaker) and automatically rolls back to the last healthy revision — with safety gates and commit-author notification.
- Read byteport.io/allow-rollback: "true" annotation on the Deployment or Namespace
- Inspect kubectl rollout history deployment/NAME -n NAMESPACE to confirm a stable previous revision exists
- Execute kubectl rollout undo deployment/NAME -n NAMESPACE
- Notify the commit author via Postmark with rollback confirmation, revision diff, and rollback duration
- Log the rollback to the signal event audit trail for postmortem correlation
- Hard-gate: byteport.io/allow-rollback annotation must be present — no implicit rollback
- Requires at least one stable revision in rollout history before executing undo
- Never rolls back if current revision is less than 2 minutes old — deploy may still be in progress
- Blocked if Deployment is managed by ArgoCD in auto-sync mode (checked via annotation)
- Pre-rollback snapshot captures replica count, image tag, and resource limits for audit
Postmark alert with deployment name, namespace, failure reason from webhook payload, commit SHA, author, and what prevented auto-rollback (missing annotation, no stable revision, ArgoCD sync conflict). 24h dedup per deployment.
apps/deployments (read rollout history, execute rollout undo)
core/namespaces (read byteport.io/allow-rollback and ArgoCD sync-mode annotations)
core/events (log rollback action to namespace events)
↩️ *deployment_rollback* | deploy: `api-service` | namespace: `production` | failure: `test failure` | SHA: `a3f9c12` | author: `maria@` | rollback: triggered (revision 42 → 41) | 🔗 /runbooks/deployment_rollback
# 1. Check rollback eligibility annotation
kubectl get deploy NAME -n NAMESPACE -o jsonpath='{.metadata.annotations.byteport.io\/allow-rollback}'
# 2. Inspect rollout history
kubectl rollout history deployment/NAME -n NAMESPACE
# 3. Roll back to last stable revision
kubectl rollout undo deployment/NAME -n NAMESPACE
# 4. Watch rollback progress
kubectl rollout status deployment/NAME -n NAMESPACE
# 5. Verify previous image/tag restored
kubectl get deploy NAME -n NAMESPACE -o jsonpath='{.spec.template.spec.containers[0].image}'
TLS / Certificates
Detects TLS certificate expiry across 6 signals (cert_expiring_soon at 30/14/7/1-day thresholds, cert_expired, cert_chain_invalid, ocsp_stapling_failure, acme_challenge_failure, renewal_rate_limit_hit), classifies the renewal approach across 7 classes (acme_http01, acme_dns01, cert_manager, acme_rate_limited, cloud_managed, commercial_ca, internal_ca), and autonomously renews via certbot, cert-manager CRD delete-and-reissue, or cloud API — then hot-reloads nginx/envoy/haproxy/traefik/k8s-ingress and verifies chain validity plus OCSP staple. Escalates immediately for pinned certs, wildcard certs without a configured DNS-01 provider, rate-limited domains, and commercial/internal CAs.
- Diagnose: openssl s_client -connect domain:443 -showcerts -status to fetch served cert; parse notAfter, issuer O=/CN=, SAN list, serial, OCSP Must-Staple extension, OCSP staple presence; detect wildcard (*.domain); locate cert files on disk (certbot live dir, /etc/ssl/certs, /etc/nginx/ssl); identify web server via pgrep (nginx/envoy/haproxy/traefik/caddy/k8s ingress); check DNS zone ownership (dig SOA); detect cert-manager Certificate CRD in cluster; estimate ACME rate-limit window remaining; check BYTEPORT_PINNED_DOMAINS for cert pinning.
- Classify (first match, confidence-gated ≥0.70): acme_rate_limited (signal=renewal_rate_limit_hit or window exhausted, conf=0.92); cert_pinned (BYTEPORT_PINNED_DOMAINS match, conf=0.95, always escalate); cert_manager (cert-manager CRD present, conf=0.93); cloud_managed (issuer O in {Amazon, Google Trust Services}, conf=0.88); acme_dns01 (wildcard or challenge_failure + dns_provider configured, conf=0.91); acme_http01 (LE/ZeroSSL issuer, non-wildcard, zone owned, conf=0.90); internal_ca (self-signed, conf=0.87); commercial_ca (fallback, conf=0.75).
- Remediate: acme_http01 → certbot renew --cert-name --webroot or --standalone --non-interactive [--server staging]; acme_dns01 → certbot renew --dns-{route53|cloudflare|google|azure}; cert_manager → kubectl delete secret
-tls + kubectl annotate Certificate cert-manager.io/issue-temporary-certificate=true; cloud_managed → aws acm renew-certificate --certificate-arn $BYTEPORT_ACM_ARN; commercial_ca/internal_ca/cert_pinned/rate_limited → escalate with full cert bundle + retry ETA. - Hot-reload (server-aware, no downtime): nginx → nginx -s reload; caddy → caddy reload --config /etc/caddy/Caddyfile; haproxy → haproxy -sf $(cat /run/haproxy.pid); envoy/traefik → kill -HUP
; k8s ingress → kubectl rollout restart deployment/ ; unknown server → escalate. - Verify (wait_seconds=10): re-fetch via openssl s_client; confirm notAfter > now+30d; validate chain (openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt); check OCSP staple if Must-Staple set; confirm serial changed from pre-renewal; handshake_ok. On failure: restore cert backup + reload + escalate.
- Post-mortem: structured record — domain, signal, classification, issuer, challenge_used, tool_used, days_remaining_before/after, chain_valid_after, ocsp_ok_after, resolved bool, time_to_resolve_seconds, escalated bool.
- DNS zone ownership confirmed via SOA lookup before any ACME action — refuses if zone not under our control
- Cert backup always created before cert swap (BYTEPORT_CERT_BACKUP_DIR); rollback restores backup + reloads on verify failure
- Pinned certs (BYTEPORT_PINNED_DOMAINS) always escalate — never renew autonomously (mobile client breakage risk)
- ACME rate-limit awareness: tracks in-process renewal history; refuses within window; parses Retry-After for retry ETA
- Wildcard certs require dns-01 provider to be configured — escalates if missing
- Minimum confidence 0.70 before acting autonomously — lower confidence escalates with full diagnostic bundle
- BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run: all certbot/kubectl/aws commands are no-ops; dry-run also passed to certbot
Postmark + Slack escalation with: domain, signal, classification, issuer, days_remaining, challenge attempted, error details, ACME rate-limit retry ETA (if applicable), cert backup path. GitHub issue with full openssl s_client output + cert chain + classification rationale. 6h dedup per domain.
core/secrets (get, delete — for cert-manager TLS secret delete-and-reissue)
cert-manager.io/certificates (get, patch, annotate — force re-issuance)
apps/deployments (rollout restart — for k8s ingress controller reload)
exec: nginx -s reload / caddy reload / kill -HUP (for non-k8s server reload)
🔒 *tls_cert_expiry_auto_renewal* | domain: `api.prod.example.com` | signal: `cert_expiring_soon` | days_remaining: 6 | class: `acme_http01` | action: certbot renew --cert-name api.prod.example.com → renewed | reload: nginx -s reload | chain valid ✓ | days_after: 89d | 🔗 /runbooks/tls_cert_expiry_auto_renewal
# 1. Inspect served cert
openssl s_client -connect DOMAIN:443 -showcerts -status -servername DOMAIN < /dev/null 2>&1 | openssl x509 -noout -text -dates
# 2. Check cert chain validity
openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt /etc/letsencrypt/live/DOMAIN/fullchain.pem
# 3. Dry-run renewal
certbot renew --cert-name DOMAIN --dry-run --non-interactive
# 4. Live renewal (http-01)
certbot renew --cert-name DOMAIN --webroot -w /var/www/html --non-interactive
# 5. Live renewal (dns-01 / wildcard)
certbot renew --cert-name DOMAIN --dns-route53 --non-interactive
# 6. Reload nginx
nginx -s reload
# 7. Verify new cert
openssl s_client -connect DOMAIN:443 -servername DOMAIN < /dev/null 2>&1 | openssl x509 -noout -enddate
Detects SSL/TLS certificates expiring within 14 days (warning) or 7 days (critical) and triggers automated renewal via cert-manager, AWS ACM, or certbot — with pre/post snapshots and DNS validation checks.
- For cert-manager: patch Certificate resource with cert-manager.io/force-renew="true", wait for issuance and secret propagation
- For AWS ACM: call acm.RequestCertificate for the same domain, wait for DNS validation if pending, log new ARN
- For certbot/acme.sh: trigger certbot renew --quiet or acme.sh --renew -d domain.com, validate renewed cert serial
- For manual certs: escalate with full cert metadata, days-to-expiry, and renewal commands — no automated path
- Never runs renewal on certificates with fewer than 3 days remaining — too risky
- Pre-renewal snapshot records current cert serial number, issuer, and SAN list
- Post-renewal verification reads new cert serial and compares to pre-snapshot to confirm rotation succeeded
- Requires byteport.io/auto-remediate: "cert" annotation on the namespace or Certificate resource
- ACM DNS validation: runbook blocks if DNS challenge is not resolvable within 60 seconds
Postmark alert with domain, cert CN + SANs, days-to-expiry, issuer, current renewal path, and whether auto-renewal succeeded or was blocked. GitHub issue with cert fingerprint, renewal commands, and DNS admin contact. 7d dedup per domain.
core/certificates (read and patch cert-manager Certificate resources)
core/secrets (read tls.crt from Secrets for OpenSSL scan)
aws/acm (request new certificate, describe certificate for validation status)
⚠️ *cert_expiry* | domain: `api.example.com` | cert: `*.example.com` | expires: 2026-06-06 (7 days) | issuer: Let's Encrypt | action: cert-manager force-renew triggered | 🔗 /runbooks/cert_expiry
# 1. Check cert expiry (OpenSSL scan)
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null | openssl x509 -noout -dates
# 2. cert-manager: force renewal
kubectl annotate certificate NAME -n NAMESPACE cert-manager.io/force-renew="true"
# 3. Wait for cert to be re-issued
kubectl get certificate NAME -n NAMESPACE -w
# 4. Verify new cert in Secret
kubectl get secret TLS-SECRET-NAME -n NAMESPACE -o jsonpath='{.data.tls.crt}' | base64 -d | openssl x509 -noout -serial -dates
# 5. Reload nginx / ingress controller
kubectl rollout restart deployment/ingress-controller -n ingress-nginx
End-to-end autonomous TLS certificate rotation: diagnoses cert management toolchain (cert-manager / Caddy / Traefik / certbot / IaC), validates ACME challenge via Let's Encrypt staging, backs up cert, rotates, and confirms via TLS handshake (3× retry). Escalates with full diagnostic if any step fails.
- Detect cert source: cert-manager (CRD), Caddy/Traefik (process detection), certbot/acme.sh (config dir), IaC (metadata flags)
- Validate ACME challenge path via Let's Encrypt staging dry-run (DNS-01 or HTTP-01) — abort and escalate if staging fails
- Backup current cert + key to /var/lib/byteport/cert-backups/
- / for 7-day rollback - cert-manager: kubectl annotate certificate with cert-manager.io/force-renew, wait Ready=True (120 s)
- Caddy / Traefik: systemctl reload to trigger internal ACME renewal
- certbot / acme.sh: certbot renew --force-renewal
- IaC-managed: open PR via GitHub adapter bumping cert resource — never auto-apply
- TLS-handshake verify 3× with 30 s backoff, parse notAfter to confirm ≥30 days remaining
- Post-rotation Slack notification: ✅ Rotated cert for
. New expiry: . Took . - On any failure: escalate to PagerDuty with full diagnostic, renewal command, and P-level (P1/P2/P3)
- Hard rate limit: max 1 rotation per domain per 24 h — prevents duplicate ACME requests
- BYTEPORT_RUNBOOK_DRY_RUN=true: all writes become no-ops; logs intended actions only
- Pre-rotation backup always runs before any cert write
- Staging validation gate: certbot --dry-run --staging must exit 0 before live rotation on raw-ACME paths
- IaC path is PR-only — never auto-applies Terraform/Pulumi changes
- BETA: requires byteport.io/beta-ack: "true" annotation on the namespace for cert-manager paths
PagerDuty Events API v2 (via PagerDutyEventsNotifier) on any step failure, with domain, P-level (P1/P2/P3 by days remaining), cert source, staging output, and exact renewal commands. Slack block-kit notification on success with new expiry and rotation duration.
core/certificates (read + annotate cert-manager Certificate resources)
core/namespaces (read byteport.io annotations)
sysadmin: certbot (certbot renew — requires certbot installed on host)
sysadmin: systemctl reload caddy / traefik
sysadmin: /var/lib/byteport/cert-backups/ (write backup dir)
✅ *ssl-cert-auto-rotate* | domain: `api.example.com` | cert-manager | new expiry: Aug 30 2026 | took 4.2 s | 🔗 /runbooks/ssl-cert-auto-rotate
import { SslCertAutoRotateRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createCloudWatchAdapter } from "byteport-agent/adapters/cloudwatch";
// Signal adapter — fires ssl_cert_expiring when DaysToExpiry < 14
const adapter = createCloudWatchAdapter({ region: "us-east-1" });
const registry = new RunbookRegistry();
registry.register(new SslCertAutoRotateRunbook());
const agent = createAgent({
adapters: [adapter],
runners: [registry.asRunner("runbook")!],
dryRun: true, // flip false after confirming dry-run output
thresholds: { ssl_cert_expiring: 14 }, // trigger when ≤14 days remain
});
await agent.run();
End-to-end autonomous TLS certificate expiry recovery: diagnose served cert via openssl s_client (issuer, notAfter, SANs, chain), classify renewal path (acme_renewable / cert_manager_managed / commercial_ca_manual / internal_ca / wildcard_dns01_required), renew via certbot/acme.sh (http-01 or dns-01 via Route53/Cloudflare/GCP) or cert-manager force-renew, hot-reload the serving process (nginx/caddy/envoy/haproxy/traefik/k8s ingress), then verify new notAfter > now + 30d + chain valid + handshake succeeds. Backs up cert+key before every swap; rolls back and escalates on verification failure. Refuses if DNS zone not owned.
- Diagnose: openssl s_client -connect DOMAIN:443 -servername DOMAIN -status -showcerts → parse notAfter, issuer O=/CN=, SAN list, OCSP staple, chain validity; locate cert files on disk + identify server process (nginx/caddy/envoy/haproxy/traefik/k8s/unknown)
- Classify (first match): cert_manager_managed if k8s env or --cert-name given; wildcard_dns01_required if *.domain with supported DNS provider; acme_renewable if Let's Encrypt / ZeroSSL issuer or cert in certbot live dir; internal_ca or commercial_ca_manual → escalate with full context
- Backup: copy cert + privkey to /var/lib/byteport/cert-backups/
- / before any write - Renew acme_renewable: certbot renew --cert-name DOMAIN (http-01 default) or with --dns-PROVIDER for dns-01; acme.sh --renew as fallback
- Renew wildcard dns-01: certbot/acme.sh with --dns-route53 / --dns-cloudflare / --dns-gcp flag
- Renew cert_manager_managed: kubectl annotate certificate cert-manager.io/force-renew=true, kubectl wait --for=condition=Ready --timeout=120s
- Hot-reload: nginx -s reload / caddy reload / kill -HUP envoy / systemctl reload haproxy / systemctl reload traefik / kubectl rollout restart ingress-controller
- Verify: re-fetch cert via openssl s_client, confirm notAfter > now + 30d AND chain_valid AND handshake_ok; rollback + hot-reload + escalate on failure
- DNS zone ownership check: refuses renewal if zone SOA cannot be resolved (dig/nslookup) — prevents acting on externally-managed domains
- Backup-before-swap: cert + key backed up to /var/lib/byteport/cert-backups/ before any write; abort if backup fails
- Rollback on verification failure: restores prior cert + key from backup, re-runs hot-reload, escalates with full diagnostic
- commercial_ca_manual and internal_ca always escalate — never attempt autonomous renewal for non-ACME issuers
- wildcard_dns01_required without a configured DNS provider → escalate (never attempt http-01 for wildcards)
- BYTEPORT_RUNBOOK_DRY_RUN=true / --dry-run: all subprocess calls are no-ops; logs intended commands only
- Minimum post-renewal floor: notAfter must be >30d from now for verification to pass — rejects a technically-renewed cert with unusually short validity
Postmark + Slack escalation with: domain, issuer (O=/CN=), days remaining, classification reasoning, renewal tool attempted, stdout/stderr of renewal command, backup path, verification outcome (days/chain/handshake), rollback result, and exact manual renewal commands for the operator. GitHub issue with cert fingerprint, SAN list, OCSP staple status, and DNS admin contact. 7d dedup per domain. PagerDuty Events v2 on rollback trigger (verification failed after live renewal).
sysadmin: certbot renew / acme.sh --renew (ACME client on host)
sysadmin: nginx -s reload / systemctl reload caddy haproxy traefik (web server reload)
sysadmin: /var/lib/byteport/cert-backups/ (write backup dir)
sysadmin: /etc/letsencrypt/live/ and /etc/ssl/ (read cert files on disk)
k8s: certificates.cert-manager.io (get + annotate), deployments (rollout restart ingress-controller)
dns: Route53 IAM / Cloudflare API token / GCP service account (dns-01 challenge, optional)
✅ *tls_cert_expiry_auto_renew* | domain: `api.example.com` | issuer: `Let's Encrypt` | was: 4d remaining | renewed via: certbot http-01 | new expiry: +90d | reload: nginx -s reload OK | chain: valid | handshake: OK | 🔗 https://app.byteport.polsia.app/runbooks/tls_cert_expiry_auto_renew
# One-shot dry-run (check what would happen)
python runbooks/tls_cert_expiry_auto_renew.py --domain api.example.com
# Live renewal via certbot (http-01)
python runbooks/tls_cert_expiry_auto_renew.py --domain api.example.com --no-dry-run
# Wildcard via dns-01 + Route53
python runbooks/tls_cert_expiry_auto_renew.py \
--domain '*.example.com' \
--challenge dns01 --dns-provider route53 \
--no-dry-run
# cert-manager managed cert
python runbooks/tls_cert_expiry_auto_renew.py \
--domain api.example.com \
--cert-name api-tls --namespace production \
--no-dry-run
Database
Detects database connection pool exhaustion across Postgres, MySQL, and pgbouncer: classify root cause across 6 classes (leak_after_deploy / idle_in_tx_backlog / long_running_query / undersized_pool / missing_pgbouncer / runaway_cron_fanout), terminate idle-in-transaction sessions and long-running queries within hard safety caps, reload pgbouncer with raised pool size, trigger rolling restart of leaking deployments, and verify utilization drops below 70% with a structured postmortem including query fingerprints and permanent fix recommendation.
- Diagnose: query pg_stat_activity for full state breakdown (active / idle / idle-in-tx / waiting), capture top wait_events and long-running queries, pull pgbouncer SHOW POOLS + SHOW STATS, correlate connections_trend with last deploy SHA, inspect env vars for Hikari/SQLAlchemy/pgx pool config
- Classify (first match, confidence ≥0.70): leak_after_deploy (deploy marker + rising idle-in-tx trend, conf=0.88); idle_in_tx_backlog (≥3 sessions older than threshold, conf=0.91); long_running_query (≥2 queries over budget or 1 over 10min, conf=0.85); missing_pgbouncer (pgbouncer_waiting>0, conf=0.80); runaway_cron_fanout (single app_name holds ≥15 connections, conf=0.82); undersized_pool (utilization ≥85% with no dominant cause, conf=0.75)
- idle_in_tx_backlog: terminate eligible idle-in-transaction sessions via pg_terminate_backend, capped at kills_per_minute_cap (default 20/min), skipping pgbouncer/admin-proxy/superuser connections
- long_running_query: cancel over-budget queries via pg_cancel_backend (graceful), fall back to pg_terminate_backend if cancel fails, capture query text for EXPLAIN postmortem
- missing_pgbouncer / undersized_pool: send RELOAD to pgbouncer admin console to pick up raised default_pool_size from pgbouncer.ini
- leak_after_deploy: terminate old idle-in-tx sessions first, then trigger kubectl rollout restart on the offending deployment to drain all leaked connection handles
- runaway_cron_fanout: terminate idle-in-tx connections from the dominant application_name within cap; signal app tier to recycle
- Verify: re-sample pg_stat_activity at 30s; pass if pool utilization <70%, lock_waiters=0, error_rate normalized; retry once at 60s; escalate on double-fail
- Hard cap: kills_per_minute_cap (default 20) backend terminations per minute — rejects excess over cap and escalates
- Critical app allowlist: never terminates pgbouncer, admin-proxy, pgadmin, or superuser connections under any circumstance
- Superuser check: every connection is checked for is_superuser=true before termination — superusers are always skipped
- pgbouncer RELOAD only (never KILL): pool config reload is safe; this runbook never issues SHUTDOWN or kills pgbouncer itself
- BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run: all pg_terminate_backend / pg_cancel_backend / kubectl / psql calls are no-ops
- Minimum confidence 0.70 before acting autonomously — lower confidence escalates to on-call with full diagnostic bundle
- Verify must confirm utilization <70% before marking resolved; double-fail → escalate
Postmark + Slack escalation with: signal, classification, confidence score, before/after pool utilization %, pids_terminated, pids_cancelled, top 5 query fingerprints (literals stripped), pgbouncer pool state, last deploy SHA, kill cap status, and structured postmortem markdown with permanent fix recommendation. GitHub issue with full pg_stat_activity snapshot. 1h dedup per db_url.
pg_stat_activity (SELECT — connection diagnostics, state breakdown, wait events)
pg_settings (SELECT — max_connections lookup)
pg_terminate_backend (EXECUTE — idle-in-tx + runaway app termination, gated by cap and allowlist)
pg_cancel_backend (EXECUTE — long-running query cancellation, graceful)
pgbouncer admin console (RELOAD — pool config hot-reload, no SHUTDOWN)
kubectl rollout restart (apps/deployments — leak-after-deploy rolling restart)
🚨 *db_connection_pool_exhaustion* | host: `pg-prod-01/appdb` | connections: `94/100` (94%) | class: `idle_in_tx_backlog` | terminated: 12 sessions (idle >5min) from `worker-pool` | utilization after: 41% | action: pg_terminate_backend(1002,1003,...) | 🔗 /runbooks/db_connection_pool_exhaustion
-- 1. Check connection saturation
SELECT count(*) AS total,
(SELECT setting::int FROM pg_settings WHERE name='max_connections') AS max_conns,
round(count(*)*100.0 / (SELECT setting::int FROM pg_settings WHERE name='max_connections'), 1) AS pct
FROM pg_stat_activity
WHERE pid <> pg_backend_pid();
-- 2. Find idle-in-transaction sessions
SELECT pid, usename, application_name, state,
now() - state_change AS idt_duration, left(query, 120) AS query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < now() - interval '5 minutes'
ORDER BY state_change ASC;
-- 3. Find long-running queries
SELECT pid, now() - query_start AS duration, state, left(query, 120) AS query
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < now() - interval '5 minutes'
ORDER BY query_start ASC LIMIT 10;
-- 4. Terminate idle-in-tx session
SELECT pg_terminate_backend(1234);
-- 5. Cancel long-running query (graceful)
SELECT pg_cancel_backend(5678);
-- 6. pgbouncer: check waiting clients
-- psql -p 6432 pgbouncer -c 'SHOW POOLS;'
-- psql -p 6432 pgbouncer -c 'RELOAD;'
Detects when PostgreSQL or MySQL active connections exceed 80% of max_connections and identifies connection leaks (idle-in-transaction > 30s) vs. load-driven saturation — terminating leak backends and suggesting PgBouncer configuration.
- Query pg_stat_activity to identify idle_in_transaction sessions held >30s — treated as leaks if >50% of idle connections
- For connection leaks: identify backend PID with longest idle-in-transaction duration, capture query text, then terminate via pg_terminate_backend(PID) only if byteport.io/auto_remediate: "db_terminate" annotation present
- For load-driven saturation: identify top 3 longest-running queries via pg_stat_activity.duration, log PIDs and query text, suggest PgBouncer/PgCat via escalation
- Pre-termination audit: capture current connection count, top wait events from pg_stat_activity.wait_event, and application_name
- Post-fix verification: waits 30s then confirms active connections dropped below 70%
- Requires byteport.io/auto_remediate: "db_terminate" annotation for backend termination — never kills without this
- Never terminates a connection with an uncommitted transaction that has row locks (checks pg_blocking_pids)
- Max 1 connection termination per 2-minute window
- Requires byteport.io/auto_remediate: "db_query" for pg_stat_activity reads
- Never terminates a connection where usesuper=true in pg_stat_activity
Postmark alert with active vs. max connections, idle-in-transaction count, top wait events, longest-running query text, and whether termination was executed or blocked. GitHub issue with suggested PgBouncer pool settings and leak vs. load classification.
postgres: pg_monitor (read pg_stat_activity, pg_stat_replication)
postgres: pg_execute (pg_terminate_backend)
sysadmin: SHOW PROCESSLIST (MySQL equivalent)
🔴 *connection_pool_exhausted* | DB: `prod-postgres` | active: 152 / 200 (76%) | leaks: 8 idle-in-transaction > 30s | PID 18432 terminated | connections now: 98/200 | 🔗 /runbooks/connection_pool_exhausted
-- 1. Check connection saturation
SELECT count(*) AS active_conns,
(SELECT setting FROM pg_settings
WHERE name = 'max_connections')::int AS max_conns,
round(count(*) * 100.0 /
(SELECT setting FROM pg_settings
WHERE name = 'max_connections')::int, 1) AS pct_used
FROM pg_stat_activity
WHERE state != 'idle';
-- 2. Find idle-in-transaction leaks
SELECT pid, usename, application_name,
state, now() - state_change AS idle_duration, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < now() - interval '30 seconds'
ORDER BY state_change ASC;
-- 3. Terminate leaking backend
SELECT pg_terminate_backend(PID)
FROM pg_stat_activity
WHERE pid = YOUR_PID;
-- 4. Top queries by duration
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start ASC
LIMIT 5;
Detects PostgreSQL replica lag exceeding 30 seconds for more than 60 seconds — distinguishing between network latency, disk I/O saturation, and long-running queries on the primary — and terminates blocking queries to resume replication.
- Query pg_stat_replication to identify lagging replicas: application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
- Compare send/replay LSN gap against pg_stat_database.conflict_count to classify root cause: replication_conflict, long_query, or wal_sender_hang
- For long-running query on primary blocking WAL application: identify via pg_stat_activity where query_start is far in the past and state != 'idle'
- Terminate long-running query on primary via pg_cancel_backend(PID) first (graceful), escalate to pg_terminate_backend after 60s
- Log pre-termination snapshot: query text, duration, PID, application_name, and which replica is lagging
- Never terminates a query that has been running for less than 60 seconds — slow queries are expected during bulk loads
- Never terminates a query that holds an exclusive lock on a relation involved in an active replication slot
- Requires byteport.io/auto_remediate: "db_query" annotation for pg_stat_activity reads
- Never skips replication or modifies primary configuration (slot settings, wal_level, max_wal_senders)
- Escalates to operator (never auto-remediates) when root cause is disk I/O or network — those require infra changes
Postmark alert with lag_seconds per replica, root cause classification, top blocking query text, and suggested read replica scaling actions. GitHub issue for persistent lag (>5min) with full WAL queue diagnostic, pg_stat_database conflict counts, and recommended infra fixes (network, disk, slot config). 6h dedup.
postgres: pg_monitor (read pg_stat_replication, pg_stat_activity, pg_stat_database)
postgres: pg_signal (pg_cancel_backend, pg_terminate_backend for terminating blockers)
postgres: pg_read_all_settings (read max_wal_senders, wal_level, replication slots)
⚠️ *replication_lag* | replica: `prod-ro-2` | lag: 47s | cause: long-running query on primary (PID 22109, 3m12s) | action: pg_cancel_backend issued | lag recovering | 🔗 /runbooks/replication_lag
-- 1. Check replication lag across all replicas
SELECT application_name, state,
sent_lsn, write_lsn, flush_lsn, replay_lsn,
round((pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024), 2) AS lag_mb,
write_lag, flush_lag, replay_lag
FROM pg_stat_replication
ORDER BY pg_wal_lsn_diff(sent_lsn, replay_lsn) DESC;
-- 2. Identify root cause (conflict vs. slow query vs. network)
SELECT count(*) AS replication_conflicts,
datname,
numscans
FROM pg_stat_database
WHERE datname IS NOT NULL
GROUP BY datname
ORDER BY replication_conflicts DESC;
-- 3. Find long-running queries on primary blocking replication
SELECT pid, now() - query_start AS duration, state,
usename, application_name, query
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '60 seconds'
ORDER BY query_start ASC
LIMIT 5;
-- 4. Gracefully cancel blocking query (try cancel first)
SELECT pg_cancel_backend(PID);
-- 5. Force terminate if still running after 60s
SELECT pg_terminate_backend(PID)
FROM pg_stat_activity
WHERE pid = YOUR_PID;
Autonomous DB connection pool exhaustion recovery for Postgres, MySQL, and MongoDB: classify idle-in-transaction leak / long-running query / app-side leak / traffic spike / undersized pool, terminate connections within per-minute safety caps, raise pool ceiling within 20% headroom, re-poll utilization at 30s/90s/180s, file structured postmortem.
- Parallel fetch: active/idle/idle-in-tx counts, pool ceiling, top 10 long-running queries, client IPs holding connections
- Classify exhaustion into 5 classes (priority order): idle_in_transaction_leak → long_running_query → app_side_leak → traffic_spike → pool_undersized
- idle_in_transaction_leak: pg_terminate_backend for all iit sessions >5min old (capped at killsPerMinuteCap/min, skips critical allowlist)
- long_running_query: terminate queries exceeding runtimeBudgetSeconds (default 5min), capture query text for EXPLAIN postmortem
- app_side_leak: flag offending app_name/IP, signal app tier to recycle — never kill connections blindly
- traffic_spike: raise pool ceiling (max: dbMaxConnections × 0.8), flag non-critical routes for 503 shedding
- pool_undersized: apply +50% ceiling increase if within safety bound, else escalate with config recommendation
- Re-poll utilization at 30s / 90s / 180s — pass if <70% and no new pool_timeout signals
- Hard cap: killsPerMinuteCap (default 20) connections terminated per minute — rejects all terminations over cap
- Critical app allowlist: never terminates connections from listed application_names (e.g. pgbouncer, admin-proxy)
- Pool ceiling bound: adjustments capped at dbMaxConnections × 0.8 (20% headroom) — never approaches server hard limit
- Replica lag abort: escalates immediately if pg_stat_replication lag >10MB during remediation — avoids split-brain
- BYTEPORT_RUNBOOK_DRY_RUN=true logs all intended actions without executing
- Always escalates for root-cause awareness even after successful termination (stopgap, not a fix)
Postmark + Slack escalation with full diagnostic bundle: connection class, total/active/iit/long-running counts, client IP distribution, terminated PIDs, EXPLAIN capture of top 5 queries (long_running_query class), before/after utilization %, and postmortem markdown. Escalates immediately on: kill cap hit, critical app protected, replica lag spike, or post-verify utilization still elevated.
pg_stat_activity (read — connection diagnostics)
pg_stat_replication (read — replica lag check)
pg_terminate_backend (execute — idle-in-tx + long-running termination, gated)
pg_settings (read — max_connections lookup)
pg_reload_conf (execute — pool ceiling apply, gated within safety bound)
🚨 *db_connection_pool_exhausted* | host: `pg-prod-01/appdb` | connections: `95/100` | class: `idle_in_transaction_leak` | terminated: 47 iit sessions (>5min) from `worker-pool` | utilization after: 38% | 🔗 https://app.byteport.polsia.app/runbooks/db_connection_pool_exhausted
import { DbConnectionPoolExhaustedRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createPrometheusAdapter } from "byteport-agent/adapters/prometheus";
const adapter = createPrometheusAdapter({
url: process.env.PROMETHEUS_URL!,
// Fires db_pool_exhausted when pg connections > 90% of max_connections
});
const registry = new RunbookRegistry();
registry.register(new DbConnectionPoolExhaustedRunbook({
runtimeBudgetSeconds: 300, // terminate queries >5min
killsPerMinuteCap: 20, // hard safety cap
criticalAppAllowlist: ["pgbouncer", "admin-proxy"],
verifyUtilizationThreshold: 70, // pass verify at <70%
poolCeilingSafetyFraction: 0.8, // never exceed 80% of db max
databaseUrl: process.env.DATABASE_URL,
}));
const agent = createAgent({
adapters: [adapter],
runners: [registry.asRunner("runbook")!],
dryRun: false,
});
await agent.run();
Deploys & Rollbacks
Detects failed Kubernetes deployments across 5 signals (deploy_health_regression / readiness_probe_failure_post_deploy / deploy_5xx_spike / canary_slo_burn / rollout_stuck), classifies root cause with a false-attribution guard, and autonomously rolls back to the last-known-good revision with a hard guardrail of max 1 rollback/deployment/hour. If the rollback itself fails, escalates immediately with a full diagnostic bundle including the suspect commit SHA.
- Diagnose: kubectl rollout history to enumerate revisions, identify current and last-good revision, extract commit SHA from CHANGE-CAUSE annotation, count ready vs. total pods (kubectl get deployment -o json), discover downstream Services via label-selector overlap, detect false attribution (regression predating rollout by >5 min in Warning events)
- Classify (first match, confidence ≥0.70): do_not_rollback (regression predates deploy — false attribution, conf=0.85); rollout_stuck (progressDeadlineExceeded + 0 ready pods, conf=0.91); readiness_never_ready (readiness probe failures on new ReplicaSet, conf=0.88); deploy_5xx_spike (HTTP 5xx on new revision, conf=0.90); canary_slo_burn (canary burns SLO budget faster than baseline, conf=0.87); deploy_health_regression (error rate or p95 latency spike post-rollout, conf=0.86). Falls back to escalate when no rollback target can be safely determined.
- Plan: emit rollback plan (kubectl rollout undo --to-revision=
, expected pod churn, estimated duration, rollback-fails-then-escalate tripwire); post plan to ops channel in dry_run mode - Execute: kubectl rollout undo deployment/
-n --to-revision= ; watch ReplicaSet ready count; verify failure signal clears within 5-minute window; if rollback kubectl exits non-zero → escalate immediately (no retry) - Verify: deployment Available=True + Progressing=True, readyReplicas == spec.replicas; retry once at 60s if first check fails; escalate on double-fail with full diagnostic bundle
- Post-mortem: structured record — signal, classification, suspect_revision, last_good_revision, suspect_image, last_good_image, commit SHA (from CHANGE-CAUSE annotation), rollback_duration_seconds, time_to_resolve_seconds, resolved bool
- Max 1 kubectl rollout undo per deployment per hour (in-process rolling counter)
- False attribution guard: if regression predates deploy by >5 min → do_not_rollback, escalate — never roll back a healthy deploy
- Rollback-itself-fails tripwire: if kubectl rollout undo returns non-zero → escalate immediately, do not retry
- do_not_rollback classification always escalates — no autonomous action taken
- Minimum confidence 0.70 before acting autonomously — lower confidence escalates with full diagnostic bundle
- Verify double-fail → escalate with full diagnostic bundle (revision history, pod status, condition timeline)
- BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run: all kubectl rollout undo calls are no-ops; plan is posted to ops channel only
Postmark + Slack escalation with: deployment name, namespace, signal, classification, confidence, suspect revision + commit SHA, last-good revision, pod readiness before/after, actions taken or blocked, rollback duration, guardrail hit (if any). GitHub issue with full kubectl rollout history + pod status snapshot + classification rationale. 1h dedup per deployment.
apps/deployments (get, rollout history, rollout undo, status conditions)
apps/replicasets (get — replica count, readiness state)
core/pods (list — pod counts by label selector)
core/services (list — label selector overlap for blast radius)
core/events (list — false attribution guard, Warning events timeline)
🚨 *failed_deploy_auto_rollback* | deployment: `checkout-service` | namespace: `production` | signal: `deploy_5xx_spike` | class: `deploy_5xx_spike` | suspect_rev: 14 (commit: a1b2c3d) | action: kubectl rollout undo → rev 13 | all pods ready in 87s | 🔗 /runbooks/failed_deploy_auto_rollback
# 1. Check rollout status
kubectl rollout status deployment/DEPLOY -n NAMESPACE
# 2. View revision history
kubectl rollout history deployment/DEPLOY -n NAMESPACE
# 3. Inspect specific revision
kubectl rollout history deployment/DEPLOY -n NAMESPACE --revision=N
# 4. Check pod readiness
kubectl get deployment DEPLOY -n NAMESPACE -o wide
kubectl get pods -n NAMESPACE -l app=DEPLOY
# 5. Roll back to last-good revision
kubectl rollout undo deployment/DEPLOY -n NAMESPACE --to-revision=N
# 6. Watch rollback progress
kubectl rollout status deployment/DEPLOY -n NAMESPACE --timeout=5m
# 7. Verify pods are ready
kubectl get pods -n NAMESPACE -l app=DEPLOY -w
Detects failed deploy workflows via the GitHub Actions API — classifies the failure (test_failure, build_failure, deploy_step_failure, infra_timeout), fetches failed job logs, triggers rollback workflow_dispatch when a previous successful run exists and rollback is enabled, or escalates with a structured summary.
- Fetch failed job logs via GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs
- Classify failure: test_failure (jest/vitest/pytest patterns) | build_failure (tsc/eslint/webpack) | deploy_step_failure (helm/kubectl/docker push) | infra_timeout (ETIMEDOUT/rate-limit)
- Check if a previous successful run exists — if yes and rollbackEnabled: true, dispatch rollbackWorkflowFile via workflow_dispatch with inputs triggered_by=byteport-agent and failed_run_id
- Dedup by run_id — each GitHub Actions run ID is emitted at most once per agent session
- If rollback not enabled or no previous success: escalate with structured summary including failure class, branch, actor, run URL
- Rollback is opt-in: rollbackEnabled must be explicitly set to true — default is false (detect + escalate only)
- Requires Workflows: write permission on the GitHub token for rollback dispatch; Actions: read + Contents: read for detection only
- Dry-run mode: all steps return dry_run status with descriptions of what would execute
- De-duplicates by run_id — no signal is re-raised across fetch() cycles for the same failed run
Postmark + GitHub Issue escalation with: workflow name, branch, actor, run URL, failure class, classification reasoning, whether a previous successful run exists, and whether rollback was triggered. 24h dedup per run_id.
Actions: read — list workflow runs, fetch job logs
Contents: read — repository metadata
Workflows: write — dispatch workflow_dispatch for rollback (optional, only needed when rollbackEnabled: true)
↩️ *deploy_failure* | workflow: `Deploy to Production` | run: `#1847` | branch: `feat/new-checkout` | failure: `build_failure` | author: `alex@` | 🔗 https://github.com/Polsia-Inc/byteport/actions/runs/1847
import { createGitHubActionsAdapter, DeployFailureRunbook, RunbookRegistry, createAgent } from "byteport-agent";
const adapter = createGitHubActionsAdapter({
owner: "myorg",
repo: "infra",
// default: matches /deploy|release|production/i
// deployWorkflowAllowlist: ["Deploy to Production"],
});
const registry = new RunbookRegistry();
registry.register(new DeployFailureRunbook({
repo: "myorg/infra",
rollbackEnabled: false, // set true + rollbackWorkflowFile to auto-rollback
// rollbackWorkflowFile: "rollback.yml",
}));
const agent = createAgent({
adapters: [adapter],
runners: [registry.asRunner("runbook")!],
dryRun: true, // flip false after reviewing the promote checklist
});
await agent.run();
Deploys & Rollbacks
Detects a bad deploy via health-check failure, error-rate spike, or crash-loop signal; automatically rolls back to the last green SHA on Render; verifies health green within 90 s; writes a structured postmortem stub. No human required for the majority of 2am deploy incidents.
- Diagnose: fetch last 5 deploy SHAs from Render API, confirm failing one is newest
- Decide: apply rollback if (a) error rate >5× baseline in 5-min post-deploy window, OR (b) health endpoint non-2xx for >2 consecutive minutes, OR (c) crash loop detected — otherwise escalate with diagnostics
- Execute rollback: call Render API to re-deploy the previous (last-green) SHA
- Verify: poll health endpoint every 15 s for up to 90 s post-rollback
- Postmortem stub: write structured log entry with failing SHA, rolled-back-to SHA, gate reason, TTR, diagnostic snapshot
- rollbackEnabled defaults to true — set false to always escalate instead
- Rollback only fires when a gate is triggered AND a previous SHA is resolvable
- BYTEPORT_RUNBOOK_DRY_RUN=true mode logs all intended actions without executing
- Health verification timeout (default 90 s) prevents false positives from slow cold starts
- If rollback API call fails, escalates immediately with full diagnostic bundle
- If health still failing after 90 s post-rollback, escalates rather than looping
Postmark + Slack escalation with failing SHA, last-green SHA, gate reason, health endpoint status, and postmortem stub. PagerDuty Events v2 if rollback fails or health still degraded post-rollback. 24h dedup by service ID.
Render API key with Deploy + Read access (RENDER_API_KEY env var)
RENDER_SERVICE_ID env var identifies the target service
BYTEPORT_HEALTH_URL env var specifies the health endpoint to poll
🔴 *failed_deploy_auto_rollback* | service: `byteport-prod` | failing SHA: `d3adb33f` → rolled back to `abc1234` | gate: `error_rate_12.4pct_vs_baseline_1.2pct` | health: ✅ green after 45s | TTR: 2m 14s | 🔗 /runbooks/failed-deploy-auto-rollback
import { FailedDeployAutoRollbackRunbook, RunbookRegistry, createAgent } from "byteport-agent";
import { createGitHubActionsAdapter } from "byteport-agent/adapters/github-actions";
// Signal adapter fires deploy_failed when workflow conclusion=failure
const adapter = createGitHubActionsAdapter({
repo: "my-org/my-app",
token: process.env.GITHUB_TOKEN,
});
const registry = new RunbookRegistry();
registry.register(new FailedDeployAutoRollbackRunbook({
serviceId: process.env.RENDER_SERVICE_ID,
apiKey: process.env.RENDER_API_KEY,
healthUrl: "https://my-app.onrender.com/health",
healthWaitSec: 90,
}));
const agent = createAgent({
adapters: [adapter],
runners: [registry.asRunner("runbook")!],
dryRun: false,
});
await agent.run();
Application
Classifies application errors from Sentry by stack-trace heuristics (db_connection, oom, deploy_correlation, upstream_5xx, unknown), auto-rolls back regressions correlated with a recent deploy SHA, scales up for OOM patterns, and escalates novel errors with a structured summary.
- Classify error by stack trace: db_connection → correlate with db_connections_saturated runbook
- OOM pattern: trigger configurable scale-up command (kubectl scale or equivalent) — opt-in
- Regression + deploy SHA within 30-min window: dispatch rollback command — opt-in, default off
- Upstream 5xx: log circuit-breaker status and escalate with latency context
- Novel/unknown: open incident channel and escalate with full Sentry permalink, event count, and affected users
- Rollback requires explicit rollbackEnabled: true + rollbackCommand — default is escalate-only
- Scale-up requires explicit scaleUpEnabled: true + scaleUpCommand — default is escalate-only
- Deploy correlation window is configurable (default 30 min) — prevents stale SHA false positives
- Dry-run mode: logs classification and proposed action without executing any mutation
- All actions include structured escalation reason with Sentry shortId, permalink, event count, and affected users
Postmark + GitHub Issue escalation with: Sentry shortId, issue title, permalink, event count, affected users, classification reasoning, and specific remediation guidance per class. Dedup by issue ID + first-seen date.
Sentry: auth-token with project:read + event:read + org:read (no write scope required)
Rollback: depends on rollbackCommand (e.g. kubectl or helm — caller-supplied)
Scale-up: depends on scaleUpCommand (e.g. kubectl scale — caller-supplied)
⚠️ *application_error* | project: `api-service` | class: `db_connection` | events: 142 in 5m | correlated: deploy SHA `a3f9c12` 18 min ago | action: correlated with db_connections_saturated runbook, checking connection pool | 🔗 https://sentry.io/organizations/example/issues/?project=1
import { createSentryAdapter, ApplicationErrorRunbook, RunbookRegistry, createAgent } from "byteport-agent";
const adapter = createSentryAdapter({
org: "myorg",
authToken: process.env.SENTRY_AUTH_TOKEN,
project: "backend-api", // optional: filter to one project
windowSeconds: 300, // look-back window (default: 5 min)
spikeMultiplier: 2.0, // 2× baseline events/min fires spike signal
});
const registry = new RunbookRegistry();
registry.register(new ApplicationErrorRunbook({
rollbackEnabled: false, // set true + rollbackCommand to auto-rollback regressions
// rollbackCommand: "helm rollback myapp 0 -n production",
scaleUpEnabled: false, // set true + scaleUpCommand to auto-scale on OOM
// scaleUpCommand: "kubectl scale deployment api -n production --replicas=6",
}));
const agent = createAgent({
adapters: [adapter],
runners: [registry.asRunner("runbook")!],
dryRun: true,
});
await agent.run();
Coming in the next two weeks
We ship weekly. These are confirmed and in-progress:
kubectl rollout undo or GitHub Actions revert when canary thresholds breach.Don't see the runbook your team needs? Request one on GitHub — we scope custom runbooks for Growth and Scale customers.
Don't see your pager class?
We scope custom runbooks for Growth and Scale customers. Book a 15-min call and we'll tell you exactly what it would take to automate your specific alert.