CrashLoopBackOffRunbook ships as the 10th flagship pager class — the single most-paged Kubernetes failure mode in production. Handles 5 signals: crashloopbackoff (pod status.containerStatuses[].state.waiting.reason == CrashLoopBackOff), container_restart_threshold_exceeded (>5 restarts in 10-min window, sub-severity thresholds at 3/10/30 restarts), pod_not_ready_oom (exit code 137 on first container start), liveness_probe_failure (Unhealthy events from kube-apiserver), readiness_probe_failure (pod excluded from Service endpoints). Adapters: K8s Events, Prometheus, Alertmanager, Datadog, Grafana Alerting, PagerDuty.
6-step decision loop: (1) Detect — watch pod status.containerStatuses[].state.waiting.reason == CrashLoopBackOff via kubectl; restart count delta with 3/10/30 sub-severity thresholds; exit code 137 for pod_not_ready_oom; Unhealthy events for probe failures. (2) Diagnose — kubectl logs --previous (last 2000 chars), kubectl describe pod events, image pull status check, ConfigMap/Secret existence from events, liveness/readiness probe spec extraction, recent probe Unhealthy events, node MemoryPressure condition. (3) Classify (first-match, confidence ≥0.70): oom_at_startup (exit code 137 or pod_not_ready_oom signal, conf=0.93 — bump memory 1.5×); image_pull_failure (ImagePullBackOff/ErrImagePull in events, conf=0.94 — rollback to last-good ReplicaSet); config_error_missing_env (ConfigMap/Secret not found in events, conf=0.91 — audit + surface to operator, never auto-create Secrets); liveness_probe_misconfigured (probe firing within startup window + healthy startup indicators in logs, conf=0.86 — propose initialDelaySeconds bump + failureThreshold=5); dependency_not_ready (exit 0/1 + connection-refused/dial-tcp/ECONNREFUSED pattern in previous logs, conf=0.88 — annotate pod with dependency-hold, wait for dep healthcheck). (4) Remediate: oom_at_startup → kubectl patch deployment memory limit 1.5× (ceiling 2×, max 1/pod/24h); image_pull_failure → kubectl rollout undo deployment (max 1/service/hour); config_error_missing_env → kubectl get configmaps/secrets audit, surface missing names, skip_annotation respected; liveness_probe_misconfigured → kubectl patch deployment livenessProbe.initialDelaySeconds = max(current×2, 30) and failureThreshold=5; dependency_not_ready → kubectl annotate pod byteport.io/dependency-hold-since, probe endpoints, escalate if dep unreachable after timeout. (5) Verify — poll pod Ready condition every 15s up to 5 min stabilisation window; restart count must not increment; Unhealthy probe events must be absent. (6) Escalate — unknown class after 2 passes, memory ceiling hit, dep healthcheck timeout, rollback rate limit; emit structured escalation bundle with previous logs snippet.
Hard guardrails: max 1 rollback per service per 1-hour window (_rollback_history in-process dict); max 1 memory bump per pod per 24h (_memory_bump_history in-process dict); limit bump ceiling 2× original (OOM_LIMIT_BUMP_CEILING_MULTIPLIER=2.0); never auto-create Secrets (config_error class surfaces missing keys and stops); byteport.io/skip-crashloop-remediation=true annotation skips all classification and remediation; confidence <0.70 always escalates; BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run flag disables all mutations; never delete PVCs or PVs.
OOMKilledMemoryPressureRunbook closes the highest-volume Kubernetes page class — pods killed by the cgroup OOM killer — as the 9th flagship pager class. Handles 6 signals: pod_oom_killed (container exit reason OOMKilled, exit code 137), memory_pressure_high (container_memory_working_set_bytes ≥90% of limit via Prometheus/metrics-server), node_memory_pressure (NodeCondition MemoryPressure=True triggering pod evictions), cgroup_oom_event (cgroup memory.events oom_kill counter incremented), jvm_heap_exhausted (java.lang.OutOfMemoryError in pod logs), gomemlimit_exceeded (Go runtime out of memory). Adapters: Prometheus, metrics-server, K8s Events, Datadog, Grafana Alerting, PagerDuty.
6-step decision loop: (1) Detect — pod exit reason OOMKilled via kubectl; NodeCondition MemoryPressure; JVM/Go OOM markers in previous logs; metrics-server working-set vs limit. (2) Diagnose — kubectl describe pod, restart count, memory limit/request, HPA discovery, recent deploy timestamp, runtime hints (JVM -Xmx, Go GOMEMLIMIT, Node --max-old-space-size from env vars/image name), optional Prometheus series for trend/correlation analysis. (3) Classify (first-match, confidence ≥0.70): runtime_misconfig (heap config > container limit, conf=0.93 — PR proposal, not auto-bump); node_pressure (NodeCondition MemoryPressure=True + pod healthy <70%, conf=0.88 — cordon+drain); genuine_memory_leak (RSS rising across 3+ restarts, no traffic correlation, conf=0.85 — escalate, never auto-bump); traffic_spike (RSS correlated with RPS via Pearson ≥0.70, conf=0.86 — trigger HPA or bump limit); undersized_limit (RSS plateau ≥85% of limit, no trend, conf=0.84 — bump limit). (4) Remediate: undersized_limit/traffic_spike → apply memory limit patch on Deployment/StatefulSet (≤2× current, ≤80% of node allocatable, 1 bump/pod/24h cooldown); traffic_spike → annotate HPA to trigger scale-out before limit bump; node_pressure → cordon + drain with PDB respect; runtime_misconfig → propose corrected -Xmx/GOMEMLIMIT/--max-old-space-size env var as PR description (never auto-applied); genuine_memory_leak → open escalation bundle with heap profile path, stop acting. (5) Verify — pod Ready; working-set <80% of new limit; 0 OOMKill events in verify window; no cascading evictions on same node. (6) Postmortem — structured record: pod, node, signal, classification, old/new limit, actions_taken, escalated bool, time_to_resolve.
Hard guardrails: max 1 autonomous limit bump per pod per 24h (_limit_bump_history in-process dict); never exceed 2× original limit or 80% of node allocatable; byteport.io/skip-memory-remediation=true annotation respected (pod skipped entirely); confidence <0.70 always escalates; never auto-bump on genuine_memory_leak (would mask real bugs); max 1 concurrent node drain (_active_node_drains lock); BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run disables all mutations; StatefulSet owners flagged (is_stateful=True) for operator awareness. Prometheus integration optional: BYTEPORT_PROMETHEUS_URL fetches working_set series + RPS series for trend/correlation; falls back to heuristics if unavailable.
NodeNotReadyRunbook ships as the 8th flagship pager class — the single most common 2am page in any production K8s shop. Handles 6 signals: node_not_ready (kubelet stopped reporting to the API server, NodeCondition Ready=False/Unknown), node_disk_pressure (kubelet eviction threshold breached for disk; inode exhaustion also detected via df -i), node_memory_pressure (MemoryPressure eviction threshold), node_pid_pressure (PIDPressure eviction threshold), kubelet_unhealthy (systemd unit failed or restart loop in journal), node_network_unavailable (CNI/network plugin failure). Adapters: K8s Events, Prometheus, Datadog, Alertmanager, Grafana Alerting, PagerDuty.
6-step decision loop: (1) Detect — poll NodeCondition transitions via kubectl; pull kubelet systemd status (systemctl is-active/is-failed); read kernel messages (journalctl -k --since '30 min ago') for OOM/MCE/hardware errors. (2) Diagnose — kubectl describe node (NodeConditions, allocatable vs. capacity, taints, events), journalctl -u kubelet (last 200 lines), container runtime status (containerd/crio), inode/fd exhaustion, image disk usage via crictl/docker. (3) Classify (first match, confidence-gated ≥0.70): hardware_cloud_failure (kernel OOM/MCE/panic, conf=0.94, always escalate); network_plugin_failure (NetworkUnavailable + CNI errors, conf=0.88); runtime_crash (containerd/crio unit failed/inactive, conf=0.91); resource_pressure_disk (DiskPressure condition or disk ≥85% or inode_free ≤5%, conf=0.90); resource_pressure_mem_pid (MemoryPressure/PIDPressure condition, conf=0.88); transient_kubelet_hang (inactive but not failed, restart_count ≤2, conf=0.82); unknown (<0.70, always escalate). (4) Remediate — cordon first; then per class: transient_kubelet_hang → systemctl restart kubelet; runtime_crash → drain → restart containerd/crio → restart kubelet; resource_pressure_disk → drain → crictl image prune + delete evicted pods + restart kubelet; resource_pressure_mem_pid → drain → restart kubelet; network_plugin_failure → delete CNI daemonset pod on node (flannel/calico/cilium/weave). (5) Verify — node returns Ready in 5 min window; no new eviction events; no PDB violations; retry once at 15s; escalate on double-fail. (6) Postmortem — structured record: node, signal, classification, root_cause, actions_taken, pods_evicted, pods_rescheduled, time_to_recovery_seconds, escalated bool.
Hard guardrails: max 1 concurrent drain (in-process _active_drains lock — refuses second drain with escalation); control-plane nodes (node-role.kubernetes.io/control-plane or master label) always escalate, never touch; PDB-aware drain (kubectl drain --disable-eviction=false, PDB violation surfaces verbatim and stops remediation); hardware/cloud failures always escalate unless BYTEPORT_AUTO_REPLACE_NODES=true; image prune only (never deletes PVs/PVCs); confidence <0.70 always escalates; BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run disables all mutations.
25 unit tests appended to runbooks/test_runbooks.py: TestNodeNotReadyDetect (2 cases), TestNodeNotReadyClassify (8 cases: one per classification branch including control-plane guard), TestNodeNotReadyRemediate (5 cases: dry-run, escalation skip, low-confidence skip, control-plane guard, runtime crash branch), TestNodeNotReadyVerify (3 cases: dry-run resolved, node-ready resolved, not-ready unresolved), TestNodeNotReadyPostMortem (2 cases: shape + escalated flag), TestNodeNotReadySignalHandling (2 cases: invalid signal raises + all 6 accepted), TestNodeNotReadyGuardrails (3 cases: concurrent drain limit, max drain slots, inode exhaustion classifier). Delivery: runbooks/node_not_ready_runbook.py; tests appended; data/runbooks.json entry added (shipped, kubernetes, critical, v0.1.28); /runbooks gallery now shows 8 flagship pager classes.
TlsCertExpiryAutoRenewalRunbook closes the gap between the /demo page TLS scenario and shipped code. Handles 6 signals: cert_expiring_soon (with 30/14/7/1-day sub-severity thresholds — medium/high/critical/expired), cert_expired (immediate escalation path with full renewal attempt), cert_chain_invalid (intermediate chain breaks detected via openssl verify), ocsp_stapling_failure (Must-Staple extension set but staple not presented), acme_challenge_failure (HTTP-01 or DNS-01 rejected by CA), renewal_rate_limit_hit (ACME rate-limit window exhausted). Adapters: Prometheus, Alertmanager, Datadog Monitors, Grafana Alerting, PagerDuty.
6-step decision flow: (1) Diagnose — openssl s_client -showcerts -status to fetch served cert; parse notAfter, issuer O=/CN=, SAN list, serial, OCSP Must-Staple extension and staple presence; detect wildcard; locate cert on disk (certbot live dir, /etc/ssl/certs, /etc/nginx/ssl); identify web server via pgrep (nginx/envoy/haproxy/traefik/caddy/k8s ingress); check DNS zone ownership (dig SOA); detect cert-manager Certificate CRD in cluster; estimate ACME rate-limit window remaining; check BYTEPORT_PINNED_DOMAINS for mobile cert pinning. (2) Classify (first match, confidence-gated at 0.70): acme_rate_limited (rate-limit hit, conf=0.92, always escalate); cert_pinned (BYTEPORT_PINNED_DOMAINS match, conf=0.95, always escalate — mobile client risk); cert_manager (CRD present, conf=0.93); cloud_managed (issuer O in {Amazon, Google Trust Services}, conf=0.88); acme_dns01 (wildcard or challenge_failure + dns_provider, conf=0.91); acme_http01 (LE/ZeroSSL, non-wildcard, zone owned, conf=0.90); internal_ca (self-signed, conf=0.87); commercial_ca (fallback, conf=0.75). (3) Remediate: acme_http01 → certbot renew --webroot or --standalone; acme_dns01 → certbot --dns-{route53|cloudflare|google|azure}; cert_manager → kubectl delete secret -tls + annotate for force re-issue; cloud_managed → aws acm renew-certificate --certificate-arn $BYTEPORT_ACM_ARN; all escalation classes → emit escalation with retry ETA. (4) Hot-reload (server-aware, zero downtime): nginx → nginx -s reload; caddy → caddy reload; haproxy → haproxy -sf; envoy/traefik → SIGHUP; k8s ingress → kubectl rollout restart deployment/. (5) Verify: re-fetch cert; confirm notAfter > now+30d; chain valid; OCSP staple present (if Must-Staple); serial changed from pre-renewal; handshake_ok — rollback on failure. (6) Postmortem: structured record with domain, signal, classification, issuer, challenge_used, tool_used, days_remaining before/after, chain_valid_after, ocsp_ok_after, time_to_resolve_seconds, escalated bool.
Hard guardrails: DNS zone ownership confirmed before any ACME action; cert backup always created before swap, restored on verify failure; pinned certs always escalate; ACME rate-limit awareness with in-process window tracking; wildcard certs require dns-01 provider or escalate; confidence <0.70 always escalates; BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run disables all mutations.
FailedDeployAutoRollbackRunbook closes the mission-critical gap: failed deploys. Handles 5 signals: deploy_health_regression (error rate or p95 latency spike within the observation window of a new rollout), readiness_probe_failure_post_deploy (new ReplicaSet pods never reach Ready), deploy_5xx_spike (HTTP 5xx rate crosses threshold on new revision), canary_slo_burn (canary cohort burns SLO budget faster than stable baseline), rollout_stuck (Deployment progressDeadlineExceeded). Adapters: Prometheus, Datadog Metrics, K8s Events, Alertmanager, Grafana Alerting.
5-step decision flow: (1) Diagnose — kubectl rollout history to enumerate revisions and extract CHANGE-CAUSE annotations (commit SHA, CI run URL), identify current and last-good revision, get pod ready counts from Deployment status JSON, discover downstream Services via label-selector overlap for blast radius, run false-attribution guard (Warning events >5 min before rollout → regression predates deploy). (2) Classify (first match, confidence-gated at 0.70): do_not_rollback (false attribution, conf=0.85); rollout_stuck (0 ready pods + progressDeadlineExceeded, conf=0.91); readiness_never_ready (readiness probes failing on new ReplicaSet, conf=0.88); deploy_5xx_spike (conf=0.90); canary_slo_burn (conf=0.87); deploy_health_regression (conf=0.86). Falls back to escalate when no rollback target can be safely determined (empty history). (3) Plan — emit rollback plan: kubectl rollout undo deployment/ --to-revision=, expected pod churn, estimated duration, explicit 'rollback-fails-escalate' tripwire; plan posted to ops channel in dry_run mode. (4) Execute — kubectl rollout undo with --to-revision; if kubectl exits non-zero → escalate immediately, no retry (rollback-fails tripwire). Record rollback in in-process rate-limit counter (max 1/deployment/hour). (5) Verify: check deployment Available=True + Progressing=True + readyReplicas == spec.replicas; retry once at 60s; escalate on double-fail.
Hard guardrails: max 1 rollback per deployment per hour; false attribution guard prevents rolling back a healthy deploy when root cause predates the rollout; rollback-itself-fails tripwire escalates immediately with no retry; confidence <0.70 always escalates; verify double-fail escalates with full diagnostic bundle; BYTEPORT_RUNBOOK_DRY_RUN=true or --dry-run disables all kubectl mutations.
Post-mortem: structured record — signal, classification, suspect_revision, last_good_revision, suspect image, last_good image, commit SHA (from CHANGE-CAUSE annotation), rollback_duration_seconds, time_to_resolve_seconds, resolved bool. Links to the offending commit SHA when available via Deployment annotations.
DbConnectionPoolExhaustionRunbook closes the #1 database pager class — connection pool exhaustion. Handles 6 signals: db_pool_saturated (active ≥ max for >60s), db_connection_timeout (app-side acquire timeout from Hikari/SQLAlchemy/pgx), db_too_many_connections (Postgres FATAL 53300 / MySQL ER_CON_COUNT_ERROR), idle_in_transaction_backlog (pg_stat_activity idle-in-transaction count threshold), pgbouncer_pool_full (SHOW POOLS waiting > 0), connection_leak_post_deploy (connections rising monotonically after release marker). Adapters: Prometheus, Datadog, Alertmanager, Grafana Alerting, PagerDuty, Generic Webhook.
5-step decision flow: (1) Diagnose — query pg_stat_activity for full state breakdown (active/idle/idle-in-tx/waiting), capture top wait_events and long-running queries (duration + query text), pull pgbouncer SHOW POOLS + SHOW STATS, correlate connection trend with last deploy SHA, inspect Hikari/SQLAlchemy/pgx env vars for pool config. (2) Classify (first match, confidence-gated at 0.70): leak_after_deploy (deploy marker + rising idle-in-tx trend, conf=0.88); idle_in_tx_backlog (≥3 sessions older than threshold, conf=0.91); long_running_query (≥2 queries over budget or 1 >10min, conf=0.85); missing_pgbouncer (pgbouncer_waiting>0, conf=0.80); runaway_cron_fanout (single app_name holds ≥15 connections, conf=0.82); undersized_pool (utilization ≥85% with no dominant cause, conf=0.75). Falls back to escalate when confidence <0.70. (3) Remediate (least-invasive first): idle_in_tx_backlog — terminate via pg_terminate_backend (kills_per_minute_cap=20, skips pgbouncer/superuser); long_running_query — cancel via pg_cancel_backend then pg_terminate_backend if needed; missing_pgbouncer/undersized_pool — send RELOAD to pgbouncer admin console; leak_after_deploy — terminate old idle-in-tx sessions then kubectl rollout restart offending deployment; runaway_cron_fanout — terminate idle-in-tx connections from dominant application_name within cap. (4) Verify: re-sample at 30s; pass if utilization <70%, lock_waiters=0, error_rate normalized; retry once at 60s; escalate on double-fail. (5) Postmortem: emit structured record with signal, class, before/after utilization, pids_terminated, top 5 query fingerprints (literals stripped), and permanent fix recommendation per class (statement_timeout for idt; EXPLAIN for long queries; PgBouncer adoption for undersized; lifecycle audit for leak).
Hard guardrails: kills_per_minute_cap (default 20) — never exceeds regardless of queue depth; pgbouncer/admin-proxy/superuser connections always protected; confidence <0.70 always escalates; verify double-fail escalates with full diagnostic bundle; BYTEPORT_RUNBOOK_DRY_RUN=true disables all mutations.
28 unit tests in runbooks/test_runbooks.py: TestDbPoolClassify (7 cases), TestDbPoolRemediate (5 cases), TestDbPoolGuardrails (5 cases), TestDbPoolVerify (3 cases), TestDbPoolPostmortem (2 cases), TestDbPoolSignalHandling (5 cases) — plus the 5 previously-shipped flagship runbook tests.
CrashLoopBackOffRunbook closes the #1 Kubernetes pager class — pods stuck in CrashLoopBackOff. Triggered by crashloopbackoff (K8s CrashLoopBackOff state), container_restart_threshold_exceeded (>5 restarts in 10 min), pod_not_ready_oom (exit 137 at startup), liveness_probe_failure, readiness_probe_failure, and init_container_failure signals from K8s Events, Prometheus, Datadog, Alertmanager, and Grafana Alerting adapters.
5-step decision flow: (1) Diagnose — kubectl describe pod, previous + current container logs (last 2000 chars each), exit code analysis (137 OOM / 139 segfault / 127 missing binary / 1 app error), event timeline scraped for ImagePullBackOff / missing ConfigMap / missing Secret, probe definitions (initialDelaySeconds / failureThreshold / scheme), resource requests and limits from pod spec JSON, node MemoryPressure/DiskPressure conditions, ownerReference-based deployment name inference. (2) Classify (first match, confidence-gated at 0.70): node_pressure_eviction (node has MemoryPressure/DiskPressure, confidence 0.88); bad_image_or_tag (ImagePullBackOff/ErrImagePull in events, confidence 0.92); missing_configmap_or_secret (missing K8s resource event, confidence 0.90); failed_init_container (init container exit_code ≠ 0, confidence 0.87); oom_at_startup (exit code 137, confidence 0.91); probe_misconfigured (probe fires before app startup complete per log indicators, confidence 0.83); app_panic_on_boot (exit 139 or panic/fatal/exception in logs, confidence 0.80–0.86). Falls back to escalate when confidence <0.70. (3) Remediate within guardrails: oom_at_startup — kubectl patch deployment memory limit 1.5× current (max 1 bump/pod/day); bad_image_or_tag — kubectl rollout undo (max 1/service/hour); failed_init_container — delete pod to re-init if internal, escalate if external dep; probe_misconfigured — patch deployment doubling initialDelaySeconds + failureThreshold=5; missing_configmap_or_secret — audit and surface (never auto-create Secrets); app_panic_on_boot — extract stack trace, rollback if restart_count ≥3; node_pressure_eviction — cordon node, delete pod to reschedule onto healthy node. (4) Verify: pod reaches Ready, restart delta=0, probe Unhealthy events absent; retry once after 60s. (5) Post-mortem: emit structured event (signal, classification, action, time-to-resolve, rollback-needed bool) for the /demo timeline.
Hard guardrails: max 1 rollback per service per hour (in-process rolling counter); max 1 resource bump per pod per 24h (in-process rolling counter); never delete PersistentVolumeClaims or PersistentVolumes; missing_configmap_or_secret always requires operator action for creation; confidence <0.70 escalates.
MemoryPressureOomRunbook closes the #2 non-DB 2am pager class after disk — OOM kills and sustained memory pressure. Triggered by oom_killed (cgroup/kernel OOM killer), memory_pressure_high (≥90% RSS sustained 5 min), swap_thrash (swap-in rate spike), pod_evicted_memory (k8s MemoryPressure node condition), jvm_heap_exhausted, and node_heap_oom signals from any adapter (Prometheus, CloudWatch, Datadog, K8s Events, Alertmanager, Grafana Alerting).
5-step decision flow: (1) Diagnose in parallel — cgroup memory.stat (/sys/fs/cgroup/memory.current + memory.max, cgroup v1 fallback), /proc/meminfo (MemTotal/MemAvailable/Swap), top-N processes by VmRSS from /proc//status, recent OOM-kill dmesg entries, K8s pod spec + memory limits + kubectl top pod + node MemoryPressure condition + HPA availability, JVM heap via jcmd GC.heap_info + GCOverhead via PerfCounter.print, Node.js heap via --max-old-space-size from /proc//cmdline. (2) Classify (first match, confidence-gated): runtime_misconfig (JVM/Node heap > container limit, confidence 0.90–0.92); noisy_neighbor (node has MemoryPressure, victim pod usage <80% of its own limit, confidence 0.85); undersized_limit (steady-state usage >80% of limit, no rising trend, confidence 0.80); memory_leak (RSS trend rising, no traffic correlation, OOM-kill log or swap-thrash, confidence 0.82); traffic_spike (RSS rising with traffic correlation, confidence 0.78). Falls back to escalate when confidence <0.70. (3) Remediate within guardrails: memory_leak — capture jcmd GC.heap_dump or SIGUSR1 heap snapshot, exponential-backoff restart (30s×2^N, cap 300s); kubectl rollout restart for k8s, systemctl restart for bare-metal; traffic_spike — annotate deployment to trigger HPA scale-out if available, else kubectl patch memory.limit (dry-run=server validation + max 2× guard); undersized_limit — propose manifest PR with suggested limit (p99 usage×1.5), never auto-apply; noisy_neighbor — kubectl cordon node + drain (--ignore-daemonsets --delete-emptydir-data --grace-period=60), page if drain fails; runtime_misconfig — propose corrected -Xmx or --max-old-space-size with 25% cgroup headroom, never auto-apply. (4) Verify: re-read cgroup usage, confirm RSS <70% of limit AND no new OOM-kill AND swap_in<10 pages/s; retry once after 30s. (5) Escalate if confidence <0.70, guardrail blocks action, or both verify attempts fail.
Hard guardrails: max 3 auto-restarts per service per hour (in-process rolling counter); max 2× memory limit bump; never touch StatefulSet workloads without BYTEPORT_STATEFUL_ALLOWLIST; undersized_limit and runtime_misconfig never auto-apply; noisy_neighbor drain-fails → page immediately.
26 unit tests in runbooks/test_runbooks.py: TestMemOomClassify (8 cases), TestMemOomRemediate (7 cases), TestMemOomVerify (4 cases), TestMemOomRestartGuardrails (4 cases), TestMemOomSignalHandling (3 cases).
TlsCertExpiryAutoRenewRunbook closes the highest-shame 2am page in the industry — the cert that expired because the human forgot. Triggered by cert_expiring_soon (≤14d), cert_expired, ssl_handshake_failure, and x509_chain_invalid signals from any adapter (Datadog, Prometheus, CloudWatch, Grafana, PagerDuty).
5-step decision flow: (1) Diagnose — openssl s_client -connect DOMAIN:443 -servername DOMAIN -status -showcerts → parse notAfter, issuer O=/CN=, SAN list, OCSP staple presence, chain validity; locate cert files on disk (certbot live dir, /etc/ssl, /etc/pki/tls); identify serving process by pgrep (nginx/caddy/envoy/haproxy/traefik) or kubectl certificate CRD presence. (2) Classify (first match): cert_manager_managed if k8s env detected or --cert-name arg given; wildcard_dns01_required if *.domain with supported DNS provider (Route53/Cloudflare/GCP); acme_renewable if Let's Encrypt / ZeroSSL issuer or cert found in certbot live dir; internal_ca or commercial_ca_manual → escalate with full context. (3) Renew: backup cert+key to /var/lib/byteport/cert-backups/-/ before any write; certbot renew (http-01 default, or --dns-PROVIDER for dns-01); acme.sh as fallback; cert-manager: kubectl annotate cert-manager.io/force-renew=true + wait --for=condition=Ready --timeout=120s. (4) Hot-reload: nginx -s reload / caddy reload / kill -HUP envoy / systemctl reload haproxy|traefik / kubectl rollout restart ingress-controller. (5) Verify: re-fetch cert via openssl s_client, confirm notAfter > now + 30d AND chain_valid AND handshake_ok; rollback (restore backup + re-reload) and escalate on failure.
Safety rails: DNS zone ownership check via dig/nslookup — refuses to act on externally-managed domains; backup-before-swap always runs first (abort if backup fails); rollback on verification failure restores prior cert + hot-reloads again; commercial_ca_manual and internal_ca issuers always escalate — never attempt autonomous renewal; wildcard without DNS provider escalates rather than attempting http-01; MIN_DAYS_AFTER_RENEW=30 post-renewal floor — rejects a cert with unusually short validity; BYTEPORT_RUNBOOK_DRY_RUN / --dry-run makes all subprocesses no-ops.
DiskSpaceAutoReclaimRunbook closes the literal mission example — 'disk hit 95%' — the highest-volume non-DB 2am pager class. Triggered by disk_pressure, inode_exhaustion, volume_full, and disk_almost_full signals from any adapter (Prometheus, CloudWatch, GCP Monitoring, K8s Events, Alertmanager).
5-step decision flow (stops when usage drops below freeSpaceFloorPct, default 80%): (1) Diagnose in parallel — df -h, df -i (inodes), top-10 space consumers via du -sh, docker system df, journald --disk-usage; captures mount point and filesystem type. (2) Rotate+compress logs — force logrotate, gzip files >100MB older than 24h, journalctl --vacuum-time=7d. (3) Evict container layers — docker system prune -af --filter until=72h on bare-metal/VMs; crictl rmi --prune on K8s nodes (detected via signal namespace/node metadata). (4) Drop expired tmp/cache — /tmp + /var/tmp >7 days, apt/yum/dnf cache, pip cache purge, npm cache clean --force. (5) Grow volume — if cloud-managed and policy.allowVolumeGrow=true: EBS ModifyVolume / GCP disks.resize +25% capped at policy.maxVolumeGb (default 500 GB), then growpart + resize2fs/xfs_growfs.
Escalation: if usage still >85% (configurable escalateAbovePct) after all 5 steps, escalates with a sized +N GB recommendation based on current excess and top-5 largest directories attached.
Cloud provider detection: reads cloudProvider/volumeId/ebsVolumeId from signal metadata for EBS; gcpDisk/gcpZone/gcpProject for GCP. Falls back to AWS_DEFAULT_REGION / GOOGLE_CLOUD_PROJECT env vars on the agent host.
Safety rails: stop-on-floor re-reads disk % after each step; volume grow is opt-in (policy.allowVolumeGrow defaults false); maxVolumeGb cap (default 500) prevents runaway grows; BYTEPORT_RUNBOOK_DRY_RUN=true logs all actions without executing; escalates with recommendation when floor not restored.
Config interface: { freeSpaceFloorPct, escalateAbovePct, mount, isK8sNode, fsType, policy: { allowVolumeGrow, maxVolumeGb } }. All fields optional with safe defaults. Exported as DiskSpaceAutoReclaimConfig, DiskSpaceAutoReclaimPolicy, CloudProvider.
DbConnectionPoolExhaustedRunbook closes the #3 most common 2am incident class — right behind disk pressure and memory pressure, both already handled. Triggered by db_pool_exhausted, connection_refused_db, too_many_connections, or pool_timeout signals from any adapter (Prometheus, Datadog, Alertmanager, Grafana Alerting).
5-step decision flow: (1) Diagnose — parallel fetch of active/idle/idle-in-transaction connection counts, pool ceiling, top 10 long-running queries with duration, client IPs holding connections, and traffic delta (req/sec 5min vs prior hour). (2) Classify — 5 classes, first match wins: idle_in_transaction_leak (>30% connections are stale iit) → long_running_query (>20% connections are long-running) → app_side_leak (single client holds >40% of pool) → traffic_spike (traffic ratio >2.5×) → pool_undersized. (3) Resolve (gated). (4) Verify — re-poll at 30s/90s/180s, pass if utilization <70%. (5) Postmortem — structured doc with signals, classification, actions, before/after metrics, EXPLAIN capture.
idle_in_transaction_leak: pg_terminate_backend for all sessions idle-in-transaction >5min. Filtered by critical app allowlist. Capped at killsPerMinuteCap (default 20/min). After termination, still escalates for root-cause investigation — termination is a stopgap.
long_running_query: terminate all queries exceeding runtimeBudgetSeconds (default 5min). Captures query text for EXPLAIN postmortem before terminating. Per-minute kill cap applies.
app_side_leak: flags the offending application_name/IP that holds >40% of pool connections. Signals app tier to recycle. Does NOT kill connections blindly — critical apps are protected by allowlist.
traffic_spike: raises pool ceiling by up to +50% (capped at dbMaxConnections × 0.8). Flags non-critical routes for 503 shedding. Always escalates for sustained traffic investigation.
pool_undersized: proposes and applies +50% ceiling increase if within safety bound (dbMaxConnections × 0.8). Escalates with config recommendation when at bound or when ceiling raise fails.
Safety rails: killsPerMinuteCap hard cap (default 20, configurable); criticalAppAllowlist never terminates named apps (e.g. pgbouncer, admin-proxy); poolCeilingSafetyFraction bound (default 0.8 = 20% headroom); replica lag abort — escalates immediately if pg_stat_replication lag >10MB during remediation to avoid split-brain; BYTEPORT_RUNBOOK_DRY_RUN=true logs without executing.
Config interface: { runtimeBudgetSeconds, killsPerMinuteCap, criticalAppAllowlist, verifyUtilizationThreshold, poolCeilingSafetyFraction, verifyPollIntervalsMs, databaseUrl }. All fields optional with safe defaults. Exported as DbConnectionPoolExhaustedConfig.
ContainerRestartLoopBreakerRunbook closes the highest-frequency incident class still uncovered: CrashLoopBackOff and container restart loops. Triggered by container_restart_loop or crash_loop_backoff signals from any adapter (K8s Events, Render, ECS, Docker, Nomad). Hard cap: one auto-remediation per container per hour to prevent runbook-interference with the incident.
Diagnosis (step 1): fetches the last 200 lines of container logs, last 5 exit codes, restart count, and image SHA. Classifies failure into one of 6 classes — first match wins: config_error (env var / secret missing), dependency_down (DB / cache / upstream unreachable), oom_at_startup (OOMKilled before ready), port_conflict (port bind failure), image_pull_failure (ErrImagePull / ImagePullBackOff), unknown.
Config error (class 2): checks whether referenced secrets exist in the orchestrator (kubectl get secret); surfaces the exact diff for operator action. Does NOT auto-restore secrets — always escalates with the full diff. Prevents accidental secret injection into the wrong environment.
Dependency down (class 3): runs TCP health checks (bash /dev/tcp) against named upstreams extracted from log lines (ECONNREFUSED, dial tcp, postgres/redis/amqp patterns). If a known upstream is itself down, defers remediation and chains to that incident so the dependency is fixed first.
OOM at startup (class 4): reads current memory limit via kubectl get deployment jsonpath, bumps by 25% (configurable oomMemoryBumpPct), capped at 2× original (configurable oomMemoryCapMultiplier). Applies patch via kubectl patch deployment JSON merge, verifies rollout via kubectl rollout status with configurable healthWaitSec (default 300s). Rolls back classification to escalation if cap already reached.
Port bind / image pull failure (class 5): rolls back to the last green revision. On K8s: kubectl rollout undo deployment. On Render: reuses buildFetchDeploysCmd + buildRenderRollbackCmd from FailedDeployAutoRollbackRunbook (v0.1.17) — no code duplication. Falls back to structured escalation if no previous SHA is resolvable.
Unknown (class 6): escalates immediately with a structured incident report containing logs, exit codes, image SHA, restart count, and last 3 deploy references — never silent-fails.
Safety rails: hard hourly remediation cap per container (checked before any action); OOM memory cap guard (never bumps beyond 2× original); config error never auto-restores (always escalates with diff); dependency chain avoids masking a downstream incident; BYTEPORT_RUNBOOK_DRY_RUN=true logs all intended actions without executing; every action emits a structured audit event with failure class, action taken, and outcome.
26 unit tests in tests/container-restart-loop-breaker.test.ts: remediation history helpers (recordRemediation/recentRemediationCount — 4 cases: zero count, in-window, out-of-window, per-key isolation), classifyFailure (12 cases: OOM exit 137 / log text / metadata, image pull ErrImagePull/ImagePullBackOff, port EADDRINUSE, dependency ECONNREFUSED/no-route, config missing-env/secret-not-found, unknown, OOM priority, image-pull priority), classifyOrchestrator (5 cases: k8s platform/namespace, render, ecs, docker), parseMemoryMi (8 cases: Mi/Gi/M/G/plain/unavailable/empty/decimal), command builder assertions (buildFetchLogsCmd/buildFetchExitCodesCmd/buildRestartCountCmd/buildImageShaCmd/buildSecretCheckCmd/buildDependencyHealthCheckCmd/buildMemoryBumpCmd/buildGetMemoryLimitCmd/buildRolloutVerifyCmd — 9 cases), runbook identity (6 cases: id/handle container_restart_loop/handle crash_loop_backoff/reject disk_pressure/signalTypes/description), config defaults (4 cases: healthWaitSec/oomMemoryCapMultiplier/oomMemoryBumpPct/custom), plan step presence (8 cases: count=5/diagnose/classify/remediate/verify/audit_event/dry-run no cmd/live has cmd), dry-run execute (8 cases: mode/success/escalated/step count/all dry_run status/DRY RUN prefix/runbookId/totalBytesFreed), integration test against missing-env-var fixture (3 cases: classifies config_error/dry-run shape valid/config_error never auto-restores), registry integration (4 cases: find container_restart_loop/crash_loop_backoff/null for disk_pressure/asRunner).
Registered in src/runbooks/container/index.ts (new) and src/runbooks/index.ts and src/index.ts; exported as ContainerRestartLoopBreakerRunbook + ContainerRestartLoopBreakerConfig + RestartLoopFailureClass + OrchestratorEnv; data/runbooks.json entry added (shipped, category: containers, severity: critical); package.json version bumped to 0.1.19; CLAUDE.md updated.
MemoryPressureAutoRecoverRunbook closes the most common 2am page class still uncovered: runaway process consuming memory until OOM-killer fires or the host swaps to death. Covers Render, Kubernetes, and bare-metal/systemd with a single unified runbook.
Trigger: memory_pressure signal (memory >90% sustained) OR oom_kill signal. Three-phase decision before any restart action: (1) Diagnose — identify top 5 memory consumers via `ps aux --sort=-%mem` + current usage % via `free -m`; (2) Deploy correlation check — inspect journald/deploy log for releases within the last 60 min; if found, hand off to FailedDeployAutoRollbackRunbook instead of restarting; (3) Crash-loop guard — abort and escalate if the same service has been auto-restarted ≥2 times in the last 30 min, preventing the runbook from becoming the source of the outage.
Remediation (first applicable wins): Render — POST to Render Deploys API to force service restart; Kubernetes — `kubectl rollout restart deployment/ -n `; bare metal/VM — `systemctl restart ` with SIGTERM → grace period → SIGKILL fallback. Runtime environment is auto-detected from signal metadata (platform/namespace/renderServiceId fields) and env vars (RENDER_SERVICE_ID, KUBECONFIG, KUBERNETES_SERVICE_HOST).
Post-restart verification: re-samples memory % after a 3 s stabilization window and confirms it dropped below the configurable recoveryThresholdPct (default: 80%). If still elevated, escalates with full attribution table.
Escalation payload includes: full top-5 memory table, restart history and crash-loop count, recent deploy correlation evidence, suspected leak signature, and runtime environment classification — ready for copy-paste into incident response.
Safety rails: crash-loop guard (≥2 restarts in 30 min → stop); deploy hand-off guard (recent release → rollback runbook, not restart); BYTEPORT_RUNBOOK_DRY_RUN=true → logs all intended actions without executing; gracePeriodMs configurable (default 15 s); only restarts services with resolved runtime context — falls back to structured escalation when restart method cannot be determined.
Config interface: { gracePeriodMs, crashLoopMaxRestarts, crashLoopWindowMs, deployLookbackMinutes, renderServiceId, renderApiKey, renderApiBaseUrl, recoveryThresholdPct }. All fields optional with safe defaults. Exported as MemoryPressureAutoRecoverConfig.
Registered in src/runbooks/memory/index.ts (new) and src/runbooks/index.ts and src/index.ts; exported as MemoryPressureAutoRecoverRunbook + MemoryPressureAutoRecoverConfig + RuntimeEnv; data/runbooks.json entry added (shipped, category: infrastructure, severity: critical); package.json version bumped to 0.1.18; CLAUDE.md updated.
FailedDeployAutoRollbackRunbook closes the single most common 2am page in early-stage startups: a deploy that fails health checks now rolls back without a human. Render-native (dogfooded on byteport itself). Fly.io and Railway support planned for v0.1.18.
Trigger: deploy_failed OR health_check_failing signal within the deploy window. Three rollback gates — first match wins: (a) error rate >5× baseline in the 5-min post-deploy window, OR (b) health endpoint returns non-2xx for >2 consecutive minutes, OR (c) crash loop detected (signal metadata crashLoop=true). If no gate fires, runbook escalates with the full diagnostic bundle.
Execution flow: (1) diagnose — fetch last 5 deploy SHAs from Render API, confirm failing one is newest, extract last-green SHA from response if not already in signal metadata; (2) decide — evaluate the three rollback gates, emit structured decision log; (3) execute rollback — POST to Render Deploys API with commitId set to the last-green SHA and clearCache=false; (4) verify — poll health endpoint every 15 s for up to 90 s (configurable healthWaitSec); (5) postmortem stub — write structured JSON log with failing SHA, rolled-back-to SHA, gate reason, time-to-recovery, and diagnostic snapshot.
Safety rails: rollbackEnabled defaults to true; set false to always escalate instead. Rollback only executes when a gate fires AND a previous SHA is resolvable — no SHA means escalate, not guess. BYTEPORT_RUNBOOK_DRY_RUN=true mode logs all intended actions without any API calls. If the Render rollback API call fails, runbook escalates immediately with the full failure output. If health is still degraded after healthWaitSec, runbook escalates rather than looping. 24h dedup by service ID via RunbookCooldown.
Config interface: { serviceId, apiKey, healthUrl, healthWaitSec, renderApiBaseUrl, rollbackEnabled }. All fields optional with env var fallbacks (RENDER_SERVICE_ID, RENDER_API_KEY, BYTEPORT_HEALTH_URL). Exported as FailedDeployAutoRollbackConfig.
Registered in src/runbooks/deploy/index.ts (new) and src/runbooks/index.ts and src/index.ts; exported as FailedDeployAutoRollbackRunbook + FailedDeployAutoRollbackConfig + shouldRollback; data/runbooks.json entry added (shipped, category: deployment, severity: critical); package.json version bumped to 0.1.17; CLAUDE.md updated.
DiskPressureAutoCleanupRunbook closes the disk_pressure remediation loop: when a disk_pressure signal fires (from Datadog, CloudWatch, Grafana Alerting, or PagerDuty), the runbook diagnoses current usage and top space consumers, then executes ordered cleanup steps — stopping as soon as usage drops below the configurable free-space floor (default: 80%). No manual SSH required for the majority of disk incidents.
Remediation order (each step followed by a disk-usage check-point): (1) diagnose — df for current %, du on log roots for top consumers; (2) rotate + gzip logs older than N days (default: 7d) in /var/log with configurable allowlist; (3) journalctl --vacuum-size=200M (configurable); (4) remove /tmp and /var/tmp files older than N hours (default: 24h); (5) docker system prune -af (opt-in: dockerPruneEnabled=true); (6) apt-get clean / yum clean all / dnf clean all (opt-in: packageCachePruneEnabled=true). Stop-on-floor: after each step the runbook re-reads disk % and exits early if the floor is restored.
Safety rails: hard safe-list protects /var/log/lastlog, /var/log/wtmp, /var/log/btmp, /var/log/audit/audit.log, and active access/error/auth log files — never touched regardless of age. Docker and package cache prune are both opt-in (default false) so the runbook is safe to deploy without elevated permissions. BYTEPORT_RUNBOOK_DRY_RUN=true logs all intended actions without executing any writes. 24h per-host cooldown enforced by RunbookCooldown.
Escalation path: if all configured safe steps are exhausted and disk is still at or above the floor, the runbook emits a structured escalation with per-step bytes reclaimed, total freed, and residual top-consumer paths from du output — ready for copy-paste into incident response.
Config interface: { freeSpaceFloorPct, journalVacuumSize, logRotationAgeDays, logRotationAllowlist, tmpMaxAgeHours, dockerPruneEnabled, packageCachePruneEnabled }. All fields optional with safe defaults. Exported as DiskPressureAutoCleanupConfig.
Registered in src/runbooks/disk/index.ts (new) and src/runbooks/index.ts and src/index.ts; exported as DiskPressureAutoCleanupRunbook + DiskPressureAutoCleanupConfig; data/runbooks.json entry added (shipped, category: infrastructure, severity: critical); package.json version bumped to 0.1.16; CLAUDE.md updated.
SslCertAutoRotateRunbook closes the detection loop: when an ssl_cert_expiring signal fires, the runbook diagnoses the cert management toolchain (cert-manager / Caddy / Traefik / raw ACME / IaC), executes rotation, then verifies via TLS handshake — no human paging required for 80% of cert incidents.
Execution flow: (1) diagnose cert source from signal metadata + K8s CRD heuristics; (2) for raw-ACME paths, dry-run against Let's Encrypt staging endpoint to confirm DNS-01 or HTTP-01 challenge will succeed — abort and escalate with diagnostic if staging fails; (3) backup current cert + key to /var/lib/byteport/cert-backups/-/ before any write; (4) rotate — cert-manager: kubectl annotate certificate with cert-manager.io/force-renew + wait Ready; Caddy/Traefik: systemctl reload to trigger internal ACME; certbot: certbot renew --force-renewal; IaC (Terraform/Pulumi): open PR via GitHub adapter, do NOT auto-apply; (5) TLS-handshake verify 3× with 30 s backoff, parse new notAfter to confirm ≥30 days validity; (6) Slack success message via NotifierRegistry: '✅ Rotated cert for . New expiry: . Took .'; (7) on any failure, escalate to PagerDuty via existing bidirectional adapter with full diagnostic context.
Safety rails: hard rate limit — max 1 rotation per domain per 24 h (in-process map, reset on restart); BYTEPORT_RUNBOOK_DRY_RUN=true mode logs all intended actions and performs no writes; backups retained 7 days with automatic prune on next run; IaC-managed certs never auto-applied — PR-only path always escalates to a human; staging validation gate prevents rotation when ACME challenge misconfigured.
Severity escalation by days remaining: ≤48 h → P1, ≤7 d → P2, ≤14 d → P3. P-level embedded in escalation reason and Slack notification.
20 unit tests in tests/ssl-cert-auto-rotate.test.ts: detectCertSource (12 cases — explicit field, inference from certManagerCertificate/acmArn/caddyAdminUrl/traefikEndpoint/terraformWorkspace/pulumiStack/certbotConfigDir/acmeshHome, unknown fallback), detectAcmeChallenge (7 cases — dns-01/http-01 explicit variants, dnsChallengeProvider/route53ZoneId/cloudflareZoneId inference, default http-01), computePriority (P1/P2/P3 boundary cases), rate-limit helpers (cooldown/record/clear/cross-domain isolation), plan step coverage per source (cert-manager/raw-acme/unknown/iac/caddy/traefik, dry-run/live command presence), dry-run execute (cert-manager/raw-acme/iac/unknown), live IaC escalation (PR instruction, P-level, Pulumi stack), live rate-limit enforcement, registry integration (find/asRunner loop).
Registered in src/runbooks/ssl/index.ts and src/runbooks/index.ts; exported from src/index.ts as SslCertAutoRotateRunbook; data/runbooks.json entry added; data/changelog.json v0.1.15 entry.
GrafanaAlertingAdapter ships two ingestion modes in a single adapter: (1) webhook receiver — operator configures a Grafana contact point of type webhook pointing at /webhooks/grafana; adapter parses Grafana's standard alert payload (alerts[].labels, alerts[].annotations, status, startsAt, endsAt) with optional HMAC-SHA256 signature verification via GRAFANA_WEBHOOK_SECRET; (2) polling fallback — GET /api/alertmanager/grafana/api/v2/alerts against Grafana Cloud or self-hosted every cycle, with Bearer token auth via GRAFANA_API_TOKEN and configurable grafanaUrl.
Signal routing: built-in label rules match alertname (DiskSpaceLow→disk_pressure, HighMemoryUsage→memory_pressure, PodCrashLooping→pod_crashloop, CertificateExpiring→ssl_cert_expiring, HighErrorRate→application_error.error_rate_spike) and resource label values (disk/memory/pod/cert/error/cpu/db/deploy). Operator labelMap overrides take highest priority — any key=value label pair can be mapped to a custom signal type. severity=critical with no specific match falls back to generic.critical_alert.
Deduplication keyed on Grafana's fingerprint field (stable per-alertname + label-set hash). Won't re-fire while status is firing; re-fires automatically when an alert transitions resolved → firing so no manual reset is needed. emitResolved option (default: false) emits value=0 info signals for resolved alerts.
HMAC-SHA256 verification: Grafana sends the hex HMAC digest in X-Grafana-Signature. When webhookSecret is configured the adapter rejects payloads whose signature doesn't match — returns { ok: false, reason: 'HMAC signature mismatch' } without logging the body. No secret = verification skipped (useful for dev/internal clusters).
18 unit tests in tests/grafana-alerting.test.ts: webhook parse, HMAC reject, HMAC accept, polling fetch(), 7 signal type mapping cases (DiskSpaceLow/HighMemoryUsage/PodCrashLooping/CertificateExpiring/HighErrorRate/resource=disk/severity=critical fallback), dedup, dedup-reset on resolved→firing, labelMap override (2 cases), emitResolved true/false, polling 500 tolerance, env-var factory, verifyGrafanaSignature helper (3 cases), and webhook-only mode. Helm grafanaAlerting block added; adapter added to package.json exports and index.ts; /integrations page updated from 8 to 9 integrations live.
DatadogMonitorsAdapter polls GET /api/v1/monitor?group_states=alert,warn,no data every cycle. Site-aware auth: DD_API_KEY + DD_APP_KEY work against datadoghq.com, datadoghq.eu, us3.datadoghq.com, us5.datadoghq.com, and ap1.datadoghq.com via a configurable site option.
Signal type routing: (1) custom tagMap entries take highest priority; (2) built-in tag rules map resource:disk→disk_pressure, resource:memory→memory_pressure, resource:cpu→host.cpu_high, resource:ssl/cert→ssl_cert_expiring, resource:db→database.disk_pressure, resource:pod→pod_crashloop, resource:deploy→deploy_failure; (3) monitor type fallback: service check→service_unavailable, log alert→application_error.error_rate_spike, composite→dominant child type.
Composite monitor handling: inspects child monitor IDs, classifies each by its own tag rules, and returns the signal type with the highest count across children — prevents composite alerts from collapsing to a generic fallback.
Idempotency keyed on monitor_id + group (e.g. host:web-01) so a 10-host CPU alert emits 10 independent, deduplicated signals rather than collapsing to one. Per-group last_triggered_ts used as Signal timestamp.
postRemediationEvent(): after a successful runbook execution, posts POST /api/v1/events with signal_type/runbook tags and a link to the BytePort run log — gives Datadog users a native audit trail in the event stream without modifying dashboards.
19 unit tests in tests/datadog-monitors.test.ts covering: disk/memory/service-check/log-alert/composite signal mapping, EU site URL construction, custom tagMap precedence, dedup across fetch cycles, per-group signal emission, postRemediationEvent success and failure, missing creds validation, HTTP 500 error propagation, OK-state exclusion, No-Data severity, factory env-var reads, and dedupKey helper. Helm datadogMonitors block added; adapter list updated.
PagerDutyAdapter polls GET /incidents?statuses[]=triggered&statuses[]=acknowledged every cycle. Maps PD incident urgency + service name to BytePort canonical signal types (database.disk_pressure, pod_crashloop, ssl_cert_expiring, deploy_failure, application_error.error_rate_spike, host.cpu_high/disk_pressure/memory_pressure, service_unavailable). Idempotency keyed on PD incident_id across poll cycles.
Auto-resolve loop: after a successful runbook execution the adapter calls PUT /incidents/{id} with status=resolved and a resolution note — closes the PD incident autonomously without human intervention. Required API token scopes: incidents:read + incidents:write.
PagerDutyEventsNotifier implements the Notifier interface and fires Events API v2 (POST events.pagerduty.com/v2/enqueue) on escalation.required (trigger) and remediation.succeeded (resolve). Integration-key auth — no OAuth, no token rotation.
Dedup key = byteport/// — re-escalation updates the existing PD incident rather than spawning a duplicate. custom_details payload includes signal type, source adapter, runbook attempted, failure reason, and link to BytePort run log.
minSeverity filtering (default: warning) and exponential backoff retry on 429/5xx (max 3 attempts, 500ms base). Never blocks the remediation pipeline — always returns NotifyResult.
19 unit tests across tests/pagerduty-adapter.test.ts and tests/pagerduty-notifier.test.ts: incident polling + dedup (3), urgency→severity mapping (2), signal type heuristics (4), fallback type (1), auto-resolve (2), API error tolerance (1), pagination (1), notifier event actions (3), dedup-key collision (1), minSeverity filter (1), retry exhaustion (1), 5xx retry success (1), network error (1), custom_details payload (1).
Helm values.yaml: new pagerduty adapter block (apiTokenSecretRef, serviceIds, subdomain, fromEmail) and notifiers.pagerduty block (routingKeySecretRef, runLogBaseUrl, minSeverity). Chart.yaml appVersion bumped to 0.1.12.
SlackWebhookNotifier sends Block Kit-formatted messages for three event types: remediation.succeeded (green check, before/after metric, duration), remediation.failed (red X, error, links to /runbooks and /changelog), and escalation.required (yellow warning, suggested next steps, optional mention on critical)
Config: { webhookUrl, channelOverride?, mentionOnCritical?, minSeverity? }. Webhook URL validated at construction; minSeverity filters events below threshold without throwing
Retry with exponential backoff on 429/5xx (max 3 attempts, 500ms base). Failure never blocks the remediation pipeline — log and continue via NotifierRegistry
NotifierRegistry fans out to all configured notifiers in parallel; per-channel results aggregated and surfaced via optional onError callback
RunbookRegistry.asRunner() accepts an optional NotifierRegistry; dispatches NotificationEvents fire-and-forget after every runbook completion
10 unit tests in tests/slack-webhook.test.ts covering all three event types, minSeverity filtering, 429 retry exhaustion, 5xx retry success, URL validation, channelOverride, and multi-notifier fan-out isolation
CloudWatchAlarmsAdapter ships in two modes: DescribeAlarms polling (IAM: cloudwatch:DescribeAlarms) and SNS → HTTPS webhook receiver for push-based ingestion without polling
13 canonical signal mappings out of the box: EC2 CPUUtilization → host.cpu_high, RDS FreeStorageSpace → database.disk_pressure, Lambda Errors → application_error.error_rate_spike, ALB HTTPCode_Target_5XX_Count → service.elevated_5xx, and more
Idempotency keyed on AlarmArn + StateUpdatedTimestamp prevents duplicate signals across poll cycles and re-delivered SNS notifications
OK → ALARM → OK transition tracking with emitResolved support; configurable signalMap override; heuristic fallback for unknown namespace+metric; 20+ unit tests in tests/cloudwatch-alarms.test.ts