Every entry below is a shipped, production-grade runbook — not a recommendation.
BytePort detects the signal, classifies root cause, applies the correct fix,
verifies recovery, and only escalates when a human decision is required.
Detects CrashLoopBackOff and pod restart-loop incidents (10th pager class). 5-branch classifier: config_error_missing_env, image_pull_failure, liveness_probe_misconfigured, oom_at_startup, dependency_not_ready. Autonomous actions: rollback bad image tag, bump memory limit 1.5×, loosen probe timing, hold pod and wait for dependency healthcheck. Escalates on unknown classification or when guardrail ceilings are hit.
Detects OOMKilled pods and pod memory pressure across 6 signals, classifies root cause across 5 classes (genuine_memory_leak, undersized_limit, traffic_spike, node_pressure, runtime_misconfig), and autonomously bumps memory limits within safe bounds, triggers HPA scale-out, cordons/drains memory-pressured nodes, or surfaces JVM/Go/Node heap misconfig as a PR proposal. Max 1 limit bump per pod per 24h. Never auto-bumps on genuine memory leaks — escalates with heap profile instead.
Detect: kubectl exit reason OOMKilled; NodeCondition MemoryPressure; JVM/Go OOM markers in previous logs; metrics-server working-set vs limit
Diagnose: kubectl describe pod (restart count, memory limit/request, exit reason), HPA discovery, recent deploy timestamp (<30min = deploy regression candidate), runtime hints (JVM -Xmx from JAVA_TOOL_OPTIONS, Go GOMEMLIMIT, Node --max-old-space-size from env), Prometheus series for trend and RPS correlation (BYTEPORT_PROMETHEUS_URL)
Classify (first match, confidence ≥0.70): runtime_misconfig (heap config > container limit, conf=0.93 → PR proposal); node_pressure (NodeCondition MemoryPressure + pod healthy <70%, conf=0.88 → cordon+drain); genuine_memory_leak (rising RSS across 3+ restarts, no traffic correlation, conf=0.85 → escalate, never auto-bump); traffic_spike (RSS correlates with RPS, conf=0.86 → HPA then limit bump); undersized_limit (RSS plateau ≥85% of limit, no trend, conf=0.84 → bump limit)
Remediate: undersized_limit → kubectl patch Deployment/StatefulSet memory limit (≤2× current, ≤80% of node allocatable, 1 bump/pod/24h); traffic_spike → annotate HPA to trigger scale-out, fallback to limit bump; node_pressure → kubectl cordon then kubectl drain (--ignore-daemonsets, PDB-aware); runtime_misconfig → generate -Xmx/GOMEMLIMIT/NODE_OPTIONS fix as PR description (never auto-applied); genuine_memory_leak → emit escalation bundle, open ticket
Verify: pod reaches Ready; working-set < 80% of new limit; 0 OOMKill events in last 5 min; no cascading evictions on same node; retry once at 15s; escalate on double-fail
Detect: poll NodeCondition transitions via kubectl; pull kubelet systemd status (systemctl is-active/is-failed); read recent kernel messages (journalctl -k) for OOM/hardware errors
Diagnose: kubectl describe node (conditions, allocatable vs. capacity, taints, events), journalctl -u kubelet --since '30 min ago' (last 200 lines), container runtime status (systemctl status containerd/crio), inode exhaustion (df -i), fd exhaustion (/proc/sys/fs/file-nr), image disk usage (crictl/docker images)
Classify (first match, confidence ≥0.70): hardware_cloud_failure (kernel OOM/MCE/panic, conf=0.94, always escalate); network_plugin_failure (NetworkUnavailable + CNI errors, conf=0.88); runtime_crash (containerd/crio unit failed, conf=0.91); resource_pressure_disk (DiskPressure condition or disk ≥85% or inode_free ≤5%, conf=0.90); resource_pressure_mem_pid (MemoryPressure/PIDPressure condition, conf=0.88); transient_kubelet_hang (inactive but not failed, restart_count ≤2, conf=0.82); unknown (<0.70 confidence, always escalate)
Remediate — cordon node first, then per class: transient_kubelet_hang → systemctl restart kubelet; runtime_crash → drain → systemctl restart containerd/crio then kubelet; resource_pressure_disk → drain → crictl/docker image prune + delete evicted pods + restart kubelet; resource_pressure_mem_pid → drain → restart kubelet (node replacement if memory ≥95%); network_plugin_failure → delete CNI daemonset pod on node (flannel/calico/cilium/weave); hardware_cloud_failure → escalate with full bundle, cloud-provider node replacement if BYTEPORT_AUTO_REPLACE_NODES=true
Verify: node returns to Ready within 5 min window; no new eviction events for 5 min; no PDB violations; pods rescheduled and Running/Ready; retry once at 15s; escalate on double-fail
Detects TLS certificate expiry across 6 signals (cert_expiring_soon at 30/14/7/1-day thresholds, cert_expired, cert_chain_invalid, ocsp_stapling_failure, acme_challenge_failure, renewal_rate_limit_hit), classifies the renewal approach across 7 classes (acme_http01, acme_dns01, cert_manager, acme_rate_limited, cloud_managed, commercial_ca, internal_ca), and autonomously renews via certbot, cert-manager CRD delete-and-reissue, or cloud API — then hot-reloads nginx/envoy/haproxy/traefik/k8s-ingress and verifies chain validity plus OCSP staple. Escalates immediately for pinned certs, wildcard certs without a configured DNS-01 provider, rate-limited domains, and commercial/internal CAs.
Detects failed Kubernetes deployments across 5 signals (deploy_health_regression / readiness_probe_failure_post_deploy / deploy_5xx_spike / canary_slo_burn / rollout_stuck), classifies root cause with a false-attribution guard, and autonomously rolls back to the last-known-good revision with a hard guardrail of max 1 rollback/deployment/hour. If the rollback itself fails, escalates immediately with a full diagnostic bundle including the suspect commit SHA.
Diagnose: kubectl rollout history to enumerate revisions, identify current and last-good revision, extract commit SHA from CHANGE-CAUSE annotation, count ready vs. total pods (kubectl get deployment -o json), discover downstream Services via label-selector overlap, detect false attribution (regression predating rollout by >5 min in Warning events)
Classify (first match, confidence ≥0.70): do_not_rollback (regression predates deploy — false attribution, conf=0.85); rollout_stuck (progressDeadlineExceeded + 0 ready pods, conf=0.91); readiness_never_ready (readiness probe failures on new ReplicaSet, conf=0.88); deploy_5xx_spike (HTTP 5xx on new revision, conf=0.90); canary_slo_burn (canary burns SLO budget faster than baseline, conf=0.87); deploy_health_regression (error rate or p95 latency spike post-rollout, conf=0.86). Falls back to escalate when no rollback target can be safely determined.
Plan: emit rollback plan (kubectl rollout undo --to-revision=, expected pod churn, estimated duration, rollback-fails-then-escalate tripwire); post plan to ops channel in dry_run mode
Execute: kubectl rollout undo deployment/ -n --to-revision=; watch ReplicaSet ready count; verify failure signal clears within 5-minute window; if rollback kubectl exits non-zero → escalate immediately (no retry)
Verify: deployment Available=True + Progressing=True, readyReplicas == spec.replicas; retry once at 60s if first check fails; escalate on double-fail with full diagnostic bundle
Detects database connection pool exhaustion across Postgres, MySQL, and pgbouncer: classify root cause across 6 classes (leak_after_deploy / idle_in_tx_backlog / long_running_query / undersized_pool / missing_pgbouncer / runaway_cron_fanout), terminate idle-in-transaction sessions and long-running queries within hard safety caps, reload pgbouncer with raised pool size, trigger rolling restart of leaking deployments, and verify utilization drops below 70% with a structured postmortem including query fingerprints and permanent fix recommendation.
Diagnose: query pg_stat_activity for full state breakdown (active / idle / idle-in-tx / waiting), capture top wait_events and long-running queries, pull pgbouncer SHOW POOLS + SHOW STATS, correlate connections_trend with last deploy SHA, inspect env vars for Hikari/SQLAlchemy/pgx pool config
Classify (first match, confidence ≥0.70): leak_after_deploy (deploy marker + rising idle-in-tx trend, conf=0.88); idle_in_tx_backlog (≥3 sessions older than threshold, conf=0.91); long_running_query (≥2 queries over budget or 1 over 10min, conf=0.85); missing_pgbouncer (pgbouncer_waiting>0, conf=0.80); runaway_cron_fanout (single app_name holds ≥15 connections, conf=0.82); undersized_pool (utilization ≥85% with no dominant cause, conf=0.75)
idle_in_tx_backlog: terminate eligible idle-in-transaction sessions via pg_terminate_backend, capped at kills_per_minute_cap (default 20/min), skipping pgbouncer/admin-proxy/superuser connections
long_running_query: cancel over-budget queries via pg_cancel_backend (graceful), fall back to pg_terminate_backend if cancel fails, capture query text for EXPLAIN postmortem
missing_pgbouncer / undersized_pool: send RELOAD to pgbouncer admin console to pick up raised default_pool_size from pgbouncer.ini
leak_after_deploy: terminate old idle-in-tx sessions first, then trigger kubectl rollout restart on the offending deployment to drain all leaked connection handles
Safety guardrails
Hard cap: kills_per_minute_cap (default 20) backend terminations per minute — rejects excess over cap and escalates
Critical app allowlist: never terminates pgbouncer, admin-proxy, pgadmin, or superuser connections under any circumstance
Superuser check: every connection is checked for is_superuser=true before termination — superusers are always skipped
Detects Kubernetes pods in CrashLoopBackOff and container restart loops, classifies root cause across 7 failure classes (oom_at_startup / bad_image_or_tag / failed_init_container / probe_misconfigured / missing_configmap_or_secret / app_panic_on_boot / node_pressure_eviction), and autonomously remediates within hard guardrails — max 1 rollback/service/hour, max 1 resource bump/pod/day, never delete PersistentVolumes.
Detects OOM-kill events and sustained memory pressure (≥90% RSS), classifies root cause across 5 classes (memory_leak / traffic_spike / undersized_limit / noisy_neighbor / runtime_misconfig), and autonomously remediates within hard guardrails — max 3 restarts/hour, max 2× limit bump, never touch stateful workloads without allowlist.
Diagnose in parallel: cgroup memory.stat, /proc/meminfo, top-N processes by RSS, container memory.limit vs. usage, recent OOM-kill dmesg entries, k8s events, JVM jcmd GC.heap_info, Node.js heap snapshot endpoint
Classify root cause (first match): memory_leak (RSS climb >2h, no traffic correlation), traffic_spike (RSS climb correlates with rps), undersized_limit (p99 RSS >80% of limit, steady-state), noisy_neighbor (node-level pressure, victim well-behaved), runtime_misconfig (JVM/Node heap > container limit)
memory_leak: capture heap snapshot via jcmd GC.heap_dump or SIGUSR1 for artifact; restart with exponential backoff (30s × 2^N, cap 300s); kubectl rollout restart for k8s, systemctl restart for bare-metal
traffic_spike: if HPA available, annotate deployment to trigger scale-out; else patch memory.limit upward (max 2× current, max policy-allowed); dry-run=server validation before applying
undersized_limit: propose limit increase via PR to manifest repo with suggested value (p99 usage × 1.5); never auto-apply
Detects Kubernetes pods in crashloop and classifies the exit code (OOMKilled, missing binary, image auth, config error) before safely restarting with back-off guard.
Detects pods killed by the Kubernetes OOM killer (exit code 137) and safely patches memory limits upward, restarts the pod, and enables debug logging to capture future leak patterns.
Detects Kubernetes pods failing to pull their container image (ErrImagePull, ImagePullBackOff) and surfaces the root cause — auth failure, bad tag, or network misconfiguration — with specific remediation steps.
Detects Kubernetes pods stuck in Pending state due to unschedulable conditions — resource exhaustion, storage constraints, affinity conflicts, or missing dependencies — and applies emergency scheduling workarounds where viable.
Detects Kubernetes nodes entering NotReady, MemoryPressure, DiskPressure, or PIDPressure state and coordinates a safe node drain — cordoning the node, allowing graceful pod termination, and restoring schedulability once the node recovers.
Ingest failed deployment webhooks from CI systems (GitHub Actions, ArgoCD, Flux, Spinnaker) and automatically rolls back to the last healthy revision — with safety gates and commit-author notification.
Detects sustained CPU saturation on Linux hosts (>85% for 90 continuous seconds) and safely restarts non-critical high-CPU processes — with oom_score_adj tuning as an intermediate step before full restart.
End-to-end autonomous disk cleanup: diagnoses current disk usage and top space consumers, then runs ordered remediation steps — log rotation, journald vacuum, /tmp sweep, optional Docker prune, optional package cache prune — stopping as soon as usage drops below the configurable free-space floor (default 80%). Escalates with a per-step bytes-reclaimed summary and residual top-consumer paths if the floor is not restored after all safe steps.
Diagnose: read mount usage % and top 10 space consumers via du on known-safe roots
Rotate + gzip logs older than N days (default: 7d) in /var/log — configurable allowlist; hard safe-list protects /var/log/lastlog, /var/log/wtmp, active audit logs
journalctl --vacuum-size=200M (configurable) to trim systemd journal
Clear /tmp and /var/tmp files older than N hours (default: 24h)
Docker system prune -af (opt-in: dockerPruneEnabled=true) — prunes dangling images and build cache if Docker socket present
apt-get clean / yum clean all / dnf clean all (opt-in: packageCachePruneEnabled=true) — clears OS package manager caches
Safety guardrails
Hard safe-list: /var/log/lastlog, /var/log/wtmp, /var/log/btmp, /var/log/audit/audit.log, active access.log/error.log/auth.log — never touched
Docker prune disabled by default (dockerPruneEnabled=false) — operator must explicitly opt in
Package cache prune disabled by default (packageCachePruneEnabled=false) — operator must explicitly opt in
Detects host memory saturation (>90% for 90 seconds) and distinguishes page-cache pressure from actual memory exhaustion before safely restarting leaking processes.
Detects SSL/TLS certificates expiring within 14 days (warning) or 7 days (critical) and triggers automated renewal via cert-manager, AWS ACM, or certbot — with pre/post snapshots and DNS validation checks.
Detects when PostgreSQL or MySQL active connections exceed 80% of max_connections and identifies connection leaks (idle-in-transaction > 30s) vs. load-driven saturation — terminating leak backends and suggesting PgBouncer configuration.
Query pg_stat_activity to identify idle_in_transaction sessions held >30s — treated as leaks if >50% of idle connections
For connection leaks: identify backend PID with longest idle-in-transaction duration, capture query text, then terminate via pg_terminate_backend(PID) only if byteport.io/auto_remediate: "db_terminate" annotation present
For load-driven saturation: identify top 3 longest-running queries via pg_stat_activity.duration, log PIDs and query text, suggest PgBouncer/PgCat via escalation
Pre-termination audit: capture current connection count, top wait events from pg_stat_activity.wait_event, and application_name
Post-fix verification: waits 30s then confirms active connections dropped below 70%
Safety guardrails
Requires byteport.io/auto_remediate: "db_terminate" annotation for backend termination — never kills without this
Never terminates a connection with an uncommitted transaction that has row locks (checks pg_blocking_pids)
Detects PostgreSQL replica lag exceeding 30 seconds for more than 60 seconds — distinguishing between network latency, disk I/O saturation, and long-running queries on the primary — and terminates blocking queries to resume replication.
Detects failed deploy workflows via the GitHub Actions API — classifies the failure (test_failure, build_failure, deploy_step_failure, infra_timeout), fetches failed job logs, triggers rollback workflow_dispatch when a previous successful run exists and rollback is enabled, or escalates with a structured summary.
Check if a previous successful run exists — if yes and rollbackEnabled: true, dispatch rollbackWorkflowFile via workflow_dispatch with inputs triggered_by=byteport-agent and failed_run_id
Dedup by run_id — each GitHub Actions run ID is emitted at most once per agent session
If rollback not enabled or no previous success: escalate with structured summary including failure class, branch, actor, run URL
Safety guardrails
Rollback is opt-in: rollbackEnabled must be explicitly set to true — default is false (detect + escalate only)
Requires Workflows: write permission on the GitHub token for rollback dispatch; Actions: read + Contents: read for detection only
Dry-run mode: all steps return dry_run status with descriptions of what would execute
Classifies application errors from Sentry by stack-trace heuristics (db_connection, oom, deploy_correlation, upstream_5xx, unknown), auto-rolls back regressions correlated with a recent deploy SHA, scales up for OOM patterns, and escalates novel errors with a structured summary.
Detects a bad deploy via health-check failure, error-rate spike, or crash-loop signal; automatically rolls back to the last green SHA on Render; verifies health green within 90 s; writes a structured postmortem stub. No human required for the majority of 2am deploy incidents.
Diagnose: fetch last 5 deploy SHAs from Render API, confirm failing one is newest
Decide: apply rollback if (a) error rate >5× baseline in 5-min post-deploy window, OR (b) health endpoint non-2xx for >2 consecutive minutes, OR (c) crash loop detected — otherwise escalate with diagnostics
Execute rollback: call Render API to re-deploy the previous (last-green) SHA
Verify: poll health endpoint every 15 s for up to 90 s post-rollback
Detects runaway memory consumers and autonomously restarts the offending service with deploy-correlation hand-off and crash-loop guard. Works across Render, Kubernetes, and bare-metal/systemd.
Breaks container restart loops by diagnosing root cause from logs, exit codes, and image SHA — then applying the correct class-specific fix: escalate on config/secret errors, chain on dependency failures, bump memory on OOM, rollback on port-bind or image-pull failures.
Autonomous DB connection pool exhaustion recovery for Postgres, MySQL, and MongoDB: classify idle-in-transaction leak / long-running query / app-side leak / traffic spike / undersized pool, terminate connections within per-minute safety caps, raise pool ceiling within 20% headroom, re-poll utilization at 30s/90s/180s, file structured postmortem.
Autonomous full disk-space reclaim: parallel diagnose (df/inode/du/docker/journald), rotate+compress logs (logrotate+gzip>100MB+journal vacuum 7d), evict container layers (docker system prune or crictl rmi --prune on K8s nodes), drop expired tmp+apt/yum/pip/npm caches, and optionally grow cloud-managed volumes (EBS ModifyVolume / GCP disks.resize +25%, capped at policy.maxVolumeGb). Stops as soon as disk usage drops below 80%. Escalates with +N GB recommendation based on current excess and top-5 largest dirs if usage still >85%.
End-to-end autonomous TLS certificate expiry recovery: diagnose served cert via openssl s_client (issuer, notAfter, SANs, chain), classify renewal path (acme_renewable / cert_manager_managed / commercial_ca_manual / internal_ca / wildcard_dns01_required), renew via certbot/acme.sh (http-01 or dns-01 via Route53/Cloudflare/GCP) or cert-manager force-renew, hot-reload the serving process (nginx/caddy/envoy/haproxy/traefik/k8s ingress), then verify new notAfter > now + 30d + chain valid + handshake succeeds. Backs up cert+key before every swap; rolls back and escalates on verification failure. Refuses if DNS zone not owned.
Diagnose: openssl s_client -connect DOMAIN:443 -servername DOMAIN -status -showcerts → parse notAfter, issuer O=/CN=, SAN list, OCSP staple, chain validity; locate cert files on disk + identify server process (nginx/caddy/envoy/haproxy/traefik/k8s/unknown)
Classify (first match): cert_manager_managed if k8s env or --cert-name given; wildcard_dns01_required if *.domain with supported DNS provider; acme_renewable if Let's Encrypt / ZeroSSL issuer or cert in certbot live dir; internal_ca or commercial_ca_manual → escalate with full context
Backup: copy cert + privkey to /var/lib/byteport/cert-backups/-/ before any write
Renew acme_renewable: certbot renew --cert-name DOMAIN (http-01 default) or with --dns-PROVIDER for dns-01; acme.sh --renew as fallback
Renew wildcard dns-01: certbot/acme.sh with --dns-route53 / --dns-cloudflare / --dns-gcp flag