Runbooks / Catalog

BytePort autonomously remediates 27+ incident classes

Every entry below is a shipped, production-grade runbook — not a recommendation. BytePort detects the signal, classifies root cause, applies the correct fix, verifies recovery, and only escalates when a human decision is required.

27
GA runbooks
3
In beta
9
Incident classes
15
Signal adapters
Incident class
Adapter
kubernetes

crashloopbackoff_v2

GA v0.1.30

Detects CrashLoopBackOff and pod restart-loop incidents (10th pager class). 5-branch classifier: config_error_missing_env, image_pull_failure, liveness_probe_misconfigured, oom_at_startup, dependency_not_ready. Autonomous actions: rollback bad image tag, bump memory limit 1.5×, loosen probe timing, hold pod and wait for dependency healthcheck. Escalates on unknown classification or when guardrail ceilings are hit.

  1. Detect: pod status.containerStatuses[].state.waiting.reason == CrashLoopBackOff; restart count ≥3 (warn), ≥10 (high), ≥30 (critical); exit code 137 for pod_not_ready_oom; Unhealthy events for probe failures
  2. Diagnose: kubectl logs --previous (last 2000 chars), kubectl describe pod events (ImagePullBackOff, configmap/secret not found, Unhealthy), liveness/readiness probe spec (initialDelaySeconds, failureThreshold), ConfigMap/Secret existence, node MemoryPressure condition
  3. Classify (first-match, confidence ≥0.70): oom_at_startup (exit 137, conf=0.93 → bump memory 1.5×); image_pull_failure (ImagePullBackOff in events, conf=0.94 → rollback); config_error_missing_env (ConfigMap/Secret not found, conf=0.91 → audit + surface); liveness_probe_misconfigured (probe too tight + healthy startup logs, conf=0.86 → loosen probe); dependency_not_ready (exit 0/1 + connection-refused patterns, conf=0.88 → hold + healthcheck)
  4. Remediate: oom_at_startup → kubectl patch Deployment memory limit 1.5× (ceiling 2×, max 1/pod/24h); image_pull_failure → kubectl rollout undo Deployment (max 1/service/hour); config_error_missing_env → kubectl get configmaps/secrets audit, surface missing names (never auto-creates Secrets); liveness_probe_misconfigured → kubectl patch Deployment livenessProbe.initialDelaySeconds=max(current×2,30) failureThreshold=5; dependency_not_ready → kubectl annotate pod with dependency-hold-since, probe dependency endpoints
  5. Verify: pod reaches Ready within 5-min stabilisation window; restart count must not increment; no Unhealthy probe events
  6. Escalate: unknown classification after 2 passes; memory ceiling hit; dependency healthcheck timeout; rollback rate limit reached
kubernetes

oomkilled_memory_pressure

GA v0.1.29

Detects OOMKilled pods and pod memory pressure across 6 signals, classifies root cause across 5 classes (genuine_memory_leak, undersized_limit, traffic_spike, node_pressure, runtime_misconfig), and autonomously bumps memory limits within safe bounds, triggers HPA scale-out, cordons/drains memory-pressured nodes, or surfaces JVM/Go/Node heap misconfig as a PR proposal. Max 1 limit bump per pod per 24h. Never auto-bumps on genuine memory leaks — escalates with heap profile instead.

  1. Detect: kubectl exit reason OOMKilled; NodeCondition MemoryPressure; JVM/Go OOM markers in previous logs; metrics-server working-set vs limit
  2. Diagnose: kubectl describe pod (restart count, memory limit/request, exit reason), HPA discovery, recent deploy timestamp (<30min = deploy regression candidate), runtime hints (JVM -Xmx from JAVA_TOOL_OPTIONS, Go GOMEMLIMIT, Node --max-old-space-size from env), Prometheus series for trend and RPS correlation (BYTEPORT_PROMETHEUS_URL)
  3. Classify (first match, confidence ≥0.70): runtime_misconfig (heap config > container limit, conf=0.93 → PR proposal); node_pressure (NodeCondition MemoryPressure + pod healthy <70%, conf=0.88 → cordon+drain); genuine_memory_leak (rising RSS across 3+ restarts, no traffic correlation, conf=0.85 → escalate, never auto-bump); traffic_spike (RSS correlates with RPS, conf=0.86 → HPA then limit bump); undersized_limit (RSS plateau ≥85% of limit, no trend, conf=0.84 → bump limit)
  4. Remediate: undersized_limit → kubectl patch Deployment/StatefulSet memory limit (≤2× current, ≤80% of node allocatable, 1 bump/pod/24h); traffic_spike → annotate HPA to trigger scale-out, fallback to limit bump; node_pressure → kubectl cordon then kubectl drain (--ignore-daemonsets, PDB-aware); runtime_misconfig → generate -Xmx/GOMEMLIMIT/NODE_OPTIONS fix as PR description (never auto-applied); genuine_memory_leak → emit escalation bundle, open ticket
  5. Verify: pod reaches Ready; working-set < 80% of new limit; 0 OOMKill events in last 5 min; no cascading evictions on same node; retry once at 15s; escalate on double-fail
  6. Post-mortem: structured record — pod, namespace, node, signal, classification, old/new limit, actions_taken, limit_bumped, hpa_scaled, node_cordoned, escalated bool, time_to_resolve
kubernetes

node_not_ready

GA v0.1.28

Detects Kubernetes Node NotReady and kubelet failure across 6 signals (node_not_ready / node_disk_pressure / node_memory_pressure / node_pid_pressure / kubelet_unhealthy / node_network_unavailable), classifies root cause across 6 classes (transient_kubelet_hang / runtime_crash / resource_pressure_disk / resource_pressure_mem_pid / network_plugin_failure / hardware_cloud_failure), and autonomously cordons, drains (respecting PDBs), restarts kubelet or container runtime, prunes images for disk pressure, or restarts CNI daemonset pods. Cloud provider node replacement (AWS ASG, GCP MIG, Azure VMSS) is gated behind operator confirmation or BYTEPORT_AUTO_REPLACE_NODES=true. Hard guardrail: never drain more than 1 node concurrently, never touch control-plane nodes.

  1. Detect: poll NodeCondition transitions via kubectl; pull kubelet systemd status (systemctl is-active/is-failed); read recent kernel messages (journalctl -k) for OOM/hardware errors
  2. Diagnose: kubectl describe node (conditions, allocatable vs. capacity, taints, events), journalctl -u kubelet --since '30 min ago' (last 200 lines), container runtime status (systemctl status containerd/crio), inode exhaustion (df -i), fd exhaustion (/proc/sys/fs/file-nr), image disk usage (crictl/docker images)
  3. Classify (first match, confidence ≥0.70): hardware_cloud_failure (kernel OOM/MCE/panic, conf=0.94, always escalate); network_plugin_failure (NetworkUnavailable + CNI errors, conf=0.88); runtime_crash (containerd/crio unit failed, conf=0.91); resource_pressure_disk (DiskPressure condition or disk ≥85% or inode_free ≤5%, conf=0.90); resource_pressure_mem_pid (MemoryPressure/PIDPressure condition, conf=0.88); transient_kubelet_hang (inactive but not failed, restart_count ≤2, conf=0.82); unknown (<0.70 confidence, always escalate)
  4. Remediate — cordon node first, then per class: transient_kubelet_hang → systemctl restart kubelet; runtime_crash → drain → systemctl restart containerd/crio then kubelet; resource_pressure_disk → drain → crictl/docker image prune + delete evicted pods + restart kubelet; resource_pressure_mem_pid → drain → restart kubelet (node replacement if memory ≥95%); network_plugin_failure → delete CNI daemonset pod on node (flannel/calico/cilium/weave); hardware_cloud_failure → escalate with full bundle, cloud-provider node replacement if BYTEPORT_AUTO_REPLACE_NODES=true
  5. Verify: node returns to Ready within 5 min window; no new eviction events for 5 min; no PDB violations; pods rescheduled and Running/Ready; retry once at 15s; escalate on double-fail
  6. Post-mortem: structured record — node_name, signal, classification, confidence, root_cause, actions_taken, cordon/drain/restart timeline, pods_evicted, pods_rescheduled, time_to_recovery_seconds, escalated bool
tls

tls_cert_expiry_auto_renewal

GA v0.1.27

Detects TLS certificate expiry across 6 signals (cert_expiring_soon at 30/14/7/1-day thresholds, cert_expired, cert_chain_invalid, ocsp_stapling_failure, acme_challenge_failure, renewal_rate_limit_hit), classifies the renewal approach across 7 classes (acme_http01, acme_dns01, cert_manager, acme_rate_limited, cloud_managed, commercial_ca, internal_ca), and autonomously renews via certbot, cert-manager CRD delete-and-reissue, or cloud API — then hot-reloads nginx/envoy/haproxy/traefik/k8s-ingress and verifies chain validity plus OCSP staple. Escalates immediately for pinned certs, wildcard certs without a configured DNS-01 provider, rate-limited domains, and commercial/internal CAs.

  1. Diagnose: openssl s_client -connect domain:443 -showcerts -status to fetch served cert; parse notAfter, issuer O=/CN=, SAN list, serial, OCSP Must-Staple extension, OCSP staple presence; detect wildcard (*.domain); locate cert files on disk (certbot live dir, /etc/ssl/certs, /etc/nginx/ssl); identify web server via pgrep (nginx/envoy/haproxy/traefik/caddy/k8s ingress); check DNS zone ownership (dig SOA); detect cert-manager Certificate CRD in cluster; estimate ACME rate-limit window remaining; check BYTEPORT_PINNED_DOMAINS for cert pinning.
  2. Classify (first match, confidence-gated ≥0.70): acme_rate_limited (signal=renewal_rate_limit_hit or window exhausted, conf=0.92); cert_pinned (BYTEPORT_PINNED_DOMAINS match, conf=0.95, always escalate); cert_manager (cert-manager CRD present, conf=0.93); cloud_managed (issuer O in {Amazon, Google Trust Services}, conf=0.88); acme_dns01 (wildcard or challenge_failure + dns_provider configured, conf=0.91); acme_http01 (LE/ZeroSSL issuer, non-wildcard, zone owned, conf=0.90); internal_ca (self-signed, conf=0.87); commercial_ca (fallback, conf=0.75).
  3. Remediate: acme_http01 → certbot renew --cert-name --webroot or --standalone --non-interactive [--server staging]; acme_dns01 → certbot renew --dns-{route53|cloudflare|google|azure}; cert_manager → kubectl delete secret -tls + kubectl annotate Certificate cert-manager.io/issue-temporary-certificate=true; cloud_managed → aws acm renew-certificate --certificate-arn $BYTEPORT_ACM_ARN; commercial_ca/internal_ca/cert_pinned/rate_limited → escalate with full cert bundle + retry ETA.
  4. Hot-reload (server-aware, no downtime): nginx → nginx -s reload; caddy → caddy reload --config /etc/caddy/Caddyfile; haproxy → haproxy -sf $(cat /run/haproxy.pid); envoy/traefik → kill -HUP ; k8s ingress → kubectl rollout restart deployment/; unknown server → escalate.
  5. Verify (wait_seconds=10): re-fetch via openssl s_client; confirm notAfter > now+30d; validate chain (openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt); check OCSP staple if Must-Staple set; confirm serial changed from pre-renewal; handshake_ok. On failure: restore cert backup + reload + escalate.
  6. Post-mortem: structured record — domain, signal, classification, issuer, challenge_used, tool_used, days_remaining_before/after, chain_valid_after, ocsp_ok_after, resolved bool, time_to_resolve_seconds, escalated bool.
deploys

failed_deploy_auto_rollback

GA v0.1.26

Detects failed Kubernetes deployments across 5 signals (deploy_health_regression / readiness_probe_failure_post_deploy / deploy_5xx_spike / canary_slo_burn / rollout_stuck), classifies root cause with a false-attribution guard, and autonomously rolls back to the last-known-good revision with a hard guardrail of max 1 rollback/deployment/hour. If the rollback itself fails, escalates immediately with a full diagnostic bundle including the suspect commit SHA.

  1. Diagnose: kubectl rollout history to enumerate revisions, identify current and last-good revision, extract commit SHA from CHANGE-CAUSE annotation, count ready vs. total pods (kubectl get deployment -o json), discover downstream Services via label-selector overlap, detect false attribution (regression predating rollout by >5 min in Warning events)
  2. Classify (first match, confidence ≥0.70): do_not_rollback (regression predates deploy — false attribution, conf=0.85); rollout_stuck (progressDeadlineExceeded + 0 ready pods, conf=0.91); readiness_never_ready (readiness probe failures on new ReplicaSet, conf=0.88); deploy_5xx_spike (HTTP 5xx on new revision, conf=0.90); canary_slo_burn (canary burns SLO budget faster than baseline, conf=0.87); deploy_health_regression (error rate or p95 latency spike post-rollout, conf=0.86). Falls back to escalate when no rollback target can be safely determined.
  3. Plan: emit rollback plan (kubectl rollout undo --to-revision=, expected pod churn, estimated duration, rollback-fails-then-escalate tripwire); post plan to ops channel in dry_run mode
  4. Execute: kubectl rollout undo deployment/ -n --to-revision=; watch ReplicaSet ready count; verify failure signal clears within 5-minute window; if rollback kubectl exits non-zero → escalate immediately (no retry)
  5. Verify: deployment Available=True + Progressing=True, readyReplicas == spec.replicas; retry once at 60s if first check fails; escalate on double-fail with full diagnostic bundle
  6. Post-mortem: structured record — signal, classification, suspect_revision, last_good_revision, suspect_image, last_good_image, commit SHA (from CHANGE-CAUSE annotation), rollback_duration_seconds, time_to_resolve_seconds, resolved bool
database

db_connection_pool_exhaustion

GA v0.1.25

Detects database connection pool exhaustion across Postgres, MySQL, and pgbouncer: classify root cause across 6 classes (leak_after_deploy / idle_in_tx_backlog / long_running_query / undersized_pool / missing_pgbouncer / runaway_cron_fanout), terminate idle-in-transaction sessions and long-running queries within hard safety caps, reload pgbouncer with raised pool size, trigger rolling restart of leaking deployments, and verify utilization drops below 70% with a structured postmortem including query fingerprints and permanent fix recommendation.

Prometheus Datadog Alertmanager GrafanaAlerting PagerDuty GenericWebhook
  1. Diagnose: query pg_stat_activity for full state breakdown (active / idle / idle-in-tx / waiting), capture top wait_events and long-running queries, pull pgbouncer SHOW POOLS + SHOW STATS, correlate connections_trend with last deploy SHA, inspect env vars for Hikari/SQLAlchemy/pgx pool config
  2. Classify (first match, confidence ≥0.70): leak_after_deploy (deploy marker + rising idle-in-tx trend, conf=0.88); idle_in_tx_backlog (≥3 sessions older than threshold, conf=0.91); long_running_query (≥2 queries over budget or 1 over 10min, conf=0.85); missing_pgbouncer (pgbouncer_waiting>0, conf=0.80); runaway_cron_fanout (single app_name holds ≥15 connections, conf=0.82); undersized_pool (utilization ≥85% with no dominant cause, conf=0.75)
  3. idle_in_tx_backlog: terminate eligible idle-in-transaction sessions via pg_terminate_backend, capped at kills_per_minute_cap (default 20/min), skipping pgbouncer/admin-proxy/superuser connections
  4. long_running_query: cancel over-budget queries via pg_cancel_backend (graceful), fall back to pg_terminate_backend if cancel fails, capture query text for EXPLAIN postmortem
  5. missing_pgbouncer / undersized_pool: send RELOAD to pgbouncer admin console to pick up raised default_pool_size from pgbouncer.ini
  6. leak_after_deploy: terminate old idle-in-tx sessions first, then trigger kubectl rollout restart on the offending deployment to drain all leaked connection handles
kubernetes

crash_loop_backoff

GA v0.1.24

Detects Kubernetes pods in CrashLoopBackOff and container restart loops, classifies root cause across 7 failure classes (oom_at_startup / bad_image_or_tag / failed_init_container / probe_misconfigured / missing_configmap_or_secret / app_panic_on_boot / node_pressure_eviction), and autonomously remediates within hard guardrails — max 1 rollback/service/hour, max 1 resource bump/pod/day, never delete PersistentVolumes.

  1. Diagnose: kubectl describe pod, previous + current container logs (last 2000 chars), exit code analysis (137 OOM / 139 segfault / 127 missing binary / 1 app error), event timeline, image pull status, probe definitions, resource requests vs node capacity, recent ConfigMap/Secret changes
  2. Classify (first match): node_pressure_eviction (node MemoryPressure/DiskPressure condition); bad_image_or_tag (ImagePullBackOff/ErrImagePull in events); missing_configmap_or_secret (missing K8s resource in events); failed_init_container (init container exit_code ≠ 0); oom_at_startup (exit code 137); probe_misconfigured (liveness/readiness probe fires before app ready, app logs show healthy startup); app_panic_on_boot (exit 139 or panic/fatal/exception in logs + restart_count ≥3)
  3. oom_at_startup: bump memory request/limit 1.5× (max 1 bump/pod/day), patch deployment, rolling restart
  4. bad_image_or_tag: kubectl rollout undo deployment (max 1 rollback/service/hour)
  5. failed_init_container: delete pod to re-trigger init if non-external dep; escalate if external dep connectivity failure
  6. probe_misconfigured: patch deployment with doubled initialDelaySeconds and failureThreshold=5
host

memory_pressure_oom

GA v0.1.23

Detects OOM-kill events and sustained memory pressure (≥90% RSS), classifies root cause across 5 classes (memory_leak / traffic_spike / undersized_limit / noisy_neighbor / runtime_misconfig), and autonomously remediates within hard guardrails — max 3 restarts/hour, max 2× limit bump, never touch stateful workloads without allowlist.

  1. Diagnose in parallel: cgroup memory.stat, /proc/meminfo, top-N processes by RSS, container memory.limit vs. usage, recent OOM-kill dmesg entries, k8s events, JVM jcmd GC.heap_info, Node.js heap snapshot endpoint
  2. Classify root cause (first match): memory_leak (RSS climb >2h, no traffic correlation), traffic_spike (RSS climb correlates with rps), undersized_limit (p99 RSS >80% of limit, steady-state), noisy_neighbor (node-level pressure, victim well-behaved), runtime_misconfig (JVM/Node heap > container limit)
  3. memory_leak: capture heap snapshot via jcmd GC.heap_dump or SIGUSR1 for artifact; restart with exponential backoff (30s × 2^N, cap 300s); kubectl rollout restart for k8s, systemctl restart for bare-metal
  4. traffic_spike: if HPA available, annotate deployment to trigger scale-out; else patch memory.limit upward (max 2× current, max policy-allowed); dry-run=server validation before applying
  5. undersized_limit: propose limit increase via PR to manifest repo with suggested value (p99 usage × 1.5); never auto-apply
  6. noisy_neighbor: cordon node, drain non-critical pods (--ignore-daemonsets --delete-emptydir-data --grace-period=60); page if drain fails
kubernetes

pod_crashloop

GA v0.1.4

Detects Kubernetes pods in crashloop and classifies the exit code (OOMKilled, missing binary, image auth, config error) before safely restarting with back-off guard.

  1. Fetch kubectl logs --previous to capture crash payload
  2. Classify exit code: 137=OOMKilled, 127=missing binary, 1=app error, 0=image auth failure
  3. For OOMKilled: inspect memory.limit vs usage, suggest adjustment if limit is within 20% of actual
  4. For transient errors: apply exponential back-off restart policy via kubectl rollout restart
  5. Safe-restart only if byteport.io/allow-remediation: "true" annotation present on namespace
kubernetes

oom_killed_pod

GA v0.1.5

Detects pods killed by the Kubernetes OOM killer (exit code 137) and safely patches memory limits upward, restarts the pod, and enables debug logging to capture future leak patterns.

  1. Read current memory.request and memory.limit from pod spec
  2. Compare against actual usage from kubectl top pod (Metrics Server required)
  3. If usage >80% of limit: suggest increasing memory.limit to usage × 1.5
  4. If leak signature detected: patch memory.limit upward and file postmortem
  5. If limits are appropriately sized: flag for app-level investigation and escalate
  6. Restart pod with DEBUG=true env var injection to capture future leak patterns
kubernetes

image_pull_backoff

GA v0.1.6

Detects Kubernetes pods failing to pull their container image (ErrImagePull, ImagePullBackOff) and surfaces the root cause — auth failure, bad tag, or network misconfiguration — with specific remediation steps.

  1. Classify failure type: auth (401/403), missing tag (404), network unreachable, or registry timeout
  2. For auth failures: list imagePullSecrets on the pod and identify the missing or misconfigured secret
  3. For missing tag: suggest the correct image tag from the previous deployment's image
  4. For network unreachable: check node DNS resolution and registry reachability
  5. For registry timeout: check if registry rate limit has been hit (Docker Hub unauthenticated)
  6. Suggest kubectl set image or kubectl patch secret commands for the specific failure type
kubernetes

pod_pending_unschedulable

Beta v0.1.11

Detects Kubernetes pods stuck in Pending state due to unschedulable conditions — resource exhaustion, storage constraints, affinity conflicts, or missing dependencies — and applies emergency scheduling workarounds where viable.

  1. Read Events for unschedulable reason: Insufficient memory/CPU, NoVolumeZoneConflict, AffinityConflict, NodeSelectorMismatch, or ResourceQuotaExceeded
  2. Check node allocatable vs. requested — identify if cluster has headroom on any node
  3. If cluster has headroom: attempt to relax PodAntiAffinity constraints to schedule on a less-contested node
  4. If priorityClass is low: suggest bumping to a cluster-appropriate priority to preemption-wins the scheduling slot
  5. Clear taints from target nodes where tolerations are misconfigured
  6. Trigger cluster autoscaling via cloud provider API if node pool has room to scale up
kubernetes

node_not_ready

Beta v0.1.11

Detects Kubernetes nodes entering NotReady, MemoryPressure, DiskPressure, or PIDPressure state and coordinates a safe node drain — cordoning the node, allowing graceful pod termination, and restoring schedulability once the node recovers.

  1. Check if node is already unschedulable (cordon happened) or still accepting new pods
  2. Identify critical pods with disruption budgets on the node
  3. If critical pods exist: wait up to 120 seconds for PodDisruptionBudget to be satisfied before draining
  4. Execute kubectl cordon NODE, then kubectl drain NODE --ignore-daemonsets --delete-emptydir-data --grace-period=60
  5. After drain completes: mark node as cordoned so new pods don't schedule
  6. Post-recovery: uncordon node when NodeReady status is restored for 60 continuous seconds
kubernetes

deployment_rollback

GA v0.1.2

Ingest failed deployment webhooks from CI systems (GitHub Actions, ArgoCD, Flux, Spinnaker) and automatically rolls back to the last healthy revision — with safety gates and commit-author notification.

  1. Read byteport.io/allow-rollback: "true" annotation on the Deployment or Namespace
  2. Inspect kubectl rollout history deployment/NAME -n NAMESPACE to confirm a stable previous revision exists
  3. Execute kubectl rollout undo deployment/NAME -n NAMESPACE
  4. Notify the commit author via Postmark with rollback confirmation, revision diff, and rollback duration
  5. Log the rollback to the signal event audit trail for postmortem correlation
infrastructure

cpu_saturation

GA v0.1.3

Detects sustained CPU saturation on Linux hosts (>85% for 90 continuous seconds) and safely restarts non-critical high-CPU processes — with oom_score_adj tuning as an intermediate step before full restart.

  1. Read /proc/stat to confirm sustained high CPU (distinguish spike from sustained load)
  2. Identify top CPU consumers via ps aux --sort=-%cpu | head -5
  3. Check for byteport.restart=safe label on identified processes — only restarts opted-in services
  4. Apply oom_score_adj reduction of -200 to -600 on non-critical high-CPU processes to reduce scheduling priority
  5. For memory-leak-driven CPU: identify via vmstat and suggest memory limit reduction
  6. Restart via systemd/service only if CPU does not drop below 80% within 2 minutes of oom_score_adj change
infrastructure

disk_pressure

GA v0.1.1

Auto-remediates disk pressure on Linux hosts: log rotation, Docker prune, /tmp cleanup. Threshold: disk_usage > 85% warning, > 95% critical.

  1. Rotate journal logs: journalctl --vacuum-size=50MB
  2. Delete rotated logs /var/log/*.gz older than 7 days
  3. Run docker system prune -af to remove dangling images and stopped containers
  4. Clear /tmp entries unused for over 48 hours
  5. Verify disk drops below 80% threshold within 2 minutes — otherwise escalates
infrastructure

disk-pressure-auto-cleanup

GA v0.1.16

End-to-end autonomous disk cleanup: diagnoses current disk usage and top space consumers, then runs ordered remediation steps — log rotation, journald vacuum, /tmp sweep, optional Docker prune, optional package cache prune — stopping as soon as usage drops below the configurable free-space floor (default 80%). Escalates with a per-step bytes-reclaimed summary and residual top-consumer paths if the floor is not restored after all safe steps.

  1. Diagnose: read mount usage % and top 10 space consumers via du on known-safe roots
  2. Rotate + gzip logs older than N days (default: 7d) in /var/log — configurable allowlist; hard safe-list protects /var/log/lastlog, /var/log/wtmp, active audit logs
  3. journalctl --vacuum-size=200M (configurable) to trim systemd journal
  4. Clear /tmp and /var/tmp files older than N hours (default: 24h)
  5. Docker system prune -af (opt-in: dockerPruneEnabled=true) — prunes dangling images and build cache if Docker socket present
  6. apt-get clean / yum clean all / dnf clean all (opt-in: packageCachePruneEnabled=true) — clears OS package manager caches
infrastructure

memory_pressure

GA v0.1.1

Detects host memory saturation (>90% for 90 seconds) and distinguishes page-cache pressure from actual memory exhaustion before safely restarting leaking processes.

  1. Inspect /proc/meminfo and cgroup stats to identify top resident-memory processes
  2. Check for byteport.restart=safe label on processes — only restarts opted-in services
  3. Attempt oom_score_adj reduction on non-critical heavy consumers to buy headroom
  4. Restart identified leaking process via its configured init system (systemd/service)
  5. Log post-restart memory baseline for comparison in auto-generated postmortem
tls

cert_expiry

GA v0.1.7

Detects SSL/TLS certificates expiring within 14 days (warning) or 7 days (critical) and triggers automated renewal via cert-manager, AWS ACM, or certbot — with pre/post snapshots and DNS validation checks.

  1. For cert-manager: patch Certificate resource with cert-manager.io/force-renew="true", wait for issuance and secret propagation
  2. For AWS ACM: call acm.RequestCertificate for the same domain, wait for DNS validation if pending, log new ARN
  3. For certbot/acme.sh: trigger certbot renew --quiet or acme.sh --renew -d domain.com, validate renewed cert serial
  4. For manual certs: escalate with full cert metadata, days-to-expiry, and renewal commands — no automated path
database

connection_pool_exhausted

GA v0.1.8

Detects when PostgreSQL or MySQL active connections exceed 80% of max_connections and identifies connection leaks (idle-in-transaction > 30s) vs. load-driven saturation — terminating leak backends and suggesting PgBouncer configuration.

  1. Query pg_stat_activity to identify idle_in_transaction sessions held >30s — treated as leaks if >50% of idle connections
  2. For connection leaks: identify backend PID with longest idle-in-transaction duration, capture query text, then terminate via pg_terminate_backend(PID) only if byteport.io/auto_remediate: "db_terminate" annotation present
  3. For load-driven saturation: identify top 3 longest-running queries via pg_stat_activity.duration, log PIDs and query text, suggest PgBouncer/PgCat via escalation
  4. Pre-termination audit: capture current connection count, top wait events from pg_stat_activity.wait_event, and application_name
  5. Post-fix verification: waits 30s then confirms active connections dropped below 70%
database

replication_lag

GA v0.1.9

Detects PostgreSQL replica lag exceeding 30 seconds for more than 60 seconds — distinguishing between network latency, disk I/O saturation, and long-running queries on the primary — and terminates blocking queries to resume replication.

  1. Query pg_stat_replication to identify lagging replicas: application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
  2. Compare send/replay LSN gap against pg_stat_database.conflict_count to classify root cause: replication_conflict, long_query, or wal_sender_hang
  3. For long-running query on primary blocking WAL application: identify via pg_stat_activity where query_start is far in the past and state != 'idle'
  4. Terminate long-running query on primary via pg_cancel_backend(PID) first (graceful), escalate to pg_terminate_backend after 60s
  5. Log pre-termination snapshot: query text, duration, PID, application_name, and which replica is lagging
deploys

deploy_failure

GA v0.1.10

Detects failed deploy workflows via the GitHub Actions API — classifies the failure (test_failure, build_failure, deploy_step_failure, infra_timeout), fetches failed job logs, triggers rollback workflow_dispatch when a previous successful run exists and rollback is enabled, or escalates with a structured summary.

  1. Fetch failed job logs via GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs
  2. Classify failure: test_failure (jest/vitest/pytest patterns) | build_failure (tsc/eslint/webpack) | deploy_step_failure (helm/kubectl/docker push) | infra_timeout (ETIMEDOUT/rate-limit)
  3. Check if a previous successful run exists — if yes and rollbackEnabled: true, dispatch rollbackWorkflowFile via workflow_dispatch with inputs triggered_by=byteport-agent and failed_run_id
  4. Dedup by run_id — each GitHub Actions run ID is emitted at most once per agent session
  5. If rollback not enabled or no previous success: escalate with structured summary including failure class, branch, actor, run URL
application

application_error

GA v0.1.1

Classifies application errors from Sentry by stack-trace heuristics (db_connection, oom, deploy_correlation, upstream_5xx, unknown), auto-rolls back regressions correlated with a recent deploy SHA, scales up for OOM patterns, and escalates novel errors with a structured summary.

  1. Classify error by stack trace: db_connection → correlate with db_connections_saturated runbook
  2. OOM pattern: trigger configurable scale-up command (kubectl scale or equivalent) — opt-in
  3. Regression + deploy SHA within 30-min window: dispatch rollback command — opt-in, default off
  4. Upstream 5xx: log circuit-breaker status and escalate with latency context
  5. Novel/unknown: open incident channel and escalate with full Sentry permalink, event count, and affected users
tls

ssl-cert-auto-rotate

Beta v0.1.15

End-to-end autonomous TLS certificate rotation: diagnoses cert management toolchain (cert-manager / Caddy / Traefik / certbot / IaC), validates ACME challenge via Let's Encrypt staging, backs up cert, rotates, and confirms via TLS handshake (3× retry). Escalates with full diagnostic if any step fails.

  1. Detect cert source: cert-manager (CRD), Caddy/Traefik (process detection), certbot/acme.sh (config dir), IaC (metadata flags)
  2. Validate ACME challenge path via Let's Encrypt staging dry-run (DNS-01 or HTTP-01) — abort and escalate if staging fails
  3. Backup current cert + key to /var/lib/byteport/cert-backups/-/ for 7-day rollback
  4. cert-manager: kubectl annotate certificate with cert-manager.io/force-renew, wait Ready=True (120 s)
  5. Caddy / Traefik: systemctl reload to trigger internal ACME renewal
  6. certbot / acme.sh: certbot renew --force-renewal
deployment

failed-deploy-auto-rollback

GA v0.1.17

Detects a bad deploy via health-check failure, error-rate spike, or crash-loop signal; automatically rolls back to the last green SHA on Render; verifies health green within 90 s; writes a structured postmortem stub. No human required for the majority of 2am deploy incidents.

  1. Diagnose: fetch last 5 deploy SHAs from Render API, confirm failing one is newest
  2. Decide: apply rollback if (a) error rate >5× baseline in 5-min post-deploy window, OR (b) health endpoint non-2xx for >2 consecutive minutes, OR (c) crash loop detected — otherwise escalate with diagnostics
  3. Execute rollback: call Render API to re-deploy the previous (last-green) SHA
  4. Verify: poll health endpoint every 15 s for up to 90 s post-rollback
  5. Postmortem stub: write structured log entry with failing SHA, rolled-back-to SHA, gate reason, TTR, diagnostic snapshot
infrastructure

memory_pressure_auto_recover

GA v0.1.18

Detects runaway memory consumers and autonomously restarts the offending service with deploy-correlation hand-off and crash-loop guard. Works across Render, Kubernetes, and bare-metal/systemd.

  1. Identify top 5 memory consumers via ps aux --sort=-%mem + cgroup memory.current
  2. Classify consumer: managed service (known to BytePort) vs unmanaged/system process
  3. Deploy correlation: if release shipped <60 min ago on the top consumer → hand off to failed-deploy-auto-rollback instead
  4. Crash-loop guard: abort and escalate if service restarted ≥2× in last 30 min
  5. Graceful restart via best-available method: Render Deploys API / kubectl rollout restart / systemctl restart
  6. Post-restart memory verify: re-sample after 3 s stabilization, confirm below recoveryThresholdPct (default 80%)
containers

container_restart_loop_breaker

GA v0.1.19

Breaks container restart loops by diagnosing root cause from logs, exit codes, and image SHA — then applying the correct class-specific fix: escalate on config/secret errors, chain on dependency failures, bump memory on OOM, rollback on port-bind or image-pull failures.

  1. Fetch last 200 log lines + last 5 exit codes + restart count + image SHA for diagnosis
  2. Classify failure: config_error | dependency_down | oom_at_startup | port_conflict | image_pull_failure | unknown
  3. Config error: check secret existence in orchestrator, surface exact diff — escalate, never auto-restore
  4. Dependency down: TCP health-check named upstreams, defer and chain incident if dependency is itself down
  5. OOM at startup: bump memory +25% (capped 2× original) via kubectl patch, redeploy, verify rollout
  6. Port bind / image pull: kubectl rollout undo (K8s) or Render Deploys API rollback to last-green SHA
database

db_connection_pool_exhausted

GA v0.1.20

Autonomous DB connection pool exhaustion recovery for Postgres, MySQL, and MongoDB: classify idle-in-transaction leak / long-running query / app-side leak / traffic spike / undersized pool, terminate connections within per-minute safety caps, raise pool ceiling within 20% headroom, re-poll utilization at 30s/90s/180s, file structured postmortem.

  1. Parallel fetch: active/idle/idle-in-tx counts, pool ceiling, top 10 long-running queries, client IPs holding connections
  2. Classify exhaustion into 5 classes (priority order): idle_in_transaction_leak → long_running_query → app_side_leak → traffic_spike → pool_undersized
  3. idle_in_transaction_leak: pg_terminate_backend for all iit sessions >5min old (capped at killsPerMinuteCap/min, skips critical allowlist)
  4. long_running_query: terminate queries exceeding runtimeBudgetSeconds (default 5min), capture query text for EXPLAIN postmortem
  5. app_side_leak: flag offending app_name/IP, signal app tier to recycle — never kill connections blindly
  6. traffic_spike: raise pool ceiling (max: dbMaxConnections × 0.8), flag non-critical routes for 503 shedding
host

disk_space_auto_reclaim

GA v0.1.21

Autonomous full disk-space reclaim: parallel diagnose (df/inode/du/docker/journald), rotate+compress logs (logrotate+gzip>100MB+journal vacuum 7d), evict container layers (docker system prune or crictl rmi --prune on K8s nodes), drop expired tmp+apt/yum/pip/npm caches, and optionally grow cloud-managed volumes (EBS ModifyVolume / GCP disks.resize +25%, capped at policy.maxVolumeGb). Stops as soon as disk usage drops below 80%. Escalates with +N GB recommendation based on current excess and top-5 largest dirs if usage still >85%.

  1. Diagnose in parallel: df -h + df -i (inodes), top-10 space consumers via du -sh, docker system df, journald --disk-usage
  2. Rotate+compress logs: force logrotate, gzip files >100MB older than 24h in /var/log, journalctl --vacuum-time=7d
  3. Evict container layers: docker system prune -af --filter until=72h + unused volumes (bare-metal/VMs); crictl rmi --prune on K8s nodes
  4. Drop expired caches: /tmp + /var/tmp files older than 7 days, apt/yum/dnf package cache, pip cache purge, npm cache clean --force
  5. Grow cloud volume (opt-in, policy.allowVolumeGrow=true): EBS ModifyVolume / GCP disks.resize +25% capped at policy.maxVolumeGb, then growpart + resize2fs/xfs_growfs
tls

tls_cert_expiry_auto_renew

GA v0.1.22

End-to-end autonomous TLS certificate expiry recovery: diagnose served cert via openssl s_client (issuer, notAfter, SANs, chain), classify renewal path (acme_renewable / cert_manager_managed / commercial_ca_manual / internal_ca / wildcard_dns01_required), renew via certbot/acme.sh (http-01 or dns-01 via Route53/Cloudflare/GCP) or cert-manager force-renew, hot-reload the serving process (nginx/caddy/envoy/haproxy/traefik/k8s ingress), then verify new notAfter > now + 30d + chain valid + handshake succeeds. Backs up cert+key before every swap; rolls back and escalates on verification failure. Refuses if DNS zone not owned.

  1. Diagnose: openssl s_client -connect DOMAIN:443 -servername DOMAIN -status -showcerts → parse notAfter, issuer O=/CN=, SAN list, OCSP staple, chain validity; locate cert files on disk + identify server process (nginx/caddy/envoy/haproxy/traefik/k8s/unknown)
  2. Classify (first match): cert_manager_managed if k8s env or --cert-name given; wildcard_dns01_required if *.domain with supported DNS provider; acme_renewable if Let's Encrypt / ZeroSSL issuer or cert in certbot live dir; internal_ca or commercial_ca_manual → escalate with full context
  3. Backup: copy cert + privkey to /var/lib/byteport/cert-backups/-/ before any write
  4. Renew acme_renewable: certbot renew --cert-name DOMAIN (http-01 default) or with --dns-PROVIDER for dns-01; acme.sh --renew as fallback
  5. Renew wildcard dns-01: certbot/acme.sh with --dns-route53 / --dns-cloudflare / --dns-gcp flag
  6. Renew cert_manager_managed: kubectl annotate certificate cert-manager.io/force-renew=true, kubectl wait --for=condition=Ready --timeout=120s