kubernetes

crashloopbackoff_v2

GA v0.1.30

Detects CrashLoopBackOff and pod restart-loop incidents (10th pager class). 5-branch classifier: config_error_missing_env, image_pull_failure, liveness_probe_misconfigured, oom_at_startup, dependency_not_ready. Autonomous actions: rollback bad image tag, bump memory limit 1.5×, loosen probe timing, hold pod and wait for dependency healthcheck. Escalates on unknown classification or when guardrail ceilings are hit.

Trigger signals

K8sEvents Prometheus Alertmanager DatadogMonitors GrafanaAlerting PagerDuty

Decision flow

Detect: pod status.containerStatuses[].state.waiting.reason == CrashLoopBackOff; restart count ≥3 (warn), ≥10 (high), ≥30 (critical); exit code 137 for pod_not_ready_oom; Unhealthy events for probe failures
Diagnose: kubectl logs --previous (last 2000 chars), kubectl describe pod events (ImagePullBackOff, configmap/secret not found, Unhealthy), liveness/readiness probe spec (initialDelaySeconds, failureThreshold), ConfigMap/Secret existence, node MemoryPressure condition
Classify (first-match, confidence ≥0.70): oom_at_startup (exit 137, conf=0.93 → bump memory 1.5×); image_pull_failure (ImagePullBackOff in events, conf=0.94 → rollback); config_error_missing_env (ConfigMap/Secret not found, conf=0.91 → audit + surface); liveness_probe_misconfigured (probe too tight + healthy startup logs, conf=0.86 → loosen probe); dependency_not_ready (exit 0/1 + connection-refused patterns, conf=0.88 → hold + healthcheck)
Remediate: oom_at_startup → kubectl patch Deployment memory limit 1.5× (ceiling 2×, max 1/pod/24h); image_pull_failure → kubectl rollout undo Deployment (max 1/service/hour); config_error_missing_env → kubectl get configmaps/secrets audit, surface missing names (never auto-creates Secrets); liveness_probe_misconfigured → kubectl patch Deployment livenessProbe.initialDelaySeconds=max(current×2,30) failureThreshold=5; dependency_not_ready → kubectl annotate pod with dependency-hold-since, probe dependency endpoints
Verify: pod reaches Ready within 5-min stabilisation window; restart count must not increment; no Unhealthy probe events
Escalate: unknown classification after 2 passes; memory ceiling hit; dependency healthcheck timeout; rollback rate limit reached

Safety guardrails

Max 1 rollback per service per 1-hour window (_rollback_history in-process dict)
Max 1 memory limit bump per pod per 24h (_memory_bump_history in-process dict)
Memory bump ceiling: never exceed 2× original limit (OOM_LIMIT_BUMP_CEILING_MULTIPLIER=2.0)

BytePort autonomously remediates 27+ incident classes

crashloopbackoff_v2

oomkilled_memory_pressure

node_not_ready

tls_cert_expiry_auto_renewal

failed_deploy_auto_rollback

db_connection_pool_exhaustion

crash_loop_backoff

memory_pressure_oom

pod_crashloop

oom_killed_pod

image_pull_backoff

pod_pending_unschedulable

node_not_ready

deployment_rollback

cpu_saturation

disk_pressure

disk-pressure-auto-cleanup

memory_pressure

cert_expiry

connection_pool_exhausted

replication_lag

deploy_failure

application_error

ssl-cert-auto-rotate

failed-deploy-auto-rollback

memory_pressure_auto_recover

container_restart_loop_breaker

db_connection_pool_exhausted

disk_space_auto_reclaim

tls_cert_expiry_auto_renew

New runbooks ship every week