What BytePort handles without paging you
BytePort autonomously detects, diagnoses, and remediates these pager classes. We ship a new one every week.
Autonomy matrix
What BytePort does and when it stops to escalate. Updated June 2, 2026.
| Pager class | Detection signals | Autonomous actions | Escalation criteria | Runbook |
|---|---|---|---|---|
|
CrashLoopBackOff (v2)
kubernetes
|
crashloopbackoff
container_restart_threshold_exceeded (3/10/30 sub-severity)
pod_not_ready_oom (exit 137 at startup)
liveness_probe_failure
readiness_probe_failure
|
Rollback bad image tag (1/service/hour) · bump memory limit 1.5× for OOM-at-startup (1/pod/24h) · loosen liveness probe initialDelaySeconds · hold pod restart + wait for dependency healthcheck · audit missing ConfigMap/Secret (never auto-creates Secrets)
|
Unknown classification after 2 diagnostic passes · memory bump ceiling hit (1/pod/24h limit) · dependency healthcheck times out · rollback rate limit reached (1/service/hour) · confidence <0.70
|
v0.1.30
Source |
|
OOMKilled / Pod Memory Pressure
kubernetes
|
pod_oom_killed
memory_pressure_high
node_memory_pressure
cgroup_oom_event
jvm_heap_exhausted
gomemlimit_exceeded
|
Bump memory limit (≤2×, 1/pod/24h) · trigger HPA scale-out · cordon+drain memory-pressured node · surface runtime misconfig as PR proposal
|
Genuine memory leak (rising RSS, no traffic correlation) · confidence <0.70 · second limit-bump within 24h · verify fails twice
|
v0.1.29
Source |
|
Node NotReady / Kubelet Failure
kubernetes
|
node_not_ready
node_disk_pressure
node_memory_pressure
node_pid_pressure
kubelet_unhealthy
node_network_unavailable
|
Restart kubelet or container runtime · prune disk/images · restart CNI daemonset pod · cordon+drain (max 1 node concurrent)
|
Hardware / cloud failure (kernel OOM/MCE) · confidence <0.70 · control-plane node · PDB violation during drain
|
v0.1.11
Source |
|
TLS Certificate Expiry
tls
|
cert_expiring_soon (30/14/7/1d)
cert_expired
cert_chain_invalid
ocsp_stapling_failure
acme_challenge_failure
|
Renew via certbot / cert-manager / ACM · hot-reload nginx/envoy/haproxy/traefik/k8s ingress · verify chain and OCSP staple
|
Pinned cert · wildcard without dns-01 provider · rate-limited domain · commercial or internal CA · chain validation failure
|
v0.1.27
Source |
|
Failed Deploy Auto-Rollback
deploys
|
deploy_health_regression
readiness_probe_failure_post_deploy
deploy_5xx_spike
canary_slo_burn
rollout_stuck
|
kubectl rollout undo to last-known-good revision (max 1 rollback/deployment/hour) · watch pod recovery · emit postmortem
|
Regression predates deploy (false attribution) · rollback itself fails · confidence <0.70 · verify fails twice
|
v0.1.26
Source |
|
Database Connection Pool Exhaustion
database
|
db_pool_saturated
db_connection_timeout
db_too_many_connections
idle_in_transaction_backlog
pgbouncer_pool_full
|
Terminate idle-in-transaction sessions (cap 20/min) · cancel long-running queries · reload pgbouncer · rolling-restart leaking deployment
|
Kill cap (20/min) reached · superuser or pgbouncer session targeted · verify fails (util still >70%) · confidence <0.70
|
v0.1.25
Source |
|
CrashLoopBackOff
kubernetes
|
CrashLoopBackOff
container_restart_threshold_exceeded (>5/10min)
liveness_probe_failure
init_container_failure
|
Bump memory request 1.5× for OOM-at-startup · rollback bad image tag · loosen probe timing · cordon node on pressure eviction
|
External dependency failure in init container · missing ConfigMap/Secret · confidence <0.70 · daily resource-bump limit hit
|
v0.1.24
Source |
|
Host Memory Pressure / OOM
host
|
oom_killed
memory_pressure_high (≥90% RSS)
swap_thrash
pod_evicted_memory
jvm_heap_exhausted
|
Heap snapshot + restart with backoff · trigger HPA · patch limit (≤2×) · cordon+drain noisy-neighbor node
|
Genuine memory leak confirmed · stateful workload not in allowlist · verify fails (RSS stays >70%) · confidence <0.70
|
v0.1.23
Source |
|
CPU Saturation
infrastructure
|
cpu_throttle_high
load_avg_spike
process_cpu_saturation
scheduler_queue_backlog
|
Annotate HPA for scale-out · kill runaway process if confidence ≥0.85 · throttle cron fanout · surface over-provisioned limit as alert
|
No HPA present and no runaway process identified · kill confidence <0.85 · cluster-wide saturation (>80% nodes affected)
|
v0.1.3
Source |
|
Disk Pressure
infrastructure
|
disk_usage_high (≥85%)
inode_exhaustion
disk_io_saturation
disk_pressure (NodeCondition)
|
Delete old log files and WAL archives · prune dangling container images · clean package cache · alert on inode exhaustion
|
Cleanup insufficient (disk still >90% after prune) · inode exhaustion persists · confidence <0.70
|
v0.1.1
Source |
Coming next
Pager classes actively in development. Ship cadence: one per week.
CoreDNS failures, NXDOMAIN storms, UDP truncation, search-domain misconfig. Auto-restart CoreDNS, flush ndots cache, escalate on cluster-wide outage.
nginx-ingress / envoy pod crash, admission webhook failure, wildcard route misconfiguration. Rolling restart with config validation gate.
Kafka consumer lag, RabbitMQ queue depth, SQS age-of-oldest-message breach. Scale consumer pods, dead-letter storm detection, poison-pill quarantine.
Postgres streaming replication lag, MySQL replica delay, read-replica promotion. Auto-failover gate with split-brain detection and quorum check.
Intermediate CA rotation breaking leaf cert verification. Chain rebuild, trust-store reload, zero-downtime rotation via cert-manager CRD.
Don't see your pager class?
Tell us which alert woke you up last week. We'll prioritize it for next week's release.