How BytePort works

BytePort is an autonomous ops agent that watches your infra, fixes what it can, and escalates only what it can't. Signal ingestion → runbook matching → safe action execution → verified outcome. No runbooks.yml to write, no on-call page to configure.

Signal to outcome in one loop

Every remediation cycle runs the same four-stage loop: Detect → Diagnose → Act → Verify. The agent never skips stages and never applies a fix it can't verify.
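The four stages can be pictured as one function that always runs them in order. The sketch below is illustrative only: the Signal, Runbook, and run_cycle names are hypothetical and are not taken from BytePort's code. It assumes a runbook always ships with a verify step, which is one way to guarantee that no fix is applied without a check afterwards.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    AUTO_RESOLVED = "auto_resolved"
    ESCALATED = "escalated"


@dataclass
class Signal:
    name: str      # e.g. "disk_pressure"
    resource: str  # host or pod identifier
    value: float   # latest reading


@dataclass
class Runbook:
    # Hypothetical shape: a runbook cannot be built without a verify step.
    name: str
    act: Callable[[Signal], None]     # applies the fix
    verify: Callable[[Signal], bool]  # re-checks the signal after the fix


def run_cycle(signal: Signal,
              match: Callable[[Signal], Optional[Runbook]]) -> Outcome:
    """Detect -> Diagnose -> Act -> Verify, in order, with no stage skipped."""
    # Detect: the caller passes in a signal that already crossed its threshold.
    # Diagnose: pick a runbook; no match means there is nothing safe to run.
    runbook = match(signal)
    if runbook is None:
        return Outcome.ESCALATED
    # Act: apply the fix.
    runbook.act(signal)
    # Verify: only a passing re-check counts as resolved.
    if runbook.verify(signal):
        return Outcome.AUTO_RESOLVED
    return Outcome.ESCALATED
```

A signal with no matching runbook, or one whose verify step fails, ends in the escalated outcome rather than a silent retry.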

Pipeline diagram:
Signal sources: CloudWatch, GCP Monitoring, Datadog, K8s Events, Webhooks
Signal router: threshold eval, severity tagging, dedup window, annotation check
Runbook matcher: disk_pressure, pod_crashloop, memory_pressure, + 4 more
Safe action executor: annotation guard, dry-run available, cooldown enforced, verify outcome
Outcomes: Auto-resolved (postmortem filed) or Escalated (GitHub Issue + Postmark alert)
Verify loop: re-checks signal post-action

Three incidents, one loop

Each example below is a real runbook shipped in v0.1. The signal name, runbook name, and actions are live code, not mockups.

disk_pressure: cleanup_logs runbook
Signal detected
Disk usage on /dev/sda1 stays above 85% for 60 seconds. Reported by CloudWatch (disk.percent), Datadog (system.disk.in_use), or GCP (agent.googleapis.com/disk/percent_used). Excludes tmpfs mounts.
Actions taken
1. journalctl --vacuum-size=50M
2. Delete /var/log/*.gz older than 7 days
3. docker system prune -af
4. Clear /tmp files unused for 48+ hours
Requires byteport.io/auto-remediate: "disk" on host
Outcome
Auto-resolved
Disk verified below 80% within 2 min. Postmortem auto-filed with action log and trend chart.

Escalated if
Disk stays above 80% or hits 95%+ — GitHub Issue + Postmark alert with next steps.
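As a rough illustration of the two numeric checks in this runbook, the sketch below detects a sustained breach (above 85% for 60 seconds) and maps the post-cleanup reading onto the two outcomes. The thresholds come from the text above. The function names and the (timestamp, percent) sample format are assumptions made for the example.

```python
from typing import Iterable, Optional, Tuple

FIRE_PCT = 85.0       # signal fires above this value
SUSTAIN_SECONDS = 60  # and only after staying there this long
RESOLVED_PCT = 80.0   # a reading below this after cleanup counts as resolved


def sustained_breach(samples: Iterable[Tuple[float, float]]) -> bool:
    """samples: (unix_seconds, disk_percent) pairs in time order.

    Returns True once usage has stayed above FIRE_PCT for SUSTAIN_SECONDS
    without dipping back to or below the threshold.
    """
    breach_start: Optional[float] = None
    for ts, pct in samples:
        if pct > FIRE_PCT:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= SUSTAIN_SECONDS:
                return True
        else:
            # Any dip resets the clock, so short spikes never fire.
            breach_start = None
    return False


def verify_outcome(pct_after_cleanup: float) -> str:
    """Map the post-cleanup reading onto the two outcomes described above."""
    if pct_after_cleanup < RESOLVED_PCT:
        return "auto_resolved"
    # Still at or above 80% (this includes the 95%+ case): hand off to a human.
    return "escalated"
```

Resetting the timer on every dip is what keeps a brief spike during a log rotation from triggering a cleanup.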

pod_crashloop: restart_with_backoff runbook
Signal detected
Pod restarts ≥ 3 times in 15 minutes. K8s Events adapter detects BackOff / CrashLoopBackOff reason directly — no metrics backend required. Exit code classified: 137 = OOMKilled, 127 = missing binary, 1 = app error.
Actions taken
1. kubectl logs --previous captures crash payload
2. Classifies exit code and OOM signature
3. For transient errors: kubectl rollout restart with exponential back-off
4. For OOMKilled: patches memory.limit upward (≤ 2× current)
Max 1 restart per 3-min window. Requires namespace annotation.
Outcome
Auto-resolved
Pod reaches Running state. Dedup: same pod + same exit code within 24h adds a comment, not a new issue.

Escalated if
Classification fails, annotation missing, or pod doesn't recover — GitHub Issue with exit code, log excerpt, and suggested fix.
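The decision logic in this runbook is small enough to sketch. The exit-code table, the 2× memory cap, and the 3-minute restart window come from the text above. The function names, the MiB units, and the back-off base of 10 seconds are assumptions for illustration, not BytePort's actual values.

```python
from typing import Optional

# Exit codes named in the runbook description.
EXIT_CODE_CLASSES = {137: "oom_killed", 127: "missing_binary", 1: "app_error"}
RESTART_WINDOW_SECONDS = 180  # max 1 restart per 3-minute window


def classify_exit_code(code: int) -> str:
    """Return the crash class; "unknown" means classification failed (escalate)."""
    return EXIT_CODE_CLASSES.get(code, "unknown")


def next_memory_limit_mib(current_mib: int, requested_mib: int) -> int:
    """Raise the limit toward the request, never lowering it or passing 2x current."""
    return max(current_mib, min(requested_mib, 2 * current_mib))


def restart_allowed(now: float, last_restart: Optional[float]) -> bool:
    """Cooldown guard: allow a restart only if the window has elapsed."""
    return last_restart is None or now - last_restart >= RESTART_WINDOW_SECONDS


def backoff_seconds(attempt: int, base: float = 10.0, cap: float = 180.0) -> float:
    """Exponential back-off; base and cap are hypothetical defaults."""
    return min(cap, base * 2 ** attempt)
```

An "unknown" class maps to the "classification fails" escalation path, so an unrecognized exit code never triggers a restart or a limit patch.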

memory_pressure: rolling_restart runbook
Signal detected
Host memory utilization exceeds 90% for 90 continuous seconds. Uses system.mem.used_pct (Datadog), mem/used (CloudWatch), or GCP agent.googleapis.com/memory/percent_used. Swap checked first to distinguish page-cache pressure from true exhaustion.
Actions taken
1. Reads /proc/meminfo + cgroup stats for top RSS processes
2. Raises oom_score_adj on non-critical consumers so the kernel OOM killer targets them before critical processes
3. Restarts leaking process via systemd (requires byteport.restart=safe label)
4. Logs pre/post memory baseline for postmortem
Never restarts a process without the opt-in label.
Outcome
Auto-resolved
Memory verified below 85% within 2 min of restart. Pre/post audit in postmortem.

Escalated if
Restart fails or memory doesn't recover — Postmark alert with host, top consumers, swap pressure, and suggested memory.limit adjustments.
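To make the first and third actions concrete, the sketch below parses /proc/meminfo-style text and picks restart candidates. It assumes utilization is computed as (MemTotal - MemAvailable) / MemTotal, which is a common definition but may differ from what each metrics backend reports. The process record shape (unit, rss_kib, labels) is hypothetical. Only the byteport.restart=safe label comes from the text above.

```python
from typing import Dict, List


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value} (values mostly in kB)."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_percent(fields: Dict[str, int]) -> float:
    """Assumed definition: memory that is not available, as a share of total."""
    total = fields["MemTotal"]
    return 100.0 * (total - fields["MemAvailable"]) / total


def restart_candidates(processes: List[dict]) -> List[str]:
    """Largest RSS first, and only units that carry the opt-in label."""
    by_rss = sorted(processes, key=lambda p: p["rss_kib"], reverse=True)
    return [p["unit"] for p in by_rss
            if p.get("labels", {}).get("byteport.restart") == "safe"]
```

Filtering on the label after sorting means the biggest consumer is skipped, not restarted, when it has not opted in.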
Explicit opt-in. Hard limits. Always audited.

BytePort never touches infrastructure without an explicit annotation on the resource. Every action is logged, every outcome verified. Dry-run mode available globally.

Will do — without asking
  • Restart annotated pods and services via kubectl rollout restart
  • Clean rotated logs and prune Docker artifacts
  • Scale deployments up within configured replica bounds
  • Patch memory.limit upward (capped at 2× current value)
  • Terminate idle-in-transaction DB connections with annotation
  • Trigger cert-manager certificate renewal
  • Roll back to the last stable Kubernetes deployment revision
  • File auto-generated postmortems on resolution
Will NOT do — ever
  • Drop or truncate database tables
  • Delete persistent volumes or cloud storage buckets
  • Push code or modify application deployable artifacts
  • Act on any resource without the correct opt-in annotation
  • Force-delete a pod in Terminating state
  • Roll back a deployment managed by ArgoCD auto-sync
  • Terminate a DB connection with an open uncommitted transaction
  • Patch memory.limit beyond 2× its current value (the cap guards against runaway costs)
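One way to picture how the two lists combine is a single authorization check that runs before any action. In the sketch below, the annotation key byteport.io/auto-remediate comes from the disk_pressure runbook above. The action identifiers, the scope-matching rule, and the authorize function itself are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

REQUIRED_ANNOTATION = "byteport.io/auto-remediate"

# Hypothetical identifiers for actions on the "will NOT do" list.
FORBIDDEN_ACTIONS = frozenset({
    "drop_table", "truncate_table", "delete_persistent_volume",
    "delete_bucket", "push_code", "force_delete_pod",
})


@dataclass
class Resource:
    name: str
    annotations: Dict[str, str] = field(default_factory=dict)


def authorize(action: str, scope: str, resource: Resource,
              dry_run: bool = True) -> str:
    """Return "deny", "dry_run", or "execute" for a proposed action."""
    # Hard limits first: no annotation can unlock a forbidden action.
    if action in FORBIDDEN_ACTIONS:
        return "deny"
    # Explicit opt-in: the annotation must be present and match the scope.
    if resource.annotations.get(REQUIRED_ANNOTATION) != scope:
        return "deny"
    # Dry-run logs the decision without touching the resource.
    return "dry_run" if dry_run else "execute"
```

Checking the forbidden list before the annotation keeps the hard limits independent of any opt-in a user might add by mistake.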

RBAC permissions are scoped to exactly the verbs each runbook needs. The full permission matrix is in the Helm chart under charts/byteport-agent.

GitHub + Postmark. 24-hour dedup.

When BytePort can't auto-fix an incident, it hands off with full context — not just "something is wrong." Every escalation includes the full action audit log, the postmortem draft, and concrete next steps.

1. Remediation attempted
BytePort runs the matched runbook with all safety gates enforced. Each step is logged to agent_actions with status, result, and timestamp. The verify phase re-checks the signal 30–120 seconds after the action.

2. Dedup check (24-hour window)
Before opening a GitHub Issue, BytePort checks for an existing open issue with the same signal type, resource, and severity. If one is found within 24 hours, it adds a comment with the new action log instead of opening a duplicate. This keeps your issue tracker's signal-to-noise ratio high.

3. GitHub Issue opened
Labeled byteport + severity:{level} + signal-specific label (e.g. pod_crashloop). Body includes: resource name, namespace/host, signal value, what was tried, why it failed, and 3 recommended manual steps with exact commands.

4. Postmark alert to on-call
Structured email via Postmark to the configured operator address. Includes the same context as the GitHub Issue plus a direct link to the issue, the signal event ID, and the auto-generated postmortem draft. Sent only once per escalation, not once per check cycle.

5. Postmortem auto-published on resolution
When the signal clears (whether by BytePort or by a human), the postmortem is finalized in signal_events.postmortem_md and accessible at GET /incidents/:id/postmortem. Contains timeline, root cause, actions taken, and follow-up recommendations.
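The dedup check and the issue labels described in the escalation steps can be sketched as follows. The 24-hour window, the matching key (signal type, resource, severity), and the label scheme come from the text. The OpenIssue record and the choice to measure the window from the issue's opening time are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional

DEDUP_WINDOW = timedelta(hours=24)


@dataclass(frozen=True)
class OpenIssue:
    # Hypothetical record of an issue that is still open in the tracker.
    number: int
    signal_type: str
    resource: str
    severity: str
    opened_at: datetime


def find_duplicate(issues: Iterable[OpenIssue], signal_type: str, resource: str,
                   severity: str, now: datetime) -> Optional[OpenIssue]:
    """Return an open issue to comment on, or None if a new one is needed."""
    key = (signal_type, resource, severity)
    for issue in issues:
        same_key = (issue.signal_type, issue.resource, issue.severity) == key
        if same_key and now - issue.opened_at <= DEDUP_WINDOW:
            return issue
    return None


def issue_labels(signal_type: str, severity: str) -> List[str]:
    """Labels as described: byteport + severity:{level} + signal-specific label."""
    return ["byteport", f"severity:{severity}", signal_type]
```

When find_duplicate returns an issue, the new action log goes into a comment on it. When it returns None, a new issue is opened with the labels from issue_labels.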

Ready to stop being paged at 3am?

BytePort installs in 5 minutes via Helm. Runs in dry-run mode by default — detecting and logging signals without touching your cluster until you opt in.