Architecture

How BytePort works

BytePort is an autonomous ops agent that watches your infra, fixes what it can, and escalates only what it can't. Signal ingestion → runbook matching → safe action execution → verified outcome. No runbooks.yml to write, no on-call page to configure.

Signal to outcome in one loop

Every remediation cycle runs the same four-stage loop: Detect → Diagnose → Act → Verify. The agent never skips stages and never applies a fix it can't verify.
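The loop can be pictured as a small pipeline. The sketch below is illustrative only: `Signal`, `Outcome`, and `run_cycle` are hypothetical names, not BytePort's API, and the three stage callbacks stand in for whatever the agent does internally. It shows the one property stated above: an action that cannot be verified ends in escalation rather than being reported as a fix.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    RESOLVED = "resolved"
    ESCALATED = "escalated"


@dataclass
class Signal:
    name: str       # e.g. "disk_pressure"
    resource: str   # e.g. a host or pod name
    value: float    # the measurement that tripped the signal


def run_cycle(
    signal: Signal,
    diagnose: Callable[[Signal], Optional[str]],
    act: Callable[[str, Signal], bool],
    verify: Callable[[Signal], bool],
) -> Outcome:
    """Run Diagnose -> Act -> Verify for a signal that Detect already produced."""
    runbook = diagnose(signal)
    if runbook is None:
        return Outcome.ESCALATED   # no runbook matched
    if not act(runbook, signal):
        return Outcome.ESCALATED   # action failed or was refused
    if not verify(signal):
        return Outcome.ESCALATED   # outcome could not be verified
    return Outcome.RESOLVED


# Hypothetical usage with stub stages.
sig = Signal("disk_pressure", "host-1", 91.0)
result = run_cycle(sig, lambda s: "cleanup_logs", lambda r, s: True, lambda s: True)
```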

Worked examples

Three incidents, one loop

Each example below is a real runbook shipped in v0.1. The signal name, runbook name, and actions are live code, not mockups.

disk_pressure → cleanup_logs runbook

Signal detected

Disk usage on /dev/sda1 stays above 85% for 60 seconds. The metric is reported by CloudWatch (disk.percent), Datadog (system.disk.in_use), or GCP (agent.googleapis.com/disk/percent_used). tmpfs mounts are excluded.
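A "threshold held for a duration" rule of this kind can be sketched as a small stateful detector. This is a minimal illustration, not BytePort's detector: the class name `SustainedThreshold` and the reset-on-dip behavior are assumptions. The same shape would fit other signals with different numbers.

```python
from typing import Optional


class SustainedThreshold:
    """Fires once a metric has stayed above `threshold` for `hold_seconds`."""

    def __init__(self, threshold: float = 85.0, hold_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self._breach_start: Optional[float] = None

    def observe(self, value: float, timestamp: float) -> bool:
        """Feed one sample; return True when the breach has lasted long enough."""
        if value <= self.threshold:
            self._breach_start = None          # assumed: any dip resets the timer
            return False
        if self._breach_start is None:
            self._breach_start = timestamp     # first sample above the threshold
        return timestamp - self._breach_start >= self.hold_seconds


detector = SustainedThreshold()
readings = [(86.0, 0), (90.0, 30), (88.0, 60), (80.0, 70), (90.0, 80)]
fired = [detector.observe(value, ts) for value, ts in readings]
```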

Actions taken

1. journalctl --vacuum-size=50MB
2. Delete /var/log/*.gz older than 7 days
3. docker system prune -af
4. Clear /tmp files unused 48h+
Requires byteport.io/auto-remediate: "disk" on host
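Step 2 amounts to selecting rotated archives by age. The sketch below only lists candidates and deletes nothing; `stale_archives` is a hypothetical helper, and using the file's modification time as its "age" is an assumption.

```python
import time
from pathlib import Path
from typing import List, Optional

SEVEN_DAYS = 7 * 24 * 3600


def stale_archives(
    log_dir: Path,
    max_age_seconds: float = SEVEN_DAYS,
    now: Optional[float] = None,
) -> List[Path]:
    """Return *.gz files directly under log_dir modified more than max_age_seconds ago."""
    now = time.time() if now is None else now
    return sorted(
        path
        for path in log_dir.glob("*.gz")
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds
    )
```

Returning the list instead of deleting in place keeps the step easy to log and easy to run in dry-run mode.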

Outcome

Auto-resolved
Disk verified below 80% within 2 min. Postmortem auto-filed with action log and trend chart.

Escalated if
Disk stays above 80% or hits 95%+ — GitHub Issue + Postmark alert with next steps.
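The resolve-or-escalate decision can be written as a pure function over the readings taken during the verify window. This is a sketch under assumptions: `disk_verdict` and `Verdict` are hypothetical names, and treating any reading at 95% or above as an immediate escalation is one interpretation of the rule above.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    RESOLVED = "resolved"
    ESCALATE = "escalate"


def disk_verdict(
    samples: Sequence[float],
    target: float = 80.0,
    critical: float = 95.0,
) -> Verdict:
    """samples: disk-usage percentages read during the verify window, oldest first."""
    if not samples:
        return Verdict.ESCALATE            # nothing to verify against
    if max(samples) >= critical:
        return Verdict.ESCALATE            # hit 95%+ at some point
    if samples[-1] < target:
        return Verdict.RESOLVED            # latest reading is below 80%
    return Verdict.ESCALATE                # still above the target
```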

pod_crashloop → restart_with_backoff runbook

Signal detected

The pod restarts 3 or more times within 15 minutes. The K8s Events adapter reads the BackOff / CrashLoopBackOff reason directly, so no metrics backend is required. Exit codes are classified: 137 = killed by SIGKILL (typically OOMKilled), 127 = command not found (missing binary), 1 = general app error.
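The detection rule and the exit-code mapping can be sketched in a few lines. The function names and the `"unclassified"` fallback are hypothetical; only the thresholds and the three exit codes come from the description above.

```python
from typing import Iterable

# Exit codes named in the signal description. 137 is 128 + 9 (SIGKILL).
EXIT_CODE_CLASSES = {
    137: "oom_killed",
    127: "missing_binary",
    1: "app_error",
}


def classify_exit_code(code: int) -> str:
    """Map a container exit code to a class; unknown codes stay unclassified."""
    return EXIT_CODE_CLASSES.get(code, "unclassified")


def is_crashloop(
    restart_times: Iterable[float],
    now: float,
    window_seconds: float = 15 * 60,
    min_restarts: int = 3,
) -> bool:
    """True when at least min_restarts restarts fall inside the trailing window."""
    recent = [t for t in restart_times if 0 <= now - t <= window_seconds]
    return len(recent) >= min_restarts
```

An unclassified exit code would fall under the "classification fails" escalation path rather than trigger a restart.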

Actions taken

1. kubectl logs --previous captures crash payload
2. Classifies exit code and OOM signature
3. For transient errors: kubectl rollout restart with exponential back-off
4. For OOMKilled: patches memory.limit upward (≤ 2× current)
Max 1 restart per 3-min window. Requires namespace annotation.
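The two hard limits in this runbook, the 2× memory cap and the one-restart-per-3-minutes window, are simple to express as guards. The sketch below is an assumption about how such guards could look: `next_memory_limit` and `RestartLimiter` are hypothetical names, and the cap is taken relative to the limit at the time of the patch.

```python
from typing import Dict


def next_memory_limit(current_bytes: int, requested_bytes: int, max_factor: float = 2.0) -> int:
    """Raise the limit toward requested_bytes; never lower it, never exceed max_factor x current."""
    ceiling = int(current_bytes * max_factor)
    return max(current_bytes, min(requested_bytes, ceiling))


class RestartLimiter:
    """Allows at most one restart per pod within window_seconds."""

    def __init__(self, window_seconds: float = 180.0) -> None:
        self.window_seconds = window_seconds
        self._last_restart: Dict[str, float] = {}

    def allow(self, pod: str, now: float) -> bool:
        last = self._last_restart.get(pod)
        if last is not None and now - last < self.window_seconds:
            return False                   # still inside the 3-minute window
        self._last_restart[pod] = now      # record only restarts that are allowed
        return True


limiter = RestartLimiter()
decisions = [limiter.allow("api-7d9", t) for t in (0, 100, 180)]
```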

Outcome

Auto-resolved
Pod reaches Running state. Dedup: same pod + same exit code within 24h adds a comment, not a new issue.

Escalated if
Classification fails, annotation missing, or pod doesn't recover — GitHub Issue with exit code, log excerpt, and suggested fix.

memory_pressure → rolling_restart runbook

Signal detected

Host memory utilization exceeds 90% for 90 continuous seconds. Uses system.mem.used_pct (Datadog), mem/used (CloudWatch), or GCP agent.googleapis.com/memory/percent_used. Swap checked first to distinguish page-cache pressure from true exhaustion.

Actions taken

1. Reads /proc/meminfo + cgroup stats for top RSS processes
2. Raises oom_score_adj on non-critical consumers so the kernel OOM killer targets them first
3. Restarts leaking process via systemd (requires byteport.restart=safe label)
4. Logs pre/post memory baseline for postmortem
Never restarts a process without the opt-in label.
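Step 1 starts from /proc/meminfo. The sketch below parses that file's `Name: value kB` format and derives a used-memory percentage from MemTotal and MemAvailable. The helper names are hypothetical, and whether BytePort uses MemAvailable is an assumption. It is a reasonable choice because MemAvailable is the kernel's estimate of memory usable without swapping, which already counts reclaimable page cache.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo content into {field: integer value as reported}."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_percent(fields: Dict[str, int]) -> float:
    """Used memory as a percentage, based on MemTotal and MemAvailable."""
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


# Hypothetical sample content; on a Linux host this would be read from /proc/meminfo.
sample = (
    "MemTotal:       16000000 kB\n"
    "MemFree:          500000 kB\n"
    "MemAvailable:    1600000 kB\n"
    "HugePages_Total:       0\n"
)
info = parse_meminfo(sample)
```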

Outcome

Auto-resolved
Memory verified below 85% within 2 min of restart. Pre/post audit in postmortem.

Escalated if
Restart fails or memory doesn't recover — Postmark alert with host, top consumers, swap pressure, and suggested memory.limit adjustments.

Safety model

Explicit opt-in. Hard limits. Always audited.

BytePort never touches infrastructure without an explicit annotation on the resource. Every action is logged, every outcome verified. Dry-run mode available globally.
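The opt-in gate and dry-run mode can be pictured as a check that runs before any action. In the sketch below the annotation key is the one quoted in the disk_pressure example. Everything else is an assumption made for illustration: the helper names, the comma-separated value format, and the returned strings.

```python
from typing import Mapping

ANNOTATION_KEY = "byteport.io/auto-remediate"


def is_opted_in(annotations: Mapping[str, str], scope: str) -> bool:
    """True only when the resource carries the opt-in annotation for this scope."""
    value = annotations.get(ANNOTATION_KEY, "")
    # Assumed format: one scope or a comma-separated list, e.g. "disk,memory".
    allowed = {part.strip() for part in value.split(",") if part.strip()}
    return scope in allowed


def plan_action(
    annotations: Mapping[str, str],
    scope: str,
    command: str,
    dry_run: bool = True,
) -> str:
    """Decide what to do with a command; nothing is executed here."""
    if not is_opted_in(annotations, scope):
        return "skip: missing opt-in annotation"
    if dry_run:
        return f"dry-run: would run {command}"
    return f"run: {command}"
```

Defaulting `dry_run` to True means a caller has to opt in twice, once on the resource and once in configuration, before anything changes.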

Will do — without asking

Restart annotated pods and services via kubectl rollout restart
Clean rotated logs and prune Docker artifacts
Scale deployments up within configured replica bounds
Patch memory.limit upward (capped at 2× current value)
Terminate idle-in-transaction DB connections (annotation required)
Trigger cert-manager certificate renewal
Roll back to the last stable Kubernetes deployment revision
File auto-generated postmortems on resolution

Will NOT do — ever

Drop or truncate database tables
Delete persistent volumes or cloud storage buckets
Push code or modify application deployable artifacts
Act on any resource without the correct opt-in annotation
Force-delete a pod in Terminating state
Roll back a deployment managed by ArgoCD auto-sync
Terminate a DB connection with an open uncommitted transaction
Patch memory.limit beyond 2× its current value (the cap guards against runaway costs)

RBAC permissions are scoped to exactly the verbs each runbook needs. For the full permission matrix, see the RBAC definitions in the Helm chart at charts/byteport-agent.

Escalation

GitHub + Postmark. 24-hour dedup.

When BytePort can't auto-fix an incident, it hands off with full context — not just "something is wrong." Every escalation includes the full action audit log, the postmortem draft, and concrete next steps.

Remediation attempted

BytePort runs the matched runbook with all safety gates enforced. Each step is logged to agent_actions with status, result, and timestamp. The verify phase re-checks the signal 30–120 seconds after the action.

Dedup check — 24-hour window

Before opening a GitHub Issue, BytePort checks for an existing open issue with the same signal type, resource, and severity. If one was opened within the last 24 hours, it adds a comment with the new action log instead of opening a duplicate. This keeps the issue tracker's signal-to-noise ratio high.
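The dedup rule reduces to a lookup keyed on signal type, resource, and severity, bounded by a 24-hour window. A minimal sketch, assuming the open issues have already been fetched into memory; `OpenIssue` and `find_duplicate` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class OpenIssue:
    number: int
    signal: str
    resource: str
    severity: str
    opened_at: datetime


def find_duplicate(
    issues: Iterable[OpenIssue],
    signal: str,
    resource: str,
    severity: str,
    now: datetime,
    window: timedelta = timedelta(hours=24),
) -> Optional[OpenIssue]:
    """Return an open issue with the same key opened within the window, else None."""
    for issue in issues:
        same_key = (issue.signal, issue.resource, issue.severity) == (signal, resource, severity)
        if same_key and timedelta(0) <= now - issue.opened_at <= window:
            return issue
    return None


# Hypothetical data: one issue opened 5 hours ago, one 30 hours ago.
now = datetime(2024, 1, 2, 12, 0)
issues = [
    OpenIssue(41, "pod_crashloop", "api-7d9", "high", now - timedelta(hours=30)),
    OpenIssue(42, "pod_crashloop", "api-7d9", "high", now - timedelta(hours=5)),
]
match = find_duplicate(issues, "pod_crashloop", "api-7d9", "high", now)
```

A match means "comment on the existing issue"; `None` means "open a new one".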

GitHub Issue opened

The issue is labeled byteport, severity:{level}, and a signal-specific label (e.g. pod_crashloop). The body includes the resource name, namespace/host, signal value, what was tried, why it failed, and 3 recommended manual steps with exact commands.
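The issue contents described above can be assembled as plain data before any call to GitHub. The sketch below is illustrative: `build_issue`, the title format, and the body layout are assumptions. Only the label scheme and the list of body fields come from the description.

```python
from typing import Dict, List


def build_issue(
    resource: str,
    location: str,
    signal: str,
    signal_value: str,
    severity: str,
    tried: List[str],
    failure_reason: str,
    manual_steps: List[str],
) -> Dict[str, object]:
    """Assemble title, labels, and body for an escalation issue."""
    if len(manual_steps) != 3:
        raise ValueError("expected exactly 3 recommended manual steps")
    lines = [
        f"Resource: {resource}",
        f"Namespace/host: {location}",
        f"Signal: {signal} = {signal_value}",
        "",
        "What was tried:",
        *[f"- {step}" for step in tried],
        "",
        f"Why it failed: {failure_reason}",
        "",
        "Recommended manual steps:",
        *[f"{i}. {cmd}" for i, cmd in enumerate(manual_steps, start=1)],
    ]
    return {
        "title": f"[{signal}] {resource}",  # hypothetical title format
        "labels": ["byteport", f"severity:{severity}", signal],
        "body": "\n".join(lines),
    }


issue = build_issue(
    resource="api-7d9",
    location="prod",
    signal="pod_crashloop",
    signal_value="5 restarts in 15 min",
    severity="high",
    tried=["kubectl rollout restart"],
    failure_reason="pod did not reach Running",
    manual_steps=[
        "kubectl describe pod api-7d9 -n prod",
        "kubectl logs api-7d9 -n prod --previous",
        "kubectl get events -n prod",
    ],
)
```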

Postmark alert to on-call

Structured email via Postmark to the configured operator address. Includes the same context as the GitHub Issue plus a direct link to the issue, the signal event ID, and the auto-generated postmortem draft. Sent only once per escalation — not per check cycle.

Postmortem auto-published on resolution

When the signal clears (whether by BytePort or by a human), the postmortem is finalized in signal_events.postmortem_md and accessible at GET /incidents/:id/postmortem. Contains timeline, root cause, actions taken, and follow-up recommendations.

Ready to stop being paged at 3am?

BytePort installs in 5 minutes via Helm. Runs in dry-run mode by default — detecting and logging signals without touching your cluster until you opt in.