BytePort is an autonomous ops agent that watches your infra, fixes what it can, and escalates only what it can't.
Signal ingestion → runbook matching → safe action execution → verified outcome. No runbooks.yml to write, no on-call page to configure.
Architecture
Signal to outcome in one loop
Every remediation cycle runs the same four-stage loop: Detect → Diagnose → Act → Verify.
The agent never skips stages and never applies a fix it can't verify.
Worked examples
Three incidents, one loop
Each example below is a real runbook shipped in v0.1. The signal name, runbook name, and actions
are live code — not mockups. Click any runbook name to see the full documentation.
Disk usage on /dev/sda1 crosses 85% for 60 seconds.
Reported by CloudWatch (disk.percent), Datadog (system.disk.in_use),
or GCP (agent.googleapis.com/disk/percent_used). Excludes tmpfs mounts.
Actions taken
1. journalctl --vacuum-size=50MB
2. Delete /var/log/*.gz older than 7 days
3. docker system prune -af
4. Clear /tmp files unused 48h+ Requires byteport.io/auto-remediate: "disk" on host
Outcome
Auto-resolved
Disk verified below 80% within 2 min. Postmortem auto-filed with action log and trend chart.
Escalated if
Disk stays above 80% or hits 95%+ — GitHub Issue + Postmark alert with next steps.
Host memory utilization exceeds 90% for 90 continuous seconds.
Uses system.mem.used_pct (Datadog), mem/used (CloudWatch), or GCP
agent.googleapis.com/memory/percent_used.
Swap checked first to distinguish page-cache pressure from true exhaustion.
Actions taken
1. Reads /proc/meminfo + cgroup stats for top RSS processes
2. Reduces oom_score_adj on non-critical consumers to buy headroom
3. Restarts leaking process via systemd (requires byteport.restart=safe label)
4. Logs pre/post memory baseline for postmortem Never restarts a process without the opt-in label.
Outcome
Auto-resolved
Memory verified below 85% within 2 min of restart. Pre/post audit in postmortem.
Escalated if
Restart fails or memory doesn't recover — Postmark alert with host, top consumers, swap pressure, and suggested memory.limit adjustments.
Safety model
Explicit opt-in. Hard limits. Always audited.
BytePort never touches infrastructure without an explicit annotation on the resource.
Every action is logged, every outcome verified. Dry-run mode available globally.
Will do — without asking
Restart annotated pods and services via kubectl rollout restart
Clean rotated logs and prune Docker artifacts
Scale deployments up within configured replica bounds
Patch memory.limit upward (capped at 2× current value)
Terminate idle-in-transaction DB connections with annotation
Trigger cert-manager certificate renewal
Roll back to the last stable Kubernetes deployment revision
File auto-generated postmortems on resolution
Will NOT do — ever
Drop or truncate database tables
Delete persistent volumes or cloud storage buckets
Push code or modify application deployable artifacts
Act on any resource without the correct opt-in annotation
Force-delete a pod in Terminating state
Roll back a deployment managed by ArgoCD auto-sync
Terminate a DB connection with an open uncommitted transaction
Patch memory.limit beyond 2× current to avoid runaway costs
RBAC permissions are scoped to exactly the verbs each runbook needs.
View the full permission matrix in the Helm chart:
charts/byteport-agent RBAC
Escalation
GitHub + Postmark. 24-hour dedup.
When BytePort can't auto-fix an incident, it hands off with full context —
not just "something is wrong." Every escalation includes the full action audit log, the postmortem draft, and concrete next steps.
1
Remediation attempted
BytePort runs the matched runbook with all safety gates enforced.
Each step is logged to agent_actions with status, result, and timestamp.
The verify phase re-checks the signal 30–120 seconds after the action.
2
Dedup check — 24-hour window
Before opening a GitHub Issue, BytePort checks for an existing open issue with the same signal type, resource, and severity.
If found within 24 hours: adds a comment with the new action log instead of opening a duplicate.
This keeps your issue tracker signal-to-noise ratio high.
3
GitHub Issue opened
Labeled byteport + severity:{level} + signal-specific label (e.g. pod_crashloop).
Body includes: resource name, namespace/host, signal value, what was tried, why it failed, and 3 recommended manual steps with exact commands.
4
Postmark alert to on-call
Structured email via Postmark to the configured operator address.
Includes the same context as the GitHub Issue plus a direct link to the issue, the signal event ID, and the auto-generated postmortem draft.
Sent only once per escalation — not per check cycle.
5
Postmortem auto-published on resolution
When the signal clears (whether by BytePort or by a human), the postmortem is finalized in signal_events.postmortem_md
and accessible at GET /incidents/:id/postmortem. Contains timeline, root cause, actions taken, and follow-up recommendations.
Ready to stop being paged at 3am?
BytePort installs in 5 minutes via Helm. Runs in dry-run mode by default —
detecting and logging signals without touching your cluster until you opt in.