S4 Firewall
LLM token budget & runaway-loop control
An in-VPC forwarding proxy that puts a pre-emptive spend firewall in front of your LLM traffic. Your app only changes its base_url (OpenAI-compatible, Anthropic Messages-compatible, or Bedrock-compatible); every request runs a synchronous attribute → reserve → budget/anomaly → forward → reconcile pipeline. Its headline job is the runaway-loop circuit breaker: it deterministically blocks a request before it is relayed once a budget would be exceeded.
S4 Firewall is a forwarding proxy that puts a budget and a circuit breaker in front of your LLM token spend, with two honestly distinct layers. Layer 1, the hard cap, is deterministic and pre-emptive: cumulative spend is known, so any request whose reservation would push the running total past a configured hard cap is blocked before it is relayed — a 100% pre-emptive block of over-cap requests, fixed by chaos tests. Layer 2, the loop block, is best-effort and behavioral: it detects agent loops, near-duplicate call chains, and in-session amplification to bound the blast radius. Spend is attributed per feature / tenant / customer, and streaming responses pass through chunk by chunk so time-to-first-token is preserved.
The problem
LLM token spend is uniquely prone to runaway events: an agent loop, a chain of near-duplicate calls, or in-session amplification can burn through a month's budget before anyone notices. Because the number of output tokens is unknowable until the response returns, without a mechanism that stops spend before the bill is incurred you are left reconstructing after the fact which feature, tenant, or customer ran up the cost. Your spend tracks an uncontrolled loop rather than the work you actually intended.
How it works
- 1
Just point your base_url at it
S4 Firewall is a forwarding proxy you run inside your own Amazon VPC. Your application changes only its base_url: the firewall accepts an OpenAI-compatible, Anthropic Messages-compatible, or Bedrock-compatible intake and relays each request to the upstream provider you already use. Streaming responses are passed through chunk by chunk without buffering, so time-to-first-token is preserved.
- 2
Attribute, reserve, and decide in one synchronous pipeline
Before relaying, every request runs a synchronous attribute -> reserve -> budget/anomaly -> forward -> reconcile pipeline. It attributes the request to a feature, tenant, and customer, reserves the worst-case cost (input tokens counted now, output priced at max_tokens times the output rate), checks the reservation against the hierarchy of budgets, then either forwards or blocks.
- 3
Two layers stop it, then reconcile against real usage
Layer 1, the hard cap, is deterministic and pre-emptive: any request whose reservation would exceed a configured cap is blocked before relay (a 100% pre-emptive block of over-cap requests, with the same state in producing the same decision out, fixed by chaos tests). Layer 2, the loop block, is best-effort and behavioral: it detects agent loops, near-duplicate call chains, and in-session amplification to bound the blast radius. When the response returns, the provider's reported usage is taken as the source of truth and reconciled against the reservation.
Highlights
Deterministic hard cap: blocks an over-budget request before it is relayed (tenant / feature / customer budget hierarchy).
Runaway-loop circuit breaker: detects agent loops and near-duplicate call chains to contain the blast radius.
Drop-in: OpenAI / Anthropic / Bedrock-compatible base_url, streaming passthrough (TTFT preserved), runs in your own VPC.
What's included
- Amazon Linux 2023 arm64 AMI (runs on c6g / c7g Graviton instances, metered per instance-hour)
- Forwarding proxy with OpenAI-compatible, Anthropic Messages-compatible, and Bedrock-compatible intake (apps change only their base_url; streaming passes through chunk by chunk, preserving TTFT)
- Layer 1 hard cap — a deterministic, pre-emptive circuit breaker that blocks any over-budget request before relay (100% pre-emptive block of over-cap requests, fixed by chaos tests)
- Layer 2 loop block — a best-effort, behavioral layer that detects agent loops, near-duplicate call chains, and in-session amplification to bound the blast radius (not a 100% guarantee; ships with a dry-run shadow mode)
- Reserve-then-reconcile token accounting — the reservation uses the worst case (input counted now, output priced at max_tokens times the output rate), then reconciles against the provider's reported usage as the source of truth
- Per-feature/tenant/customer token-spend attribution, emitted to Amazon CloudWatch (namespace S4/Firewall) and an optional counts-only audit ledger (never prompt or response bodies)
- One-click CloudFormation templates (cfn-single.yaml for a single instance, cfn-ha.yaml for a redundant fleet behind an internal load balancer), with a least-privilege IAM role and no separate control plane or database
Use cases
Stopping a runaway agent with a deterministic hard cap before it burns the month's budget
Teams that need to attribute token spend per feature, tenant, or customer — finer than IAM-principal granularity — and assign budgets accordingly
Governing LLM traffic without handing prompts or response bodies to a third party — a ledger that records token counts, not content
Putting a single in-VPC budget in front of OpenAI-, Anthropic-, and Bedrock-compatible traffic by pointing the app's base_url at the proxy
FAQ
Can it stop a runaway agent 100% of the time?
We keep two layers honestly distinct. Layer 1, the hard cap, is deterministic and pre-emptive: any request whose reservation would exceed a configured cap is blocked before relay, a 100% pre-emptive block of over-cap requests (same state in, same decision out, fixed by chaos tests). Layer 2, loop detection, is best-effort: a runaway is only knowable after a few calls, and those few are already billed, so it bounds the blast radius to a small number of requests or a small dollar amount rather than guaranteeing 100% prevention. It ships with a dry-run shadow mode so you can measure the false-block rate before you enforce.
If output tokens are unknown in advance, how does it stop spend before the bill?
It is reserve-then-reconcile, not a flat estimate. Before relay, the reservation uses the worst case — input tokens counted now, output priced at max_tokens times the output rate — to make the hard-cap decision. When the response returns, the provider's reported usage is taken as the source of truth and reconciled against the reservation. Token counts are normalized across providers and split into input, output, cached-read, and cache-write so accounting reflects each provider's real rate card.
Where are my prompts stored or sent?
S4 Firewall itself does not persist or transmit your prompts or responses. Its only outbound call is the provider request your application would have made anyway — the firewall adds no egress. The ledger and metrics carry token counts, not content (counts-not-content, fixed by property tests). Where prompts egress depends on the upstream you choose: routing to Amazon Bedrock through a VPC interface endpoint (PrivateLink, which this AMI can provision) keeps those calls inside your AWS boundary, while routing to a third-party provider on the public internet egresses to the internet and does not stay in your VPC.
Does it need a separate control plane or database?
No. There is no separate control plane and no external database. Budget state is held in-memory per instance and re-derived from zero on restart. The data plane is a single static binary running under a hardened systemd unit with zero elevated capabilities and a least-privilege IAM role (upstream model invocation, CloudWatch PutMetricData scoped to the S4/Firewall namespace, and write-only PutObject to the ledger bucket). There is no telemetry home-call and no license-key check.
How is it billed, and how do I deploy it?
Billing is AMI hourly (metered per instance-hour) with an annual contract option, running on c6g / c7g (Arm) instances. You deploy with the included CloudFormation templates — cfn-single.yaml for a single instance, cfn-ha.yaml for a redundant fleet behind an internal load balancer — which optionally create the Bedrock VPC interface endpoint. Then you simply point your application's base_url at the firewall.
Pricing model
Hourly software fee + EC2 (c6g class, Arm). Metered per instance type.
Other S4 products
S4 — Squished S3
Transparent GPU S3-compression gateway
S4 Logs
Archive CloudWatch Logs to zstd S3
S4 Metrics
Govern CloudWatch metric cardinality