S4 KV
int4 KV-cache compression for vLLM
A vLLM v1 KV-cache backend that serves more concurrent users and longer context per GPU. It quantizes the KV cache to int4 (per-channel KEY, per-token VALUE, KIVI-style), bit-identical to fp16 greedy decoding — about 3× more KV density. Launches as a turnkey, OpenAI-compatible vLLM endpoint via the included CloudFormation template.
S4 KV stores the vLLM v1 paged-attention KV cache in int4 instead of fp16 using KIVI-style quantization, bit-identical to fp16 greedy decoding (verified on Qwen2.5 1.5B/3B/7B and Llama-3.2-3B / 3.1-8B). You get roughly 3× more KV density — more concurrent requests and longer context on the same GPU — with no measured quality loss. Supported models are head_dim 128 with standard attention (Qwen2.5 and Llama-3 families); the backend fails fast at startup rather than serve incorrect output for an unsupported model. Priced on density and quality, not latency. Runs on g5, g6, g6e, p4d, and p5 GPU instances.
The problem
When serving LLMs on a GPU, the ceiling on concurrent requests and usable context length is usually the GPU memory consumed by the attention KV cache. Because that cache is held in fp16, it grows quickly as requests and context get longer, capping throughput unless you add larger, more expensive GPUs. Your per-GPU capacity ends up bound by memory rather than by compute.
How it works
- 1
Store the KV cache in int4
A KV-cache backend for vLLM v1 paged attention stores the attention KV cache in int4 instead of fp16. It uses KIVI-style quantization — per-channel asymmetric int4 for KEY and per-token int4 for VALUE.
- 2
Stay bit-identical to fp16
It is bit-identical to fp16 greedy decoding, verified on Qwen2.5 1.5B/3B/7B and Llama-3.2-3B / 3.1-8B. That yields roughly 3x more KV-cache density — more concurrent requests and longer context on the same GPU — with no measured quality loss.
- 3
Launch a turnkey endpoint
The image ships as an OpenAI-compatible vLLM endpoint with int4_kivi enabled, launched via the included CloudFormation template. Set ModelId to a head_dim-128 model (e.g. Qwen/Qwen2.5-7B-Instruct); the API is on port 8000 at /v1, with health at /health.
Highlights
int4 KIVI quantization, bit-identical to fp16 greedy decoding (verified on Qwen2.5 / Llama-3) — no measured quality loss.
~3× more KV density: more concurrent requests and longer context on the same GPU.
Turnkey OpenAI-compatible vLLM endpoint (CloudFormation included); fails fast at startup on an unsupported model.
What's included
- Ubuntu 22.04 GPU AMI (x86_64) — boots an OpenAI-compatible vLLM server with int4_kivi enabled out of the box
- An int4 KV-cache backend for vLLM v1 paged attention (KIVI-style: per-channel asymmetric int4 for KEY, per-token int4 for VALUE)
- About 3x KV-cache density (more concurrent requests and longer context per GPU, with no measured quality loss)
- Bit-identical match to fp16 greedy decoding (verified on Qwen2.5 1.5B/3B/7B and Llama-3.2-3B / 3.1-8B, greedy match 1.0000)
- OpenAI-compatible API (port 8000 at /v1, health at /health) — launched with a head_dim-128 ModelId
- CloudFormation template (deploy/cfn-gpu-serving.yaml) — creates the security group and IAM role and starts the vLLM OpenAI server
- Startup model validation — supports head_dim 128 with standard attention (Qwen2.5 and Llama-3 families); unsupported models fail fast at startup rather than serve incorrect output
Use cases
KV-cache-bound serving where you want to handle more concurrent users on a single GPU
Workloads that need longer per-request context without adding GPUs
Serving the Qwen2.5 or Llama-3 families (head_dim 128) behind an OpenAI-compatible API
Datacenter-GPU inference where you want higher per-GPU capacity (density) without sacrificing quality
FAQ
Does int4 hurt output quality?
There is no measured quality loss. The backend is verified bit-identical to fp16 greedy decoding, with a greedy match of 1.0000 on Qwen2.5 1.5B/3B/7B and Llama-3.2-3B / 3.1-8B. You get roughly 3x more KV-cache density without sacrificing quality.
Which models are supported?
Models with head_dim 128 and standard attention — the Qwen2.5 and Llama-3 families. For an unsupported model the backend fails fast at startup rather than serve incorrect output, so an incompatible model never silently produces wrong results.
Does it affect inference latency?
It is priced on density and quality, not latency. At long context on datacenter GPUs the backend trades some decode latency for density. It is best suited to workloads whose goal is more concurrent requests and longer context per GPU, and we are upfront about that trade.
Which GPU instances does it run on?
It runs on g5, g6, g6e, p4d, and p5 GPU instances. Choose the instance to match the throughput and model size you need.
How do I deploy it?
Launch it with the included CloudFormation template (deploy/cfn-gpu-serving.yaml): it creates the security group and IAM role and starts the vLLM OpenAI server with int4_kivi enabled. Set ModelId to a head_dim-128 model (e.g. Qwen/Qwen2.5-7B-Instruct); the OpenAI-compatible API is on port 8000 at /v1, with health at /health.
Pricing model
Hourly software fee + GPU EC2 (g5 / g6 / g6e / p4d / p5). Metered per instance type.
Other S4 products
S4 — Squished S3
Transparent GPU S3-compression gateway
S4 Logs
Archive CloudWatch Logs to zstd S3
S4 Metrics
Govern CloudWatch metric cardinality