S4 Weights

Lossless GPU checkpoint compression

Bit-exact lossless compression
AWS service it replaces: PyTorch checkpoint storage in S3
Get it on AWS Marketplace

A lossless GPU compression codec for PyTorch training checkpoints (weights + optimizer state). It splits each tensor into byte planes and routes the exponent plane through ANS and the mantissa plane through GDeflate (nvCOMP) on the GPU, staying bit-exact for bf16, fp16, and fp32. Compressed checkpoints persist to your own S3 bucket. Delivered as a g6 / g6e GPU AMI.

S4 Weights is a transparent, lossless codec for the checkpoints a training run writes — the model weights and optimizer state. Restore is byte-for-byte identical, verified against adversarial bit patterns (NaN, ±Inf, denormal, -0.0) on every supported dtype. For runs that checkpoint frequently it also compresses the byte-XOR delta between consecutive checkpoints. The AMI build itself fails if a compress→decompress round-trip is not bit-exact on the build GPU, so a broken plane reassembly never reaches a customer image. Your checkpoints never leave your own VPC and S3 bucket.

The problem

Large PyTorch training runs write checkpoints (the model weights and the optimizer state) frequently, so the storage they occupy in Amazon S3 and the bytes you transfer keep growing. Bigger checkpoints also mean longer save/load stalls that pause training on every save, so storage cost and GPU idle time grow together. These checkpoints hold the raw bf16/fp16/fp32 values, where any loss or quantization is unacceptable, so compression has to be lossless to be trusted.

How it works

  1. Split into byte planes on the GPU

    Each tensor is split on the GPU into sign, exponent, and mantissa planes, and each plane is routed to the codec that suits it: the exponent plane goes through ANS entropy coding, the mantissa plane through GDeflate, and the sign plane is bit-packed. The codecs are built on NVIDIA nvCOMP.

  2. Compress the delta between checkpoints

    For runs that checkpoint frequently, S4 Weights also stores and compresses the byte-XOR delta between consecutive checkpoints, which is much smaller when most weights barely move between saves.

  3. Restore bit-exact, persist to your own S3

    Restore is always byte-for-byte identical, verified against adversarial bit patterns (NaN, ±Inf, denormal, -0.0) on every supported dtype. Compressed checkpoints persist to the S3 bucket you configure and never leave your account.
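
The plane split in step 1 can be sketched on the CPU. This is a minimal illustration for bf16 only, not the S4 Weights implementation: `f32_to_bf16_bits`, `split_planes`, `merge_planes`, `compress`, and `decompress` are hypothetical names, and `zlib` stands in for the ANS and GDeflate codecs that the product runs on the GPU through nvCOMP.

```python
import struct
import zlib


def f32_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of an fp32 value, which is a bf16 bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def split_planes(words):
    """Split 16-bit bf16 words into sign, exponent, and mantissa planes."""
    sign_bits = bytearray((len(words) + 7) // 8)  # 1 bit per value, bit-packed
    exponent = bytearray(len(words))              # 8 exponent bits per value
    mantissa = bytearray(len(words))              # 7 mantissa bits per value
    for i, w in enumerate(words):
        sign_bits[i // 8] |= (w >> 15) << (i % 8)
        exponent[i] = (w >> 7) & 0xFF
        mantissa[i] = w & 0x7F
    return bytes(sign_bits), bytes(exponent), bytes(mantissa)


def merge_planes(sign_bits, exponent, mantissa):
    """Reassemble the original 16-bit words from the three planes."""
    return [
        (((sign_bits[i // 8] >> (i % 8)) & 1) << 15) | (exponent[i] << 7) | mantissa[i]
        for i in range(len(exponent))
    ]


def compress(words):
    # Assumption: zlib is only a placeholder for the per-plane codecs.
    return [zlib.compress(plane, 9) for plane in split_planes(words)]


def decompress(packed):
    sign_bits, exponent, mantissa = (zlib.decompress(p) for p in packed)
    return merge_planes(sign_bits, exponent, mantissa)


words = [f32_to_bf16_bits(x) for x in (0.5, -0.25, 1.0, -0.0)]
assert decompress(compress(words)) == words  # bit-exact round trip
```

Comparing the reassembled integers rather than floating-point values is what makes the check bit-exact: `-0.0` and NaN payloads survive only if every plane is put back in its original position.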

Highlights

Lossless (byte-for-byte) for bf16 / fp16 / fp32, verified against adversarial NaN / ±Inf / denormal / -0.0.

Byte-plane split: exponent → ANS, mantissa → GDeflate on the GPU (nvCOMP); also compresses the delta chain between checkpoints.

Compressed checkpoints land in your own S3 bucket; the g6 / g6e GPU AMI runs inside your own VPC.

What's included

  • GPU lossless checkpoint codec — bit-exact compression of PyTorch training checkpoints (model weights and optimizer state) for bf16/fp16/fp32
  • Byte-plane data path (exponent plane → ANS, mantissa plane → GDeflate, sign plane bit-packed), running on the GPU and built on NVIDIA nvCOMP
  • Byte-XOR delta chain between consecutive checkpoints, for extra savings on frequent saves; a compressed blob never grows beyond its input size plus a small fixed header
  • Drop-in PyTorch API: transparent s4weights.save / s4weights.load plus a base→delta checkpoint store (save_checkpoint / load_checkpoint)
  • Adversarial bit-pattern verification: NaN / ±Inf / denormal / -0.0 checked on every supported dtype, and the AMI build itself fails unless a GPU compress→decompress round-trip is bit-exact
  • Ubuntu 22.04-based g6/g6e GPU AMI with an end-to-end CloudFormation template (deploy/cfn-train-runner.yaml); the runner is a non-root systemd service serving a health endpoint on TCP 8080
  • Persistence to your own S3 registry bucket (your data never leaves your account) plus a fail-closed RegisterUsage entitlement check at boot
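
The byte-XOR delta and the "never grows beyond a small fixed header" property can be illustrated with a short sketch. Everything here is an assumption made for illustration: the one-byte tag layout, the names `encode_delta` and `decode_delta`, and the use of `zlib` in place of the GPU codec. The actual blob format is not described on this page.

```python
import zlib

RAW, XOR_DELTA = 0, 1  # hypothetical 1-byte header tag


def encode_delta(prev: bytes, curr: bytes) -> bytes:
    """Encode curr against prev, falling back to raw bytes when the delta does not help."""
    if len(prev) == len(curr):
        delta = bytes(a ^ b for a, b in zip(prev, curr))
        packed = zlib.compress(delta, 9)  # stand-in for the GPU codec
        if len(packed) < len(curr):
            return bytes([XOR_DELTA]) + packed
    # Worst case: the input itself plus the 1-byte header.
    return bytes([RAW]) + curr


def decode_delta(prev: bytes, blob: bytes) -> bytes:
    """Invert encode_delta; prev is only needed for XOR_DELTA blobs."""
    tag, body = blob[0], blob[1:]
    if tag == RAW:
        return body
    delta = zlib.decompress(body)
    return bytes(a ^ b for a, b in zip(prev, delta))
```

When most bytes are unchanged between two saves, the XOR delta is mostly zeros and compresses to a small fraction of the input. When the two checkpoints are unrelated, the raw fallback caps the output at the input size plus the header.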

Use cases

Large PyTorch training runs that checkpoint frequently and want to cut S3 storage cost and transferred bytes

Workloads where optimizer state dominates and you want shorter save/load stalls

Training pipelines that cannot tolerate any loss or quantization and need guaranteed bit-exact restore

Teams whose data compresses well, such as all-bf16 or low-precision-optimizer checkpoints

FAQ

Is the compression really lossless?

Yes. Restore is always byte-for-byte identical, verified for bf16/fp16/fp32 weights and fp32 optimizer state against adversarial bit patterns (NaN, ±Inf, denormal, -0.0). The AMI build itself fails unless a GPU compress→decompress round-trip is bit-exact, so a codec with broken plane reassembly never reaches a customer image.
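
To see why the check has to compare bytes rather than values, here is a minimal sketch of an adversarial round-trip test for fp32. The pattern table and the helper `round_trip_is_bit_exact` are illustrative assumptions, and `zlib` stands in for the codec under test.

```python
import struct
import zlib

# fp32 bit patterns that value-based comparisons tend to mishandle.
ADVERSARIAL_FP32 = {
    "quiet NaN": 0x7FC00000,
    "signaling NaN with payload": 0x7FA00001,
    "+Inf": 0x7F800000,
    "-Inf": 0xFF800000,
    "smallest denormal": 0x00000001,
    "-0.0": 0x80000000,
    "+0.0": 0x00000000,
}


def round_trip_is_bit_exact(raw: bytes, compress, decompress) -> bool:
    """Compare raw bytes, never float values: NaN != NaN and -0.0 == 0.0."""
    return decompress(compress(raw)) == raw


raw = b"".join(struct.pack("<I", bits) for bits in ADVERSARIAL_FP32.values())

# Value equality hides a sign flip on zero; byte equality does not.
assert 0.0 == -0.0
assert struct.pack("<f", 0.0) != struct.pack("<f", -0.0)

assert round_trip_is_bit_exact(raw, zlib.compress, zlib.decompress)
```

A build step in this spirit would run the same comparison for each supported dtype and abort on the first mismatch.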

How much will it compress?

It depends on the data. All-bf16 / low-precision-optimizer checkpoints compress well (and better at scale), while fp32-heavy checkpoints saved far apart compress little. We do not claim a fixed ratio. Compression is always lossless, and a compressed blob never grows beyond its input size plus a small fixed header.
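
One rough way to see why the ratio depends on the data is to measure the Shannon entropy of each byte position in a tensor. The sketch below does this for synthetic fp32 values drawn from a Gaussian; the synthetic weights and the helper names are assumptions, and the numbers it produces say nothing about the ratios S4 Weights achieves on real checkpoints.

```python
import math
import random
import struct
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; 8.0 leaves nothing for a byte-wise entropy coder."""
    counts = Counter(data)
    n = len(data)
    return sum(c / n * math.log2(n / c) for c in counts.values())


def fp32_plane_entropies(values):
    """Entropy of each little-endian byte position of the fp32 encoding."""
    raw = struct.pack(f"<{len(values)}f", *values)
    # Position 3 holds the sign and the high exponent bits; position 0 holds low mantissa bits.
    return [byte_entropy(raw[k::4]) for k in range(4)]


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(20000)]  # synthetic stand-in for a tensor
entropies = fp32_plane_entropies(weights)
for position, bits in enumerate(entropies):
    print(f"byte {position}: {bits:.2f} bits/byte")
```

On data like this, the byte carrying the exponent is highly predictable while the low mantissa bytes are close to random, which is consistent with fp32-heavy checkpoints leaving less to compress than bf16 ones.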

Where are my checkpoints stored?

Compressed checkpoints persist to the Amazon S3 registry bucket you configure and never leave your account. You launch the AMI inside your own VPC, and your PyTorch training code writes checkpoints with s4weights.save / s4weights.load (or the delta-chain save_checkpoint / load_checkpoint). These calls compress each tensor on the GPU and write the bit-exact compressed result to that bucket.

What does it run on, and how is it billed?

It runs on g6 or g6e GPU instances, wired end-to-end by the bundled CloudFormation template (deploy/cfn-train-runner.yaml). Billing is per-instance-hour with an annual option. AWS meters the running instance-hours automatically, and the runner calls RegisterUsage once at boot as a fail-closed entitlement check (an unentitled instance refuses to start).

Is it easy to integrate into my PyTorch training code?

It is drop-in. You write checkpoints with the transparent s4weights.save / s4weights.load, or use save_checkpoint / load_checkpoint for the base→delta checkpoint store. Each tensor is compressed on the GPU, and for frequent saves the byte-XOR delta between consecutive checkpoints is stored and compressed as well.

Pricing model

Hourly software fee plus the EC2 cost of the GPU instance (g6 / g6e). The fee is metered per instance type, and an annual option is available.
