S4 Scan
Athena scan-cost reducer
Cut your AWS Athena and Redshift Spectrum scan bill 30–70% by re-optimizing your S3 data-lake Parquet to the layout your queries actually read — partition and row-group pruning, dictionary encoding, zstd, and small-file compaction — applied through a safe shadow-then-swap that never changes query results.
S4 Scan learns from your Athena query history which columns are read and which predicates filter your tables, then rewrites the underlying S3 Parquet into the physical layout those queries want: hot columns first, sorted on the columns you filter by so row-group statistics prune, low-cardinality columns dictionary-encoded, files compacted to an efficient size, and compressed with zstd. Athena and Redshift Spectrum bill per terabyte scanned from S3, so a smaller, better-pruned layout is a direct, recurring bill reduction. Safety is the core of the product — your source is never overwritten.
The problem
Athena and Redshift Spectrum bill on the bytes actually read from S3 (default $5/TB, region-dependent). When your Parquet's physical layout doesn't match how queries access it, each query scans far more data from S3 than it needs to answer, and you pay for that gap every month. Because the bill is a function of bytes scanned, the layout itself sets your cost.
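As a rough illustration of that relationship, the sketch below prices a month of scanning before and after a layout change. The byte counts and the 60% reduction are made-up inputs, the $5/TB rate is the default quoted above, and 1 TB is treated as 10**12 bytes for simplicity; check your region's pricing for the exact figures.

```python
BYTES_PER_TB = 10**12  # assumption: decimal terabytes, for simplicity


def monthly_scan_cost(bytes_scanned, usd_per_tb=5.0):
    """Cost of scanning `bytes_scanned` bytes at a flat per-TB rate."""
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


# Hypothetical workload: 40 TB scanned per month today,
# 16 TB for the same queries after a 60% reduction.
before = monthly_scan_cost(40 * BYTES_PER_TB)
after = monthly_scan_cost(16 * BYTES_PER_TB)

print(f"before ${before:.2f}, after ${after:.2f}, saved ${before - after:.2f}")
# prints: before $200.00, after $80.00, saved $120.00
```

Because the rate is flat, any percentage cut in bytes scanned is the same percentage cut in the scan bill.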
How it works
1. Learn from query history
It reads your Glue catalog and Athena query history (workgroup metrics, CloudTrail) to learn which columns of each table are actually read and which predicates filter it.
2. Write to shadow, then verify
It writes the optimized Parquet to a separate shadow location and verifies the query results are identical down to values, NULLs, decimal scale, and timestamp timezone.
3. Swap via Glue, with rollback
Only after verification passes does it repoint the Glue table location (a catalog pointer move, not a data move), reversible with a single rollback command.
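To make the "pointer move" concrete, here is a minimal sketch of what repointing a Glue table location can look like. It is not S4 Scan's own code. The helper name `repointed_table_input`, the bucket paths, and the table names are hypothetical; the two boto3 calls in the comments (`get_table`, `update_table`) are the standard Glue client operations. The helper itself is pure Python, so it runs without AWS access.

```python
import copy

# Keys that Glue accepts in TableInput. get_table also returns read-only
# fields (DatabaseName, CreateTime, ...) that update_table rejects.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def repointed_table_input(table, new_location):
    """Build a TableInput identical to `table` except for its S3 location."""
    table_input = {
        key: copy.deepcopy(value)
        for key, value in table.items()
        if key in TABLE_INPUT_KEYS
    }
    table_input["StorageDescriptor"]["Location"] = new_location
    return table_input


# With boto3, the swap would look roughly like this (names are hypothetical):
#   glue = boto3.client("glue")
#   table = glue.get_table(DatabaseName="lake", Name="events")["Table"]
#   glue.update_table(
#       DatabaseName="lake",
#       TableInput=repointed_table_input(table, "s3://my-bucket/_shadow/events/"),
#   )
# Rolling back is the same call with the original location.

example_table = {
    "Name": "events",
    "DatabaseName": "lake",
    "StorageDescriptor": {"Location": "s3://my-bucket/events/", "Columns": []},
}
swapped = repointed_table_input(example_table, "s3://my-bucket/_shadow/events/")
print(swapped["StorageDescriptor"]["Location"])
```

Note that partitioned Glue tables also record a location per partition, so a real swap would need to repoint those entries as well; this sketch covers only the table-level location.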
Highlights
Safe by construction: optimized data is written to a shadow location, verified query-result-identical (values, NULLs, decimal scale, timestamp timezone), then swapped via the Glue catalog, with one-command rollback. Source data is never modified in place.
See the savings before you commit: a dry-run projects the monthly dollar reduction on your real tables and attributes it honestly to pruning, dictionary + zstd, and small-file compaction.
No lock-in, no babysitting: output is standard Athena-readable Parquet, and the AMI re-optimizes on an EventBridge schedule so savings compound automatically.
What's included
- Query-history analysis: derives hot columns, frequent predicates, and partition/sort-key candidates from your past queries into an optimization plan.
- Partition and row-group pruning: re-partitions on your predicate keys so whole files are skipped, and sorts on filter columns so min/max statistics let the engine skip non-matching row groups.
- Dictionary/RLE encoding + zstd: low-cardinality columns are dictionary-encoded and recompressed with standard, Athena-readable zstd to cut projected read bytes.
- Small-file compaction: merges many tiny objects into fewer, right-sized files for I/O efficiency and better statistics.
- Shadow → verify → swap with one-command rollback: the source is never modified in place, and the cutover is only a Glue pointer move.
- Dry-run dollar projection: with no writes, it reports current vs. optimized scan bytes and the monthly delta, honestly attributing the saving across partition pruning, row-group pruning, dictionary + zstd, and compaction.
- EventBridge-scheduled re-optimization, with output that is standard Athena-readable Parquet — no lock-in.
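To see why sorting on a filter column matters for row-group pruning, here is a toy model in plain Python. It does not read Parquet; it only mimics the min/max check a reader can apply per row group, using made-up values and a row-group size of 10 rows.

```python
def row_groups(values, size):
    """Split a column into consecutive row groups of `size` rows."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def groups_read(groups, lo, hi):
    """Count row groups whose [min, max] range overlaps the filter [lo, hi].

    A group is skippable only when its statistics prove it has no match.
    """
    return sum(1 for group in groups if min(group) <= hi and max(group) >= lo)


# 100 distinct values (0..99) in a scrambled order, as in an unsorted file.
scrambled = [(i * 37) % 100 for i in range(100)]

unsorted_reads = groups_read(row_groups(scrambled, 10), 20, 29)
sorted_reads = groups_read(row_groups(sorted(scrambled), 10), 20, 29)

print(unsorted_reads, sorted_reads)  # prints: 10 1
```

With scrambled data every row group spans almost the full value range, so none can be skipped for the filter `20 <= x <= 29`. After sorting, the matching rows sit in a single row group and the other nine are skipped.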
Use cases
High Athena/Redshift Spectrum scan bills you want to cut without rewriting any SQL.
Wide tables where queries read only a few columns and filter on a few predicates, yet still scan the whole file.
Data lakes fragmented into many small Parquet objects, hurting I/O efficiency and statistics.
Growing datasets that benefit from recurring, EventBridge-scheduled re-optimization.
FAQ
Could this corrupt my data or change my query results?
The source is never modified in place. Optimized data is written to a separate shadow location and verified result-identical — values, NULLs, decimal scale, timestamp timezone, nested types — before only the Glue pointer is moved. If verification fails it does not swap; it drops the shadow and leaves the source untouched. After a swap, one command restores the original location.
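The kind of check described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the product's verifier: the function names are hypothetical, and it assumes both result sets have already been fetched as lists of row tuples. Plain equality is not strict enough here, because Python treats `Decimal("1.0")` and `Decimal("1.00")` as equal, and likewise two timestamps at the same instant in different zones.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from decimal import Decimal


def canonical(value):
    """Map a value to a hashable form that keeps scale, timezone, and NULLs."""
    if value is None:
        return ("null",)
    if isinstance(value, Decimal):
        return ("decimal", value.as_tuple())  # exponent preserves the scale
    if isinstance(value, datetime):
        wall_clock = value.replace(tzinfo=None).isoformat()
        return ("timestamp", wall_clock, value.utcoffset())
    if isinstance(value, (list, tuple)):  # nested arrays
        return ("array", tuple(canonical(item) for item in value))
    if isinstance(value, dict):  # nested structs
        return ("struct", tuple(sorted((k, canonical(v)) for k, v in value.items())))
    return (type(value).__name__, value)


def results_identical(source_rows, shadow_rows):
    """Compare two result sets as multisets of rows, ignoring row order."""
    def as_multiset(rows):
        return Counter(tuple(canonical(v) for v in row) for row in rows)
    return as_multiset(source_rows) == as_multiset(shadow_rows)


utc_noon = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)
same_instant_tokyo = datetime(2024, 1, 1, 21, tzinfo=timezone(timedelta(hours=9)))

print(results_identical([(1, None), (2, "a")], [(2, "a"), (1, None)]))  # True
print(results_identical([(Decimal("1.0"),)], [(Decimal("1.00"),)]))     # False
print(results_identical([(utc_noon,)], [(same_instant_tokyo,)]))        # False
```

Rows are compared as a multiset because re-sorting the files can change the order of results for queries without an `ORDER BY`, while duplicates must still match in count.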
Am I locked into a proprietary format?
No. Output is only standard, Athena-readable Parquet (standard zstd, dictionary/RLE) with no proprietary codec. You can roll back at any time, and the optimized data reads with normal Athena tooling.
How does S4 Scan know what to optimize?
It learns from your Athena query history — which columns are read and which predicates filter your tables — and derives the column ordering, partition keys, and sort keys to recommend. It is based on real access patterns, not guesswork.
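As a simplified picture of that analysis, the sketch below ranks columns by how many queries read them and how many filter on them. It assumes the query history has already been parsed into records; the record shape (`columns_read`, `filter_columns`) and the sample queries are hypothetical, and extracting these fields from real SQL is the hard part that is left out.

```python
from collections import Counter


def rank_columns(query_log):
    """Return (read ranking, filter ranking) as lists of (column, query count)."""
    reads, filters = Counter(), Counter()
    for query in query_log:
        # Count each column at most once per query.
        reads.update(set(query["columns_read"]))
        filters.update(set(query["filter_columns"]))

    def by_count_then_name(item):
        return (-item[1], item[0])

    return (
        sorted(reads.items(), key=by_count_then_name),
        sorted(filters.items(), key=by_count_then_name),
    )


# Hypothetical, pre-parsed history for one table.
query_log = [
    {"columns_read": ["event_date", "user_id", "amount"], "filter_columns": ["event_date"]},
    {"columns_read": ["event_date", "country"], "filter_columns": ["event_date", "country"]},
    {"columns_read": ["event_date", "amount"], "filter_columns": ["event_date"]},
]

hot_columns, filter_columns = rank_columns(query_log)
print(hot_columns)     # [('event_date', 3), ('amount', 2), ('country', 1), ('user_id', 1)]
print(filter_columns)  # [('event_date', 3), ('country', 1)]
```

In this sample, `event_date` is both the most-read and the most-filtered column, which would make it a natural candidate for a partition or sort key.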
Can I see how much I'll save before committing?
Yes. A dry-run reports current vs. optimized scan bytes and the monthly dollar delta without writing anything, attributing the reduction honestly across partition/row-group pruning, dictionary+zstd, and compaction. The 30–70% headline is a range that depends on how mismatched your layout is — not a guarantee — so start with a dry-run on your own table.
Where does it run, and does my data leave my account?
It runs as an AMI inside your own AWS account and VPC, accessing only your S3, Glue, and Athena — your data never leaves. It operates hands-off on an EventBridge schedule under a least-privilege IAM role.
Pricing model
Hourly software fee + EC2 (t3 / m5 class), with a pay-per-GB-processed offer also available. Annual option available.
Other S4 products
S4 — Squished S3
Transparent GPU S3-compression gateway
S4 Logs
Archive CloudWatch Logs to zstd S3
S4 Metrics
Govern CloudWatch metric cardinality