S4 Scan

Athena scan-cost reducer

30–70% off scan cost

AWS service it replaces: Athena / Redshift Spectrum scan cost
Get it on AWS Marketplace

Cut your AWS Athena and Redshift Spectrum scan bill by 30–70% by re-optimizing the Parquet in your S3 data lake into the layout your queries actually read. S4 Scan applies partition and row-group pruning, dictionary encoding, zstd compression, and small-file compaction through a safe shadow-then-swap that never changes query results.

S4 Scan learns from your Athena query history which columns are read and which predicates filter your tables, then rewrites the underlying S3 Parquet into the physical layout those queries want: hot columns first, sorted on the columns you filter by so row-group statistics prune, low-cardinality columns dictionary-encoded, files compacted to an efficient size, and compressed with zstd. Athena and Redshift Spectrum bill per terabyte scanned from S3, so a smaller, better-pruned layout is a direct, recurring bill reduction. Safety is the core of the product — your source is never overwritten.
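To see why sorting on a filter column helps, here is a small pure-Python simulation of row-group pruning. It is only an illustration of the general min/max-statistics idea, not S4 Scan's code: the function names, the row-group size, and the synthetic data are all assumptions made for the example.

```python
import random

def row_group_stats(values, rows_per_group):
    """Split values into consecutive row groups and return (min, max) per group."""
    return [
        (min(values[i:i + rows_per_group]), max(values[i:i + rows_per_group]))
        for i in range(0, len(values), rows_per_group)
    ]

def groups_to_read(stats, lo, hi):
    """Count row groups whose [min, max] range overlaps the predicate lo <= x <= hi."""
    return sum(1 for mn, mx in stats if mx >= lo and mn <= hi)

random.seed(0)
values = list(range(100_000))   # synthetic filter column (assumption)
random.shuffle(values)          # unsorted layout: every group spans almost the full range

unsorted_stats = row_group_stats(values, 10_000)
sorted_stats = row_group_stats(sorted(values), 10_000)

# Predicate: WHERE x BETWEEN 42000 AND 43000
unsorted_read = groups_to_read(unsorted_stats, 42_000, 43_000)
sorted_read = groups_to_read(sorted_stats, 42_000, 43_000)

print("unsorted layout reads", unsorted_read, "of", len(unsorted_stats), "row groups")
print("sorted layout reads", sorted_read, "of", len(sorted_stats), "row groups")
```

With the sorted layout only the single row group covering 40,000 to 49,999 overlaps the predicate, while the shuffled layout gives the engine almost nothing to skip.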

The problem

Athena and Redshift Spectrum bill on the bytes actually read from S3 (default $5/TB, region-dependent). When your Parquet's physical layout doesn't match how queries access it, each query scans far more S3 than it returns — and you pay for that gap every month. Because the bill is a function of bytes scanned, the layout itself sets your cost.
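As a back-of-the-envelope sketch of that arithmetic, the snippet below converts bytes scanned into a monthly bill at the $5/TB default quoted above. The function names and the sample volumes are hypothetical, and it assumes decimal terabytes; check your region's rate and billing unit before relying on the numbers.

```python
PRICE_PER_TB_USD = 5.00   # default rate quoted above; region-dependent
BYTES_PER_TB = 10**12     # assumption: decimal terabytes

def monthly_scan_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Dollar cost of scanning the given number of bytes in a month."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb

def monthly_saving(current_bytes, optimized_bytes, price_per_tb=PRICE_PER_TB_USD):
    """Difference between the current and the optimized monthly scan cost."""
    return (monthly_scan_cost(current_bytes, price_per_tb)
            - monthly_scan_cost(optimized_bytes, price_per_tb))

# Hypothetical workload: 200 TB scanned per month today, 100 TB after re-layout.
current = 200 * BYTES_PER_TB
optimized = 100 * BYTES_PER_TB
print(monthly_scan_cost(current))          # 1000.0
print(monthly_saving(current, optimized))  # 500.0
```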

How it works

  1. Learn from query history

    It reads your Glue catalog and Athena query history (workgroup metrics, CloudTrail) to learn which columns of each table your queries actually read and which predicates filter it.

  2. Write to shadow, then verify

    It writes the optimized Parquet to a separate shadow location and verifies the query results are identical down to values, NULLs, decimal scale, and timestamp timezone.

  3. Swap via Glue, with rollback

    Only after verification passes does it repoint the Glue table location — a catalog pointer move, not a data move — reversible with a single rollback command.
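For readers curious what a catalog pointer move looks like, here is a minimal sketch using the boto3 Glue client's `get_table` and `update_table` calls. It is not S4 Scan's implementation: the helper names are hypothetical, and it only handles an unpartitioned table (partitioned tables also store a location per partition, which this sketch does not touch).

```python
import copy

# Keys that Glue's TableInput structure accepts; get_table also returns
# read-only fields (CreateTime, DatabaseName, ...) that update_table rejects.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}

def repointed_table_input(table, new_location):
    """Build a TableInput from a get_table result with only the S3 location changed."""
    table_input = {
        key: copy.deepcopy(value)
        for key, value in table.items()
        if key in TABLE_INPUT_KEYS
    }
    table_input["StorageDescriptor"]["Location"] = new_location
    return table_input

def swap_location(glue, database, name, new_location):
    """Repoint a Glue table and return the previous location for rollback.

    `glue` is expected to behave like boto3.client("glue").
    """
    table = glue.get_table(DatabaseName=database, Name=name)["Table"]
    previous = table["StorageDescriptor"]["Location"]
    glue.update_table(
        DatabaseName=database,
        TableInput=repointed_table_input(table, new_location),
    )
    return previous
```

Rollback in this sketch is the same call in reverse: pass the returned previous location back to `swap_location`. No objects in S3 are moved or deleted by either call.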

Highlights

Safe by construction: optimized data is written to a shadow location, verified query-result-identical (values, nulls, decimal scale, timestamp tz), then swapped via the Glue catalog — with one-command rollback. Source data is never modified in place.

See the savings before you commit: a dry-run projects the monthly dollar reduction on your real tables and attributes it honestly to pruning, dictionary + zstd, and small-file compaction.

No lock-in, no babysitting: output is standard Athena-readable Parquet, and the AMI re-optimizes on an EventBridge schedule so savings compound automatically.

What's included

  • Query-history analysis: derives hot columns, frequent predicates, and partition/sort-key candidates from your past queries into an optimization plan.
  • Partition and row-group pruning: re-partitions on your predicate keys so whole files are skipped, and sorts on filter columns so min/max statistics let the engine skip non-matching row groups.
  • Dictionary/RLE encoding + zstd: low-cardinality columns are dictionary-encoded and recompressed with standard, Athena-readable zstd to cut the bytes read for the columns a query projects.
  • Small-file compaction: merges many tiny objects into fewer, right-sized files for I/O efficiency and better statistics.
  • Shadow → verify → swap with one-command rollback: the source is never modified in place, and the cutover is only a Glue pointer move.
  • Dry-run dollar projection: with no writes, it reports current vs. optimized scan bytes and the monthly delta, honestly attributing the saving to partition pruning / row-group pruning / dictionary+zstd / compaction.
  • EventBridge-scheduled re-optimization, with output that is standard Athena-readable Parquet — no lock-in.
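As one illustration of the small-file compaction item above, the sketch below groups object sizes into batches that each stay near a target output size. The greedy strategy, the function name, and the 128 MB target are assumptions chosen for the example, not a description of how S4 Scan sizes its files.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group files (in key order) so each group stays within target_bytes.

    A single file larger than the target ends up in a group of its own.
    """
    groups, current, current_size = [], [], 0
    for key, size in sorted(file_sizes.items()):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return groups

MB = 1024 * 1024
# Hypothetical object listing: key -> size in bytes
files = {
    "part-a.parquet": 60 * MB,
    "part-b.parquet": 50 * MB,
    "part-c.parquet": 30 * MB,
    "part-d.parquet": 200 * MB,
}
print(plan_compaction(files, target_bytes=128 * MB))
# [['part-a.parquet', 'part-b.parquet'], ['part-c.parquet'], ['part-d.parquet']]
```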

Use cases

High Athena/Redshift Spectrum scan bills you want to cut without rewriting any SQL.

Wide tables where queries read only a few columns and filter on a few predicates, yet still scan the whole file.

Data lakes fragmented into many small Parquet objects, hurting I/O efficiency and statistics.

Growing datasets that benefit from recurring, EventBridge-scheduled re-optimization.

FAQ

Could this corrupt my data or change my query results?

The source is never modified in place. Optimized data is written to a separate shadow location and verified result-identical — values, NULLs, decimal scale, timestamp timezone, nested types — before only the Glue pointer is moved. If verification fails it does not swap; it drops the shadow and leaves the source untouched. After a swap, one command restores the original location.
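To make "result-identical" concrete, here is a simplified sketch of a strict result-set comparison. It is an assumption about what such a check could look like, not the product's verifier: the function names are hypothetical, rows are flat tuples, and nested types are left out. The point is that plain equality is too lenient, since Python treats `Decimal("1.0")` and `Decimal("1.00")` as equal, and likewise the same instant in two time zones.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from decimal import Decimal

def canonical(value):
    """Map a cell to a hashable key that keeps distinctions plain == would lose."""
    if value is None:
        return ("null",)
    if isinstance(value, Decimal):
        # as_tuple() includes the exponent, so scale differences are visible
        return ("decimal", value.as_tuple())
    if isinstance(value, datetime):
        # keep wall-clock time and UTC offset separately
        return ("timestamp", value.replace(tzinfo=None), value.utcoffset())
    return (type(value).__name__, value)

def results_identical(source_rows, shadow_rows):
    """Order-insensitive comparison of two result sets as multisets of rows."""
    def as_multiset(rows):
        return Counter(tuple(canonical(cell) for cell in row) for row in rows)
    return as_multiset(source_rows) == as_multiset(shadow_rows)

utc_noon = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)
same_instant_plus1 = datetime(2024, 1, 1, 13, tzinfo=timezone(timedelta(hours=1)))

print(results_identical([(1, None), (2, "a")], [(2, "a"), (1, None)]))   # True
print(results_identical([(Decimal("1.0"),)], [(Decimal("1.00"),)]))      # False
print(results_identical([(utc_noon,)], [(same_instant_plus1,)]))         # False
```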

Am I locked into a proprietary format?

No. Output is only standard, Athena-readable Parquet (standard zstd, dictionary/RLE) with no proprietary codec. You can roll back at any time, and the optimized data reads with normal Athena tooling.

How does S4 Scan know what to optimize?

It learns from your Athena query history which columns are read and which predicates filter your tables, and from that derives the column ordering, partition keys, and sort keys it recommends. The plan is based on real access patterns, not guesswork.
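A toy version of that idea is sketched below. It assumes the query history has already been parsed into records listing the columns each query read and the columns it filtered on; that record shape, the field names, and the ranking rule are all hypothetical and much simpler than a real planner.

```python
from collections import Counter

def suggest_layout(query_log, max_sort_keys=2):
    """Rank columns by how many queries read them and filter on them."""
    read_counts, filter_counts = Counter(), Counter()
    for query in query_log:
        read_counts.update(set(query["columns_read"]))
        filter_counts.update(set(query["filter_columns"]))

    def ranked(counts):
        # most frequent first; ties broken by name so the output is deterministic
        return [col for col, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

    return {
        "column_order": ranked(read_counts),
        "sort_keys": ranked(filter_counts)[:max_sort_keys],
    }

# Hypothetical pre-parsed history for one table
history = [
    {"columns_read": ["user_id", "amount"], "filter_columns": ["event_date"]},
    {"columns_read": ["user_id"], "filter_columns": ["event_date", "country"]},
    {"columns_read": ["user_id", "country"], "filter_columns": ["event_date"]},
]
plan = suggest_layout(history)
print(plan["column_order"])  # ['user_id', 'amount', 'country']
print(plan["sort_keys"])     # ['event_date', 'country']
```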

Can I see how much I'll save before committing?

Yes. A dry-run reports current vs. optimized scan bytes and the monthly dollar delta without writing anything, attributing the reduction honestly across partition/row-group pruning, dictionary+zstd, and compaction. The 30–70% headline is a range that depends on how mismatched your layout is — not a guarantee — so start with a dry-run on your own table.

Where does it run, and does my data leave my account?

It runs as an AMI inside your own AWS account and VPC and accesses only your S3, Glue, and Athena, so your data never leaves your account. It operates hands-off on an EventBridge schedule under a least-privilege IAM role.

Pricing model

Hourly software fee + EC2 (t3 / m5 class), with a pay-per-GB-processed offer also available. Annual option available.
