Storage & data

S4 Scan

Athena scan-cost reducer

30–70% off scan cost

AWS service it replaces: Athena / Redshift Spectrum scan cost

When does this pay back?

Example: $5,000/month in Athena scan cost

At $5,000/month in Athena scan cost, a 30–70% reduction is roughly $1,500–$3,500/month. This is the gross avoided usage charge: it does not yet subtract the S4 software fee or the EC2 cost, and the actual reduction varies by workload.
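The arithmetic behind that range can be sketched in a few lines. The function name and its defaults are illustrative, not part of the product; the 30% and 70% bounds are the headline range quoted above, and the result is gross, before any S4 fee or EC2 cost.

```python
def gross_savings_range(monthly_scan_spend, low=0.30, high=0.70):
    """Return (low, high) gross monthly savings in dollars.

    Gross means before the S4 software fee and EC2 cost are subtracted.
    The 30% and 70% defaults are the headline range, not a guarantee.
    """
    return (round(monthly_scan_spend * low, 2), round(monthly_scan_spend * high, 2))


low, high = gross_savings_range(5_000)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

Swap in your own monthly Athena or Redshift Spectrum scan spend to get a first rough figure.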

Cut your AWS Athena and Redshift Spectrum scan bill 30–70% by re-optimizing your S3 data-lake Parquet to the layout your queries actually read — partition and row-group pruning, dictionary encoding, zstd, and small-file compaction — applied through a safe shadow-then-swap that never changes query results.

S4 Scan learns from your Athena query history which columns are read and which predicates filter your tables, then rewrites the underlying S3 Parquet into the physical layout those queries want: hot columns first, rows sorted on the columns you filter by so row-group statistics can prune, low-cardinality columns dictionary-encoded, files compacted to an efficient size, and everything compressed with zstd. Athena and Redshift Spectrum bill per terabyte scanned from S3, so a smaller, better-pruned layout is a direct, recurring bill reduction. Safety is the core of the product: your source is never overwritten.
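Why sorting on a filter column helps can be seen in a toy simulation. This is a sketch of min/max row-group pruning in general, not S4 Scan's code: the data is random, the row-group size of 1,000 rows is arbitrary, and the function names are made up for illustration.

```python
import random


def row_group_stats(values, rows_per_group):
    """Split values into row groups and record each group's (min, max)."""
    groups = [values[i:i + rows_per_group] for i in range(0, len(values), rows_per_group)]
    return [(min(g), max(g)) for g in groups]


def groups_to_read(stats, low, high):
    """Count row groups whose [min, max] range overlaps the filter low <= x <= high."""
    return sum(1 for g_min, g_max in stats if g_max >= low and g_min <= high)


random.seed(0)
values = [random.randrange(1000) for _ in range(10_000)]

# Same data, same filter (100 <= x <= 149); only the physical order differs.
unsorted_reads = groups_to_read(row_group_stats(values, 1000), 100, 149)
sorted_reads = groups_to_read(row_group_stats(sorted(values), 1000), 100, 149)
print(f"unsorted: {unsorted_reads} of 10 row groups, sorted: {sorted_reads} of 10")
```

In the unsorted layout every row group spans almost the full value range, so none can be ruled out from its statistics. Once the rows are sorted, the matching values sit in one or two adjacent groups and the rest are skipped.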

The problem

Athena and Redshift Spectrum bill on the bytes read from S3 ($5/TB by default; the rate varies by region). When your Parquet's physical layout doesn't match how queries access it, each query scans far more data than it needs, and you pay for that gap every month. Because the bill is a function of bytes scanned, the layout itself sets your cost.
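As a rough illustration of that gap, the sketch below prices a single query at the $5/TB rate. The table size and the one-day-in-a-hundred filter are hypothetical numbers chosen for the example, and the terabyte is taken as 10^12 bytes, which is an assumption; check how your own bill defines the unit.

```python
TB = 10**12  # assumption: decimal terabyte; use 2**40 if your bill uses binary units


def scan_cost_usd(bytes_scanned, price_per_tb=5.0):
    """Cost of one query given the bytes it scans from S3."""
    return bytes_scanned / TB * price_per_tb


# Hypothetical query: its columns total 2 TB, and the filter matches 1 day out of 100.
unpruned = scan_cost_usd(2 * TB)          # layout gives the engine nothing to skip
pruned = scan_cost_usd(2 * TB // 100)     # layout lets the engine read only that day
print(f"unpruned: ${unpruned:.2f}, pruned: ${pruned:.2f} per query")
```

The query returns the same rows either way; only the bytes read, and therefore the charge, differ.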

How it works

1. Learn from query history

It reads your Glue catalog and Athena query history (workgroup metrics, CloudTrail) to learn which columns your queries actually read from each table and which predicates filter it.

2. Write to shadow, then verify

It writes the optimized Parquet to a separate shadow location and verifies the query results are identical down to values, NULLs, decimal scale, and timestamp timezone.

3. Swap via Glue, with rollback

Only after verification passes does it repoint the Glue table location (a catalog pointer move, not a data move), reversible with a single rollback command.
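The pointer move in the last step can be sketched against the Glue API. This is a minimal illustration, not S4 Scan's implementation: `repointed_table_input` is a hypothetical helper, the sample table is made up, and it only covers an unpartitioned table (partitions carry their own locations). The one Glue-specific fact it relies on is that `get_table` returns read-only fields that `update_table` does not accept in `TableInput`, so they are filtered out.

```python
import copy

# Keys that Glue's UpdateTable accepts in TableInput. get_table also returns
# read-only fields (DatabaseName, CreateTime, ...) that must be dropped first.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def repointed_table_input(table, new_location):
    """Build a TableInput identical to `table` except for its S3 location."""
    table_input = {k: copy.deepcopy(v) for k, v in table.items() if k in TABLE_INPUT_KEYS}
    table_input["StorageDescriptor"]["Location"] = new_location
    return table_input


# Made-up table, shaped like the "Table" value returned by glue.get_table.
table = {
    "Name": "events",
    "DatabaseName": "lake",
    "StorageDescriptor": {"Location": "s3://example-bucket/events/", "Columns": []},
    "Parameters": {"classification": "parquet"},
}
swap = repointed_table_input(table, "s3://example-bucket/events_shadow/")
rollback = repointed_table_input(table, table["StorageDescriptor"]["Location"])

# With boto3 (not executed here):
#   glue = boto3.client("glue")
#   glue.update_table(DatabaseName="lake", TableInput=swap)      # cutover
#   glue.update_table(DatabaseName="lake", TableInput=rollback)  # undo
```

Because neither call touches the Parquet objects themselves, the cutover and the rollback are both metadata-only changes.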

Highlights

Safe by construction: optimized data is written to a shadow location, verified query-result-identical (values, nulls, decimal scale, timestamp tz), then swapped via the Glue catalog — with one-command rollback. Source data is never modified in place.

See the savings before you commit: a dry-run projects the monthly dollar reduction on your real tables and attributes it honestly to pruning, dictionary + zstd, and small-file compaction.

No lock-in, no babysitting: output is standard Athena-readable Parquet, and the AMI re-optimizes on an EventBridge schedule so savings compound automatically.

What's included

Query-history analysis: derives hot columns, frequent predicates, and partition/sort-key candidates from your past queries into an optimization plan.
Partition and row-group pruning: re-partitions on your predicate keys so whole files can be skipped, and sorts on filter columns so min/max statistics let the reader skip non-matching row groups.
Dictionary/RLE encoding + zstd: low-cardinality columns are dictionary-encoded and recompressed with standard, Athena-readable zstd to cut projected read bytes.
Small-file compaction: merges many tiny objects into fewer, right-sized files for I/O efficiency and better statistics.
Shadow → verify → swap with one-command rollback: the source is never modified in place, and the cutover is only a Glue pointer move.
Dry-run dollar projection: with no writes, it reports current vs. optimized scan bytes and the monthly delta, honestly attributing the saving to partition pruning / row-group pruning / dictionary+zstd / compaction.
EventBridge-scheduled re-optimization, with output that is standard Athena-readable Parquet — no lock-in.
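To make the small-file compaction item concrete, here is a minimal planning sketch. It is an assumption-laden illustration rather than the product's algorithm: the greedy grouping, the 128 MiB target, and the file names are all hypothetical.

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024**2):
    """Greedily group files (name -> size in bytes) into batches of about target_bytes.

    Each batch would be rewritten as one larger Parquet file.
    """
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Ten hypothetical 32 MiB objects.
small_files = {f"part-{i:02d}.parquet": 32 * 1024**2 for i in range(10)}
plan = plan_compaction(small_files)
print([len(batch) for batch in plan])
```

Fewer, larger files mean fewer S3 requests per query and row groups large enough for their statistics to be useful.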

Use cases

High Athena/Redshift Spectrum scan bills you want to cut without rewriting any SQL.

Wide tables where queries read only a few columns and filter on a few predicates, yet still scan the whole file.

Data lakes fragmented into many small Parquet objects, hurting I/O efficiency and statistics.

Growing datasets that benefit from recurring, EventBridge-scheduled re-optimization.

FAQ

Could this corrupt my data or change my query results?

The source is never modified in place. Optimized data is written to a separate shadow location and verified result-identical — values, NULLs, decimal scale, timestamp timezone, nested types — before only the Glue pointer is moved. If verification fails it does not swap; it drops the shadow and leaves the source untouched. After a swap, one command restores the original location.
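A result-identity check of this kind has to be stricter than ordinary equality, because some values compare equal while differing in exactly the properties listed above. The sketch below shows one way to do it for scalar values in Python; it is an illustration under that assumption, not the product's verifier, and it does not handle nested types.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from decimal import Decimal


def cell_key(value):
    """Map a cell to a key that also captures NULL-ness, decimal scale, and UTC offset."""
    if value is None:
        return ("null",)
    if isinstance(value, Decimal):
        return ("decimal", value.as_tuple())  # digits and exponent, so 1.0 != 1.00
    if isinstance(value, datetime):
        return ("timestamp", value.replace(tzinfo=None), value.utcoffset())
    return (type(value).__name__, value)


def row_bag(rows):
    """Multiset of rows, since result order is not guaranteed without ORDER BY."""
    return Counter(tuple(cell_key(v) for v in row) for row in rows)


def results_identical(source_rows, shadow_rows):
    return row_bag(source_rows) == row_bag(shadow_rows)


utc_noon = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)
same_instant_other_tz = utc_noon.astimezone(timezone(timedelta(hours=9)))

print(results_identical([(1, None)], [(1, None)]))
print(results_identical([(Decimal("1.0"),)], [(Decimal("1.00"),)]))
print(results_identical([(utc_noon,)], [(same_instant_other_tz,)]))
```

The second and third comparisons are the interesting ones: plain `==` would call both pairs equal, while this check flags the changed decimal scale and the changed timezone.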

Am I locked into a proprietary format?

No. Output is only standard, Athena-readable Parquet (standard zstd, dictionary/RLE) with no proprietary codec. You can roll back at any time, and the optimized data reads with normal Athena tooling.

How does S4 Scan know what to optimize?

It learns from your Athena query history — which columns are read and which predicates filter your tables — and derives the column ordering, partition keys, and sort keys to recommend. It is based on real access patterns, not guesswork.
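The idea of deriving candidates from access patterns can be sketched as a simple frequency count. Everything here is hypothetical: the `query_log` records stand in for already-parsed query history, and the real analysis is certainly more involved than counting.

```python
from collections import Counter


def rank_columns(query_log):
    """Rank columns by how many queries read them and how many filter on them."""
    read_counts, filter_counts = Counter(), Counter()
    for query in query_log:
        # sorted(set(...)) counts each column once per query, in a stable order.
        read_counts.update(sorted(set(query["selected"]) | set(query["filtered"])))
        filter_counts.update(sorted(set(query["filtered"])))
    return read_counts.most_common(), filter_counts.most_common()


# Hypothetical parsed history: columns each query selects and filters on.
query_log = [
    {"selected": ["user_id", "amount"], "filtered": ["event_date"]},
    {"selected": ["amount"], "filtered": ["event_date", "country"]},
    {"selected": ["user_id"], "filtered": ["event_date"]},
]
hot_columns, filter_columns = rank_columns(query_log)
print("most-filtered column:", filter_columns[0])
```

In this toy log `event_date` is filtered by every query, which would make it the first partition or sort-key candidate.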

Can I see how much I'll save before committing?

Yes. A dry-run reports current vs. optimized scan bytes and the monthly dollar delta without writing anything, attributing the reduction honestly across partition/row-group pruning, dictionary+zstd, and compaction. The 30–70% headline is a range that depends on how mismatched your layout is — not a guarantee — so start with a dry-run on your own table.

Where does it run, and does my data leave my account?

It runs as an AMI inside your own AWS account and VPC, accessing only your S3, Glue, and Athena — your data never leaves. It operates hands-off on an EventBridge schedule under a least-privilege IAM role.

Why it's cheaper

Assumes 100 TB / month scanned by Athena in us-east-1, with S4 Scan reducing scanned bytes by ~60% via physical-layout tuning.

Without S4 Scan (vanilla Parquet)

Query (SELECT …) → Athena → 100 TB scanned from the S3 bucket

Athena scan (100 TB × $5/TB): $500 / mo
Monthly total: $500 / mo

With S4 Scan

Query (SELECT …) → Athena → 40 TB scanned from the S3 bucket (layout-tuned via Glue by S4 Scan on an m5.large)

Athena scan (40 TB × $5/TB): $200 / mo
S4 Scan instance (m5.large + software fee): $100 / mo
Monthly total: $300 / mo

−40% vs. plain Athena

Sizing the S4 Scan instance by scan volume

Scan volume	Recommended instance	S4 Scan instance cost	Monthly total with S4 Scan (vs. plain Athena)
~10 TB / mo	t3.medium	$45 / mo	$65 / mo (Athena $50, +30%)
~100 TB / mo	m5.large	$100 / mo	$300 / mo (Athena $500, −40%)
~1 PB / mo	m5.2xlarge	$310 / mo	$2,310 / mo (Athena $5,000, −54%)

Illustrative example. Athena uses us-east-1 list pricing ($5/TB scanned). The 60% scan reduction depends on query mix and data layout; 30–70% is the typical range. At small scan volumes the S4 Scan instance cost can exceed the Athena savings, so this product is most useful past ~50 TB/month.
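The sizing table can be recomputed, and a break-even volume derived, from the same assumptions: $5/TB, a 60% scan reduction, and the listed instance costs. The function names are illustrative; the inputs are the figures from the table.

```python
PRICE_PER_TB = 5.0  # us-east-1 list price used in the table
REDUCTION = 0.60    # illustrative scan reduction; the typical range is 30-70%


def monthly_totals(scan_tb, instance_cost):
    """Return (plain Athena cost, total with S4 Scan, relative change)."""
    plain = scan_tb * PRICE_PER_TB
    with_s4 = plain * (1 - REDUCTION) + instance_cost
    return plain, with_s4, (with_s4 - plain) / plain


def break_even_tb(instance_cost):
    """Monthly scan volume at which the scan savings equal the instance cost."""
    return instance_cost / (PRICE_PER_TB * REDUCTION)


for scan_tb, instance, cost in [(10, "t3.medium", 45), (100, "m5.large", 100), (1000, "m5.2xlarge", 310)]:
    plain, with_s4, change = monthly_totals(scan_tb, cost)
    print(f"{scan_tb:>5} TB  {instance:<10}  ${with_s4:,.0f} vs ${plain:,.0f}  ({change:+.0%})  "
          f"break-even at {break_even_tb(cost):.0f} TB/month")
```

Under these assumptions the break-even falls at about 15 TB/month for the t3.medium and about 33 TB/month for the m5.large, which is why the 10 TB row comes out more expensive than plain Athena.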

Pricing model

Hourly software fee + EC2 (t3 / m5 class), with a pay-per-GB-processed offer also available. Annual option available.

Get it on AWS Marketplace

Other S4 products

Storage & data

S4 — Squished S3

Transparent GPU S3-compression gateway

50–80% fewer storage bytes

Replaces: Amazon S3 storage

Observability

S4 Logs

Archive CloudWatch Logs to zstd S3

70–90% off CloudWatch Logs

Replaces: Amazon CloudWatch Logs

Observability

S4 Metrics

Govern CloudWatch metric cardinality

Tame metric cardinality cost

Replaces: CloudWatch custom metrics
