The pipeline was ingesting a device telemetry feed that landed one small JSON file per event, every few seconds, straight into the bucket Snowpipe was watching. It worked, in the sense that data showed up in the table. It also quietly generated more Snowpipe file-processing overhead than the actual bytes loaded ever justified, because Snowpipe bills a per-file overhead on top of the data volume — a fixed charge per thousand files processed, on top of the credit cost per GB — and nobody had connected "why is this so much more expensive than the batch load it replaced" to "we're loading forty thousand tiny files a day instead of forty right-sized ones." Fixing it wasn't a Snowflake-side change at all — it was inserting a file-preparation stage in front of Snowpipe, on the AWS side, before any of those files ever reached S3 in their final form.
This is that fix, generalized: why file size drives both Snowpipe's cost and its load latency, a Lambda-based pattern for combining small source files into right-sized batches (and splitting oversized ones), S3 key hierarchy and SNS filtering for a clean auto-ingest pipeline, error auto-validation, data quality reporting via Snowflake event tables, and the cost-control queries that catch this problem before the invoice does.
Why does file size matter this much for Snowpipe specifically?
Snowflake's own guidance is explicit and worth internalizing before anything else here: the best cost-to-performance ratio comes from files in the 100–250MB compressed range — large enough that per-file overhead becomes immaterial relative to the data volume, small enough to parallelize the load across a warehouse's threads rather than bottlenecking on one enormous file. Below that range, the fixed per-file overhead — billed at roughly 0.06 credits per 1,000 files processed, on top of the data volume charge — starts to dominate the actual cost of loading, because you're paying that per-file tax on a file that barely has any data in it. Above roughly 5GB, you lose the parallelization and error-isolation benefits smaller files give you, and a single bad row can force reprocessing a much larger chunk of data than it should.
The practical consequence: a source system that naturally produces many small files is not a Snowpipe problem to solve with Snowpipe configuration — it's a file-preparation problem to solve before those files ever land in the bucket Snowpipe is watching. That's the gap a Lambda-based combining stage fills.
How does a Lambda-based file-combining stage actually work?
The pattern: small source files land in a staging prefix Snowpipe never watches directly. A Lambda function — triggered either on a schedule (every N minutes) or once an accumulation threshold is crossed (a batch of roughly 100–250 small files queued, or an equivalent accumulated byte count) — reads that batch, concatenates or repartitions it into one or a small number of files landing in the target 100–250MB compressed range**, and writes the result to the production prefix Snowpipe's S3 event notification actually watches. This decouples the source system's natural file-arrival cadence (which you usually don't control) from the file size Snowpipe actually wants to see (which you do control), and it's the same principle in reverse for the opposite problem — a source occasionally producing one oversized file gets split by the same Lambda stage into several files inside the target range, rather than forcing Snowpipe to load one multi-gigabyte file that can't parallelize well.
# Simplified shape of the combining Lambda — triggered on a schedule,
# reads a batch of small staged files, writes right-sized output(s)
import boto3, gzip, io
s3 = boto3.client("s3")
STAGING_PREFIX = "staging/raw/"
TARGET_PREFIX = "production/events/"
TARGET_SIZE_BYTES = 200 * 1024 * 1024 # aim for the 100-250MB sweet spot
def handler(event, context):
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=STAGING_PREFIX)["Contents"]
buffer, buffer_size, batch_id = io.BytesIO(), 0, 0
for obj in objects:
body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
buffer.write(body)
buffer_size += len(body)
if buffer_size >= TARGET_SIZE_BYTES:
flush(buffer, batch_id); buffer, buffer_size = io.BytesIO(), 0
batch_id += 1
if buffer_size > 0:
flush(buffer, batch_id)
# move or delete originals from staging only after successful flush
def flush(buffer, batch_id):
key = f"{TARGET_PREFIX}part-{batch_id:04d}.json.gz"
s3.put_object(Bucket=BUCKET, Key=key, Body=gzip.compress(buffer.getvalue()))
A file-hierarchy detail that pays for itself later: partition the target prefix by ingestion time (production/events/dt=2026-05-21/hr=14/part-0001.json.gz), not just a flat drop folder. This costs nothing extra at write time and makes later reprocessing, backfill, and any downstream Iceberg or Hive-style partitioning trivial — reprocessing "just Tuesday's 2pm hour" is a prefix filter instead of a file-by-file timestamp scan.
How does SNS filtering keep Snowpipe watching only the right files?
The standard AWS auto-ingest flow is S3 → SNS → SQS → Snowpipe — an S3 event notification publishes to an SNS topic, Snowflake's own SQS queue is subscribed to that topic, and Snowpipe picks up the notification and loads the referenced file. The reason to route through SNS rather than wiring S3 directly to Snowflake's SQS queue is filtering: SNS subscription filter policies let you route only notifications matching specific message attributes to a given subscriber, which matters the moment one bucket serves more than one purpose — you don't want Snowpipe's queue receiving (and paying processing overhead for) every object-created event in the bucket, only the ones in the production prefix that are actually meant for it.
The catch worth knowing before it costs you a debugging afternoon: S3 event notifications don't include message attributes by default, and SNS filter policies filter on message attributes — so a raw S3-to-SNS-to-SQS wire with no intermediate step can't actually apply a meaningful filter policy on S3's own event payload. The standard fix is a thin Lambda between S3 and SNS that inspects the event (key prefix, suffix, size) and republishes to SNS with the message attributes a filter policy can then match on — or, more simply where it's sufficient, using S3's own native prefix/suffix event notification filtering to scope which events get published to SNS in the first place, before SNS-level attribute filtering is even needed.
graph LR
SRC["Source system
(many small files)"] --> STAGE["S3 staging prefix"]
STAGE --> LAM["Combining Lambda
(schedule or count trigger)"]
LAM --> PROD["S3 production prefix
(100-250MB files,
time-partitioned keys)"]
PROD -->|"S3 prefix/suffix filter"| SNS["SNS topic"]
SNS -->|"subscription filter policy"| SQS["Snowflake SQS queue"]
SQS --> SNOWPIPE["Snowpipe auto-ingest"]
SNOWPIPE --> TABLE["Target table"]
The full pipeline: small source files never touch the prefix Snowpipe watches. A combining Lambda produces right-sized, time-partitioned output in a production prefix, S3's own notification filtering (and/or an SNS filter policy) scopes exactly which events reach Snowflake's queue, and Snowpipe only ever sees files sized for efficient loading.
How do you get error auto-validation and a real DQ report out of this?
Snowpipe's COPY INTO under the hood supports VALIDATION_MODE, which validates a file's rows without loading them — RETURN_ALL_ERRORS surfaces every row-level problem in a file rather than stopping at the first one, which is the mode to run as a pre-flight check on a sample or on files routed to a suspect path, rather than discovering the same errors one at a time in production. For files that do load, COPY_HISTORY is the table function that reports the first error Snowflake hit per file during an actual load — the standard first place to look when a load partially failed.
For a genuine, ongoing DQ report rather than a one-off troubleshooting query, the pattern I'd build is a scheduled task that queries COPY_HISTORY and the pipe's own error notifications, and writes a daily summary — files attempted, files loaded clean, files with errors, error categories — into a reporting table, with the DQ dimensions (completeness of the day's expected file count, validity of row-level parsing) as the actual columns being tracked, not just a raw error dump nobody reads. Snowflake's event tables pair well with this for the operational-logging half of the story specifically: a Python UDF or stored procedure step in the pipeline can emit structured log and trace events into an event table, which gives you a queryable, structured record of what the pipeline actually did — file counts, batch sizes, timing — that's a genuinely different, complementary signal from the load-outcome data COPY_HISTORY reports.
-- Pre-flight validation before committing to a real load
COPY INTO staging.events
FROM @production_stage/dt=2026-05-21/
FILE_FORMAT = (TYPE = JSON)
VALIDATION_MODE = RETURN_ALL_ERRORS;
-- The DQ report's actual source of truth
SELECT
file_name,
status,
row_count,
error_count,
first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
TABLE_NAME => 'staging.events',
START_TIME => DATEADD(hours, -24, CURRENT_TIMESTAMP())
));
How do you actually keep Snowpipe cost under control?
The account-level source of truth is PIPE_USAGE_HISTORY, which reports credits billed and files processed per pipe going back a full year — the first query to run isn't "how much did Snowpipe cost this month," it's "which pipe is processing far more files than its data volume justifies," because that ratio is the direct signature of the tiny-files problem this whole article is about. A pipe with a high file count and a low average bytes-per-file is telling you, in the billing data itself, that a combining stage upstream would pay for itself.
| Signal | What it means | Fix |
|---|---|---|
| High file count, low avg. file size in PIPE_USAGE_HISTORY | Per-file overhead dominating cost | Add a Lambda combining stage before the production prefix |
| COPY_HISTORY showing partial failures clustered in one path | A specific upstream source producing malformed files | Route that source's staging prefix through pre-flight VALIDATION_MODE |
| Snowpipe latency spikes on a handful of files | Oversized files defeating parallelization | Split via the same Lambda stage before the production prefix |
A Lambda that writes its combined output back into a prefix its own trigger is scoped to watch is a self-inflicted infinite loop, and I've seen it happen from an S3 event filter that was one character too broad. The combining Lambda's own S3 trigger must never overlap the prefix it writes its output to — staging and production have to be genuinely separate prefixes (or separate buckets), with the trigger's prefix/suffix filter scoped tightly enough that a small typo in the filter pattern can't accidentally re-trigger the same function on its own output. Test the trigger scope explicitly, not just the Lambda's business logic, before this goes anywhere near a production bucket — an infinite-loop Lambda invocation bill is a much worse Friday than a tiny-files Snowpipe bill.
What to carry away
Snowpipe's cost and performance model rewards files in the 100–250MB compressed range specifically because per-file overhead and load parallelization both hinge on file size — a source system that naturally produces many small files needs a preparation stage before Snowpipe, not a Snowpipe-side workaround. A scheduled or threshold-triggered Lambda that combines (or splits) files into that target range, writing to a time-partitioned production prefix, is the standard fix, and pairing S3's native prefix/suffix filtering with SNS subscription filter policies keeps Snowpipe's queue receiving only the events actually meant for it.
Build error handling in two layers — VALIDATION_MODE as a pre-flight check, COPY_HISTORY as the load-outcome record — and turn both into an ongoing DQ report rather than one-off troubleshooting queries, with event tables covering the structured operational-logging side the load-history views don't. And treat PIPE_USAGE_HISTORY's file-count-versus-data-volume ratio as the direct, queryable signal for exactly the problem this article solves — if that ratio looks wrong, a combining Lambda upstream is the fix, not a bigger warehouse or a Snowflake support ticket.