Snowflake and the Data Lake: Building on Iceberg Tables with AWS Glue as the Catalog

"Do we have to copy all of it into Snowflake?" is the question that starts almost every one of these projects, and the answer that surprises people is no — not if the data is already sitting in S3 as Iceberg tables cataloged by Glue. Snowflake can query that data in place, govern it, and join it against native Snowflake tables, without a load step and without Snowflake becoming the system of record. That's a genuinely different architecture from "migrate the lake into the warehouse," and it comes with its own setup decisions, performance characteristics, and failure modes that a straight migration doesn't have.

This is that architecture end to end: the managed-versus-unmanaged Iceberg table decision that everything else follows from, configuring the Glue catalog integration (and the newer Iceberg REST path), external volume design, the refresh and performance best practices that actually matter at scale, and what I'd do differently after running this in production. For the higher-level "which catalog should be authoritative" decision this assumes, see Horizon vs Open Catalog vs Glue Data Catalog — this article is the practical build guide for the Glue-as-catalog path specifically.

What's the fundamental choice — managed or unmanaged Iceberg tables?

Every Snowflake Iceberg table is either Snowflake-managed (Snowflake is the Iceberg catalog and owns commits, compaction, and metadata) or externally managed, also called unmanaged (a different system such as Glue, Polaris, or another engine is the catalog of record, and Snowflake reads and optionally writes through a catalog integration pointed at it). "Snowflake and the data lake" as an architecture pattern is specifically the unmanaged case: the data lake already exists, Glue already catalogs it, other engines (Athena, EMR, Spark) already depend on it, and the goal is to add Snowflake as a consumer without disturbing any of that. As of 2025, that consumer is increasingly a governed, queryable participant as well.

A catalog integration is the object that names the external catalog and how to talk to it; a single catalog integration can back many tables that share the same external catalog. Layered on top, an external volume — an account-level Snowflake object holding a generated IAM entity — specifies where the table's Parquet data and Iceberg metadata physically live, and handles the storage credentials so nobody hand-manages an access key. The clean mental model: catalog integration answers "who tells Snowflake what tables and schemas exist," external volume answers "where does Snowflake actually go to read the bytes."

How do you actually configure the Glue integration — and REST or classic API?

Snowflake now supports two ways to talk to Glue: the original Glue API-based catalog integration, and the newer Iceberg REST path that talks to Glue's Iceberg REST endpoint — the same endpoint that made Glue a peer catalog in the REST-catalog landscape (see the catalog comparison for why that endpoint matters beyond just this integration). The REST path is the more future-proof choice specifically because it's the same protocol Polaris, Unity Catalog OSS, and other REST-compliant catalogs speak — building the integration against the REST spec rather than Glue's proprietary API means less rework if the authoritative catalog ever needs to move.
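
For contrast with the REST-based setup that follows, here is a minimal sketch of what the classic Glue API integration looks like. The integration name `glue_classic_catalog` is hypothetical, and the account ID, region, and role ARN are placeholders reused from this article's examples; check the parameter list against the current Snowflake documentation before relying on it.

```sql
-- Sketch of the classic (Glue API) catalog integration, for comparison only.
-- glue_classic_catalog is a hypothetical name; the IDs and ARN are placeholders.
CREATE CATALOG INTEGRATION glue_classic_catalog
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = 'analytics_db'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-catalog-reader'
  GLUE_CATALOG_ID = '123456789012'
  GLUE_REGION = 'us-east-1'
  ENABLED = TRUE;
```

The Glue-specific parameters (`GLUE_AWS_ROLE_ARN`, `GLUE_CATALOG_ID`, `GLUE_REGION`) are the part that would need rewriting if the catalog of record ever moved off Glue, which is the rework the REST path avoids.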

-- External volume: where the Iceberg data and metadata physically live
CREATE EXTERNAL VOLUME lake_ext_vol
  STORAGE_LOCATIONS = ((
    NAME = 'lake-s3'
    STORAGE_PROVIDER = 'S3'
    STORAGE_BASE_URL = 's3://data-lake-bucket/warehouse/'
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-catalog-reader'
  ));

-- Catalog integration: how Snowflake talks to the Glue Iceberg REST endpoint
CREATE CATALOG INTEGRATION glue_rest_catalog
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'analytics_db'
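  REST_CONFIG = (
    CATALOG_URI = 'https://glue.us-east-1.amazonaws.com/iceberg'
    -- Tells Snowflake the REST endpoint is AWS Glue (the default is a public REST API)
    CATALOG_API_TYPE = AWS_GLUE
    -- For Glue, the catalog name is the AWS account ID
    CATALOG_NAME = '123456789012'
  )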
  REST_AUTHENTICATION = (
    TYPE = SIGV4
    SIGV4_IAM_ROLE = 'arn:aws:iam::123456789012:role/snowflake-glue-catalog-reader'
  )
  ENABLED = TRUE;

-- The unmanaged Iceberg table itself, pointed at the existing Glue table
CREATE ICEBERG TABLE events
  CATALOG = 'glue_rest_catalog'
  EXTERNAL_VOLUME = 'lake_ext_vol'
  CATALOG_TABLE_NAME = 'events';
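
Once the three objects exist, a quick smoke test can catch permission and endpoint problems before anyone builds on the table. This is a sketch that assumes the object names used above and that both system functions are available in your account.

```sql
-- Check that Snowflake can reach the S3 location through the external volume.
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('lake_ext_vol');

-- Check that Snowflake can authenticate to and talk to the REST catalog.
SELECT SYSTEM$VERIFY_CATALOG_INTEGRATION('glue_rest_catalog');

-- Confirm the table resolves and returns rows from the existing data files.
SELECT COUNT(*) FROM events;
```

If the count disagrees with what Athena reports for the same table, compare snapshots before suspecting the integration itself.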

The IAM role Snowflake assumes is the piece worth getting right the first time, and Snowflake's own documented best practice is a dedicated policy scoped specifically to catalog read (and write, if bidirectional) access — created new, not borrowed from an existing broad Glue or S3 role that happens to already have the right permissions. That's not caution for its own sake: a scoped role is the difference between a clean answer and an uncomfortable one in a security review six months later, when someone asks exactly what Snowflake's integration can touch beyond the tables it's supposed to.
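
Scoping the role also means scoping its trust policy to the specific IAM identity Snowflake generates for each object. A sketch of how to retrieve those values, assuming the object names above; the exact property names in the output differ between external volumes and catalog integration types, so read them from the result rather than assuming them.

```sql
-- Lists the volume's properties, including the Snowflake-generated IAM user
-- and external ID to put in the trust policy of the STORAGE_AWS_ROLE_ARN role.
DESC EXTERNAL VOLUME lake_ext_vol;

-- Same idea for the catalog side: the IAM user and external ID that must be
-- trusted by the role named in SIGV4_IAM_ROLE.
DESCRIBE CATALOG INTEGRATION glue_rest_catalog;
```

Restricting the trust relationship to that IAM user plus the external ID keeps the role usable only by this integration, which is the answer a security review is looking for.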

What actually determines query performance on unmanaged Iceberg tables?

Two levers matter more than people expect going in. The first is table creation and initial scan cost. Pointing Snowflake at an existing Iceberg table with a large number of underlying data files means Snowflake has to scan those files to build its view of the table, and that scan parallelizes across the warehouse: a larger warehouse scans more files concurrently, which shortens both the initial creation and each subsequent refresh. That is a real, documented lever, not "throw compute at it and hope." The second is refresh cadence. Because Snowflake isn't the catalog of record, it has to periodically re-check the external catalog for new snapshots, so the gap between an external write and Snowflake seeing it depends on refresh configuration; nothing makes it instantaneous by default. Snowflake's guidance is explicit: configure frequent refreshes on externally cataloged tables to avoid serving stale data. Treat that as a setting to tune against your actual write frequency, not a default to leave alone and rediscover as a bug later.
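
Both levers map onto a handful of statements. The following is a sketch, not a tuned configuration: the warehouse name `lake_wh` is hypothetical, the sizes and the 60-second interval are illustrative, and it assumes automated refresh is supported for your catalog integration type.

```sql
-- Lever 1: size up before a large initial create or a manual refresh.
-- lake_wh is a hypothetical warehouse name.
ALTER WAREHOUSE lake_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Pull the latest snapshot from the external catalog on demand.
ALTER ICEBERG TABLE events REFRESH;

-- Size back down once the heavy scan is done.
ALTER WAREHOUSE lake_wh SET WAREHOUSE_SIZE = 'SMALL';

-- Lever 2: let Snowflake poll the external catalog for new snapshots.
ALTER ICEBERG TABLE events SET AUTO_REFRESH = TRUE;

-- The polling interval is set on the catalog integration; 60 is illustrative.
ALTER CATALOG INTEGRATION glue_rest_catalog SET REFRESH_INTERVAL_SECONDS = 60;

-- Inspect the current state of automated refresh for the table.
SELECT SYSTEM$AUTO_REFRESH_STATUS('events');
```

Choose the interval from how often the Glue-side writers actually commit: polling far more often than data changes buys nothing, and polling less often widens the window in which Snowflake and Athena can disagree.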

graph LR
    ETL["Glue ETL job<br/>writes Iceberg table"] --> GLUE["Glue Data Catalog<br/>(catalog of record)"]
    GLUE -->|"Iceberg REST endpoint"| CI["Snowflake catalog integration"]
    CI --> EV["External volume<br/>(S3 credentials)"]
    CI -->|"periodic refresh"| SF["Unmanaged Iceberg table<br/>in Snowflake"]
    ATHENA["Athena / EMR / Spark"] -->|"also read/write"| GLUE

The unmanaged-table data flow: Glue stays the catalog of record, other AWS-native engines keep reading and writing exactly as before, and Snowflake joins as a consumer through the catalog integration — with refresh cadence, not a load job, determining how current Snowflake's view actually is.

How do you convert an existing S3 data lake to Snowflake Iceberg tables without disrupting Glue?

The migration pattern Snowflake and AWS both document, and the one I'd default to, doesn't touch the existing files or the existing Glue registrations at all. It registers the existing Iceberg metadata (already produced by whatever wrote the table originally: Glue ETL, Spark, Flink) as an unmanaged table in Snowflake, pointed at the same S3 location through an external volume. Nothing about the existing pipeline changes; Athena and EMR keep working exactly as they did, and Snowflake becomes an additional reader without putting the systems that already depend on that data through a migration. This is the same underlying principle as the "the data stays on S3" rule from a real AWS-to-Snowflake integration: the value of Iceberg as the interop format is precisely that adding a new consumer doesn't require moving anything.

Where teams get themselves into trouble is reaching for a full copy-and-convert migration by default, assuming Snowflake needs to own the data to query it well. That's true for Snowflake-managed tables where you genuinely want Snowflake's own compaction and optimization — but it's the wrong default for "we already have a working data lake and want Snowflake to see it," where the unmanaged path gets you querying in an afternoon instead of a multi-week migration project.
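
To make "querying in an afternoon" concrete, here is a sketch of the kind of query the unmanaged path enables: the lake-resident `events` table joined to a native Snowflake table with no load step. The `customers` table and the `customer_id`, `event_ts`, and `segment` columns are hypothetical and stand in for whatever your schemas actually contain.

```sql
-- events: unmanaged Iceberg table whose data stays in S3 (defined earlier).
-- customers: hypothetical native Snowflake table; all column names are assumed.
SELECT
  c.segment,
  COUNT(*) AS event_count
FROM events AS e
JOIN customers AS c
  ON c.customer_id = e.customer_id
WHERE e.event_ts >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY c.segment
ORDER BY event_count DESC;
```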

Concurrent-writer confusion is the failure mode I've actually hit, and it doesn't announce itself as an error. It shows up as "Snowflake's numbers don't match Athena's." A Glue ETL job wrote a schema change (a column type widened, a new partition scheme) between two of Snowflake's scheduled refreshes. For that window, Snowflake's cached metadata pointed at a snapshot that was already stale relative to what Athena was serving from the live catalog. Both answers were "correct" for the snapshot each engine was querying, but nobody had told the BI team a schema change was mid-flight, so the discrepancy looked like a Snowflake bug rather than a normal consequence of eventual consistency between an external catalog and Snowflake's refreshed view of it. Treat refresh cadence as a data contract you communicate to consumers, not an invisible implementation detail, especially around planned schema changes on the writing side.
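
When a mismatch like this appears, the first thing worth checking is which Iceberg metadata file Snowflake is currently reading. A sketch, assuming the `events` table from earlier; qualify the name with database and schema as needed, and treat the description of the output as approximate rather than a documented contract.

```sql
-- Which metadata file (and therefore which snapshot) is Snowflake serving?
-- The result is a JSON string that includes the current metadata location.
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('events');

-- After the writer has finished a planned schema change, refresh explicitly
-- instead of waiting for the next scheduled poll.
ALTER ICEBERG TABLE events REFRESH;

-- Check again: the metadata location should now match what Glue reports.
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('events');
```

Comparing that metadata location with the one Glue holds for the same table turns "the numbers don't match" into a concrete statement about which snapshot each engine is on.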

What to carry away

"Snowflake and the data lake" as an architecture is specifically about unmanaged Iceberg tables — Glue (or another engine) stays the catalog of record, and Snowflake joins as a governed consumer through a catalog integration and external volume, with no data movement and no disruption to the pipelines already depending on the lake. Prefer the Iceberg REST path over the classic Glue API integration where both are viable, since it's the same protocol the rest of the REST-catalog ecosystem speaks, and scope the IAM role Snowflake assumes narrowly rather than reusing a broad existing role.

Two levers actually determine whether this performs well in production: warehouse size during initial scan and refresh (bigger genuinely helps, because file scanning parallelizes), and refresh cadence, which is a deliberate trade-off against your real write frequency, not a "set once and forget" setting. And treat that refresh cadence as a data contract to communicate explicitly to downstream consumers — the discrepancies that actually cause incidents aren't Snowflake bugs, they're the normal, documented consequence of eventual consistency between an external catalog and a periodically refreshed view of it, surfacing as a confusing number mismatch instead of an obvious error.