Legacy on-premises data warehouses often suffer from tightly coupled compute and storage, leading to significant over-provisioning and scalability bottlenecks. In modern data engineering, the migration to cloud-native Enterprise Data Warehouses (EDWs) is driven by the need for elasticity and the separation of compute from storage. This article analyzes the architectural trade-offs between Snowflake, Google BigQuery, and AWS Redshift, focusing on performance characteristics, cost predictability, and operational overhead rather than marketing features.
1. Architectural Divergence: Compute and Storage
The core differentiator among these platforms lies in how they handle the decoupling of storage and processing power. This fundamental design choice dictates latency, concurrency handling, and scaling speed.
Snowflake uses a multi-cluster, shared-data architecture. Storage is handled in a proprietary format (micro-partitions) on top of cloud object storage (S3, GCS, or Azure Blob), while compute runs in isolated "Virtual Warehouses." This allows different workloads (e.g., ETL vs. BI dashboarding) to run on separate compute clusters, accessing the same underlying data without resource contention.
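A minimal sketch of that isolation, assuming two illustrative warehouses named etl_wh and bi_wh (sizes and timeouts are placeholders, not recommendations):

```sql
-- Two isolated virtual warehouses that share the same underlying data
CREATE WAREHOUSE etl_wh
  WITH WAREHOUSE_SIZE = 'LARGE'
       AUTO_SUSPEND   = 60     -- seconds of inactivity before suspending
       AUTO_RESUME    = TRUE;

CREATE WAREHOUSE bi_wh
  WITH WAREHOUSE_SIZE = 'SMALL'
       AUTO_SUSPEND   = 60
       AUTO_RESUME    = TRUE;

-- Heavy ETL batches and BI dashboards can query the same tables
-- without competing for compute, each session picking its own warehouse:
USE WAREHOUSE etl_wh;   -- session running batch loads
USE WAREHOUSE bi_wh;    -- session serving dashboards
```

Because each warehouse is billed and scaled independently, a long-running backfill on etl_wh cannot slow down dashboards served by bi_wh.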
Google BigQuery operates on a completely serverless model based on Dremel. It abstracts the notion of instances entirely. Resources are allocated as "Slots" (virtual CPUs). The storage runs on Colossus, Google's distributed file system, and compute is dynamically assigned per query. This architecture excels in burst capability but offers less granular control over hardware configurations compared to provisioned models.
AWS Redshift started as a PostgreSQL fork with MPP capabilities. Modern Redshift (RA3 node types) separates compute from storage via Redshift Managed Storage, allowing colder data to be offloaded to S3 while hot data stays in local SSD caches. Although it has moved towards decoupling, it still relies on a cluster-provisioning model, so resizing operations are not as instantaneous as BigQuery's slot allocation or Snowflake's warehouse resume and resize.
2. Performance Optimization & Indexing Strategies
Unlike a traditional RDBMS, these columnar stores do not use B-Tree indexes. Optimization relies heavily on pruning: using metadata (such as min/max zone maps) and physical data layout to avoid scanning irrelevant data.
In Snowflake, performance tuning focuses on clustering keys. Snowflake automatically divides data into micro-partitions and records min/max metadata for each. If queries frequently filter by a specific column (e.g., `event_date`), defining a clustering key on it keeps related values co-located so the optimizer can skip micro-partitions that contain no relevant data. Automatic clustering incurs background compute costs but can dramatically reduce the amount of data scanned and therefore query execution time.
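A short sketch of what this looks like in practice, assuming an illustrative `analytics.events` table that is commonly filtered by `event_date`:

```sql
-- Keep rows with similar event_date values in the same micro-partitions
-- so filters on event_date can prune most of the table
ALTER TABLE analytics.events CLUSTER BY (event_date);

-- Check how well the table is currently clustered on that key
SELECT SYSTEM$CLUSTERING_INFORMATION('analytics.events', '(event_date)');
```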
BigQuery uses "Partitioning" and "Clustering." Partitioning physically divides the data (usually by time or integer range), while clustering sorts the data within those partitions. This is critical for cost control in BigQuery, as the pricing model is often based on data scanned.
```sql
-- BigQuery table definition example:
-- partitioning by day and clustering by customer_id reduces scan costs
CREATE TABLE dataset.logs (
  request_id      STRING,
  customer_id     INT64,
  event_timestamp TIMESTAMP
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY customer_id
OPTIONS (
  description = "Optimized log storage"
);
```
Redshift relies on sort keys (SORTKEY) and distribution keys (DISTKEY). The distribution key determines how rows are sharded across nodes. A poor choice of distribution key forces heavy data redistribution or broadcasting across the network during joins, which is a primary cause of performance degradation in Redshift clusters.
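As a sketch, the DDL below uses an illustrative `sales.orders` schema: distributing on `customer_id` co-locates rows that join on that column, and sorting on the timestamp lets range filters skip blocks.

```sql
-- Illustrative Redshift table with explicit distribution and sort keys
CREATE TABLE sales.orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_ts     TIMESTAMP,
    amount       DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- joins on customer_id avoid cross-node shuffling
SORTKEY (order_ts);     -- range filters on order_ts skip sorted blocks
```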
3. Cost Governance and Predictability
The cost model is often the deciding factor for engineering leadership.
| Platform | Compute Pricing Model | Storage Pricing | Idle Cost |
|---|---|---|---|
| Snowflake | Per-second, credit-based, by warehouse size | Compressed storage, average monthly usage | Zero when auto-suspended |
| BigQuery | Per-TB scanned (on-demand) or slot-hours (capacity) | Active vs. long-term storage rates | Zero (serverless) |
| Redshift | Per node-hour (Reserved Instances available) | Managed storage capacity used | Accrues unless the cluster is paused |
Snowflake offers excellent governance via "Resource Monitors." You can set hard limits on credit consumption per warehouse to prevent runaway costs from bad queries. The "Auto-suspend" feature (often set to 60 seconds or less) ensures you only pay when queries are running.
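As a sketch, assuming an illustrative `bi_wh` warehouse and an arbitrary 100-credit monthly quota, the combination of a resource monitor and auto-suspend looks roughly like this:

```sql
-- Cap monthly credit consumption and suspend the warehouse when the quota is hit
CREATE RESOURCE MONITOR bi_monthly_cap
  WITH CREDIT_QUOTA = 100          -- illustrative quota
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
       TRIGGERS ON 80  PERCENT DO NOTIFY
                ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE bi_wh SET RESOURCE_MONITOR = bi_monthly_cap;
ALTER WAREHOUSE bi_wh SET AUTO_SUSPEND = 60;   -- suspend after 60 idle seconds
```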
BigQuery's on-demand model (paying per TB scanned) is risky for unoptimized queries. A `SELECT *` on a petabyte table can cost thousands of dollars instantly. Using the "Maximum bytes billed" setting in query configurations is mandatory for safety. For predictable workloads, switching to "Capacity Pricing" (fixed slots) is recommended.
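The maximum-bytes-billed limit itself is a job-level setting (configured through the API, client libraries, or the bq CLI) rather than part of the SQL text. A complementary SQL-level guard, sketched below against the earlier `dataset.logs` table, is to require a partition filter so that accidental full-table scans are rejected outright:

```sql
-- Reject queries on this partitioned table that do not filter on the partition column
ALTER TABLE dataset.logs
SET OPTIONS (require_partition_filter = TRUE);
```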
Conclusion: Selecting the Right Tool
There is no single "best" data warehouse; the choice depends on your team's operational capability and workload patterns. Snowflake is superior for organizations requiring strict workload isolation, multi-cloud support, and a near-zero maintenance experience. BigQuery excels in ad-hoc analysis and scenarios where traffic is highly spiky, leveraging its serverless nature to scale from zero to thousands of slots almost instantly. Redshift remains a strong contender for AWS-centric ecosystems where steady-state workloads can benefit from Reserved Instance pricing and deep integration with the AWS Glue Data Catalog.
Ultimately, migration should focus on evolving the data model. Moving a legacy schema onto these platforms without proper partitioning, clustering, or distribution strategies will result in higher costs than the on-premises baseline.