Snowflake
Overview
Snowflake is our cloud-based data warehouse platform that serves as the central repository for all analytics data at Pair Team. It stores and processes data from all our sources (Arc, Elation, Zus, Regal) and provides the foundation for our analytics infrastructure, including dbt transformations and Looker visualizations.
Link to Snowflake application: https://app.snowflake.com/a5104770839961/rbb60871
What is Snowflake?
Snowflake is a fully managed data warehouse built for the cloud. Unlike traditional data warehouses, Snowflake separates storage and compute, allowing organizations to scale each independently. This architecture provides flexibility in managing costs and performance.
Key Features:
- Cloud-native architecture (runs on AWS, Azure, or GCP)
- Automatic scaling and optimization
- Support for structured and semi-structured data (JSON, Parquet, Avro, etc.)
- Zero maintenance - no infrastructure management required
- Advanced security and governance features
- Native support for data sharing across organizations
Snowflake in the Data Warehouse Ecosystem
Snowflake competes with other major cloud data warehouse platforms:
Major Competitors:
- Amazon Redshift - AWS's native data warehouse, tightly integrated with AWS services
- Google BigQuery - GCP's serverless data warehouse with SQL query capabilities
- Databricks - Lakehouse platform combining data warehousing and data lakes
- Azure Synapse Analytics - Microsoft's analytics service combining data warehousing and big data
- Traditional platforms - Oracle, Teradata, IBM Db2 (on-premises legacy systems)
Snowflake's Differentiators:
- Cloud-agnostic (works across AWS, Azure, GCP)
- Separation of storage and compute for independent scaling
- Near-zero maintenance overhead
- Native support for semi-structured data
- Secure data sharing without data movement
Our Snowflake Setup
Infrastructure
Our Snowflake instance runs on AWS infrastructure. This means:
- Data is stored in AWS S3 buckets behind the scenes
- Compute resources run on AWS EC2 instances
- We benefit from AWS's global infrastructure and reliability
- Integration with other AWS services (if needed) is straightforward
Cost Model
Snowflake uses a credit-based pricing model:
How Credits Work:
- Credits are consumed by queries and compute operations
- The more queries we run, the more credits we use
- Heavier, more complex queries consume more credits
- Storage is billed separately from compute
What Affects Credit Consumption:
- Query complexity and execution time
- Amount of data scanned
- Size of the virtual warehouse running the query
- Number of concurrent queries
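As a back-of-envelope illustration of how these factors translate into credits, the sketch below multiplies a warehouse's per-hour credit rate by its active time. The function name and structure are ours for illustration, not a Snowflake API; the per-second billing with a 60-second minimum matches Snowflake's published model, approximated here by rounding runtime up to one minute.

```python
# Hypothetical credit estimator -- illustrative only, not a Snowflake API.
# Per-hour rates match Snowflake's standard warehouse sizes.
CREDITS_PER_HOUR = {
    "X-Small": 1, "Small": 2, "Medium": 4, "Large": 8, "X-Large": 16,
}

def estimate_credits(size: str, runtime_minutes: float) -> float:
    """Estimate credits consumed by a warehouse active for the given time.

    Snowflake bills per second with a 60-second minimum each time a
    warehouse resumes; this sketch approximates that by rounding the
    runtime up to at least one minute.
    """
    minutes = max(runtime_minutes, 1.0)
    return CREDITS_PER_HOUR[size] * minutes / 60.0
```

For example, a Medium warehouse (4 credits/hour) active for 30 minutes consumes roughly `estimate_credits("Medium", 30)` = 2.0 credits.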
It's critical to monitor Snowflake usage and costs continuously. Unoptimized queries or oversized warehouses can quickly drive up costs.
Virtual Warehouses
What are Virtual Warehouses?
Virtual warehouses are clusters of compute resources that execute queries in Snowflake. Think of them as the "engines" that power your queries. Each warehouse can be sized independently and scaled up or down based on workload needs.
Warehouse Sizes
Warehouses come in different sizes, each providing more compute power at higher cost:
| Size | Relative Performance | Relative Cost | Credit Consumption per Hour |
|---|---|---|---|
| X-Small | 1x | 1x | 1 credit/hour |
| Small | 2x | 2x | 2 credits/hour |
| Medium | 4x | 4x | 4 credits/hour |
| Large | 8x | 8x | 8 credits/hour |
| X-Large | 16x | 16x | 16 credits/hour |
Note: Each size up doubles both performance and cost. Warehouses can also scale horizontally with multi-cluster configurations for handling concurrent users.
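The doubling pattern in the table means the per-hour rate for each size is simply a power of two. The helper below is a one-line illustration of that relationship (our own naming, not a Snowflake API):

```python
# Warehouse sizes in ascending order; each step up doubles compute and cost.
SIZES = ["X-Small", "Small", "Medium", "Large", "X-Large"]

def credits_per_hour(size: str) -> int:
    # X-Small = 2**0 = 1 credit/hour, Small = 2**1 = 2, and so on.
    return 2 ** SIZES.index(size)
```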
Virtually all of our warehouses are sized at X-Small because this size:
- Provides adequate performance for typical analytics queries
- Minimizes credit consumption
- Can auto-suspend when idle to save costs
- Is easy to scale up temporarily when needed
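Auto-suspend is the main reason X-Small stays cheap: a warehouse only accrues credits while it is running. This hypothetical comparison (our own helper, not a Snowflake API) shows the difference between an always-on warehouse and one that suspends outside active hours:

```python
def monthly_credits(credits_per_hour: float, active_hours_per_day: float,
                    auto_suspend: bool = True, days: int = 30) -> float:
    """Rough monthly credit total for a warehouse.

    With auto-suspend, credits accrue only during active hours;
    without it, the warehouse burns credits around the clock.
    """
    hours_per_day = active_hours_per_day if auto_suspend else 24
    return credits_per_hour * hours_per_day * days
```

An X-Small warehouse (1 credit/hour) active 4 hours a day costs about 120 credits per month with auto-suspend, versus 720 credits if left running continuously.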
Cost Management Best Practices
To keep Snowflake costs reasonable and contained:
- Monitor Usage Regularly
  - Review warehouse utilization and query performance
  - Identify expensive or inefficient queries
  - Track credit consumption trends
- Right-Size Warehouses
  - Start with X-Small and scale up only when needed
  - Use larger warehouses only for specific heavy workloads
  - Scale back down after intensive operations complete
- Optimize Query Performance
  - Write efficient SQL to minimize data scanning
  - Use clustering keys for large tables
  - Leverage materialized views where appropriate
  - Avoid SELECT * queries
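Because Snowflake stores data column by column, `SELECT *` forces it to scan every column instead of only the ones a query needs. A simple lint like the sketch below (a hypothetical helper of ours, not part of any Snowflake tooling) can flag such queries during review:

```python
import re

def uses_select_star(sql: str) -> bool:
    """Flag queries that select every column.

    On columnar storage like Snowflake's, selecting all columns scans
    more data (and burns more credits) than naming only the columns
    the query actually needs.
    """
    return re.search(r"\bselect\s+\*", sql, re.IGNORECASE) is not None
```

Note that `SELECT COUNT(*)` is not flagged, since the asterisk there is inside an aggregate rather than a column list.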
Accessing Snowflake
To access Snowflake directly for querying raw data or running analyses, contact the engineering infrastructure team or the data team. However, most end users should access data through Looker rather than querying Snowflake directly. Direct Snowflake access is typically needed for:
- Data engineering and dbt development
- Advanced analytics requiring raw data access
- Troubleshooting data quality issues
- Ad-hoc analysis not available in Looker