Snowflake
Overview
Snowflake is our cloud-based data warehouse platform that serves as the central repository for all analytics data at Pair Team. It stores and processes data from all our sources (Arc, Elation, Zus, Regal) and provides the foundation for our analytics infrastructure, including dbt transformations and Looker visualizations.
Link to Snowflake application: https://app.snowflake.com/a5104770839961/rbb60871
What is Snowflake?
Snowflake is a fully managed data warehouse built for the cloud. Unlike traditional data warehouses, Snowflake separates storage and compute, allowing organizations to scale each independently. This architecture provides flexibility in managing costs and performance.
Key Features:
- Cloud-native architecture (runs on AWS, Azure, or GCP)
- Automatic scaling and optimization
- Support for structured and semi-structured data (JSON, Parquet, Avro, etc.)
- Zero maintenance - no infrastructure management required
- Advanced security and governance features
- Native support for data sharing across organizations
Snowflake in the Data Warehouse Ecosystem
Snowflake competes with other major cloud data warehouse platforms:
Major Competitors:
- Amazon Redshift - AWS's native data warehouse, tightly integrated with AWS services
- Google BigQuery - GCP's serverless data warehouse with SQL query capabilities
- Databricks - Lakehouse platform combining data warehousing and data lakes
- Azure Synapse Analytics - Microsoft's analytics service combining data warehousing and big data
- Traditional platforms - Oracle, Teradata, IBM Db2 (on-premises legacy systems)
Snowflake's Differentiators:
- Cloud-agnostic (works across AWS, Azure, GCP)
- Separation of storage and compute for independent scaling
- Near-zero maintenance overhead
- Native support for semi-structured data
- Secure data sharing without data movement
Our Snowflake Setup
Infrastructure
Our Snowflake instance runs on AWS infrastructure. This means:
- Data is stored in AWS S3 buckets behind the scenes
- Compute resources run on AWS EC2 instances
- We benefit from AWS's global infrastructure and reliability
- Integration with other AWS services (if needed) is straightforward
Cost Model
Snowflake uses a credit-based pricing model:
How Credits Work:
- Credits are consumed by queries and compute operations
- The more queries we run, the more credits we use
- Heavier, more complex queries consume more credits
- Storage is billed separately from compute
What Affects Credit Consumption:
- Query complexity and execution time
- Amount of data scanned
- Size of the virtual warehouse running the query
- Number of concurrent queries
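As a back-of-envelope illustration of how these factors translate into credits, the sketch below multiplies a warehouse's per-hour credit rate by its active time. The function name and structure are ours for illustration, not a Snowflake API; the per-second billing with a 60-second minimum matches Snowflake's published model, approximated here by rounding runtime up to one minute.

```python
# Hypothetical credit estimator -- illustrative only, not a Snowflake API.
# Per-hour rates match Snowflake's standard warehouse sizes.
CREDITS_PER_HOUR = {
    "X-Small": 1, "Small": 2, "Medium": 4, "Large": 8, "X-Large": 16,
}

def estimate_credits(size: str, runtime_minutes: float) -> float:
    """Estimate credits consumed by a warehouse active for the given time.

    Snowflake bills per second with a 60-second minimum each time a
    warehouse resumes; this sketch approximates that by rounding the
    runtime up to at least one minute.
    """
    minutes = max(runtime_minutes, 1.0)
    return CREDITS_PER_HOUR[size] * minutes / 60.0
```

For example, a Medium warehouse (4 credits/hour) active for 30 minutes consumes roughly `estimate_credits("Medium", 30)` = 2.0 credits.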
It's critical to monitor Snowflake usage and costs continuously. Unoptimized queries or oversized warehouses can quickly drive up costs.
Virtual Warehouses
What are Virtual Warehouses?
Virtual warehouses are clusters of compute resources that execute queries in Snowflake. Think of them as the "engines" that power your queries. Each warehouse can be sized independently and scaled up or down based on workload needs.
Warehouse Sizes
Warehouses come in different sizes, each providing more compute power at higher cost:
| Size | Relative Performance | Relative Cost | Credit Consumption per Hour |
|---|---|---|---|
| X-Small | 1x | 1x | 1 credit/hour |
| Small | 2x | 2x | 2 credits/hour |
| Medium | 4x | 4x | 4 credits/hour |
| Large | 8x | 8x | 8 credits/hour |
| X-Large | 16x | 16x | 16 credits/hour |
Note: Each size up doubles both performance and cost. Warehouses can also scale horizontally with multi-cluster configurations for handling concurrent users.
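The doubling pattern in the table means the per-hour rate for each size is simply a power of two. The helper below is a one-line illustration of that relationship (our own naming, not a Snowflake API):

```python
# Warehouse sizes in ascending order; each step up doubles compute and cost.
SIZES = ["X-Small", "Small", "Medium", "Large", "X-Large"]

def credits_per_hour(size: str) -> int:
    # X-Small = 2**0 = 1 credit/hour, Small = 2**1 = 2, and so on.
    return 2 ** SIZES.index(size)
```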
Virtually all of our warehouses are sized at X-Small because this size:
- Provides adequate performance for typical analytics queries
- Minimizes credit consumption
- Can auto-suspend when idle to save costs
- Is easy to scale up temporarily when needed
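Auto-suspend is the main reason X-Small stays cheap: a warehouse only accrues credits while it is running. This hypothetical comparison (our own helper, not a Snowflake API) shows the difference between an always-on warehouse and one that suspends outside active hours:

```python
def monthly_credits(credits_per_hour: float, active_hours_per_day: float,
                    auto_suspend: bool = True, days: int = 30) -> float:
    """Rough monthly credit total for a warehouse.

    With auto-suspend, credits accrue only during active hours;
    without it, the warehouse burns credits around the clock.
    """
    hours_per_day = active_hours_per_day if auto_suspend else 24
    return credits_per_hour * hours_per_day * days
```

An X-Small warehouse (1 credit/hour) active 4 hours a day costs about 120 credits per month with auto-suspend, versus 720 credits if left running continuously.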
Cost Management Best Practices
To keep Snowflake costs reasonable and contained:
- Monitor Usage Regularly
  - Review warehouse utilization and query performance
  - Identify expensive or inefficient queries
  - Track credit consumption trends
- Right-Size Warehouses
  - Start with X-Small and scale up only when needed
  - Use larger warehouses only for specific heavy workloads
  - Scale back down after intensive operations complete
- Optimize Query Performance
  - Write efficient SQL to minimize data scanning
  - Use clustering keys for large tables
  - Leverage materialized views where appropriate
  - Avoid SELECT * queries
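Because Snowflake stores data column by column, `SELECT *` forces it to scan every column instead of only the ones a query needs. A simple lint like the sketch below (a hypothetical helper of ours, not part of any Snowflake tooling) can flag such queries during review:

```python
import re

def uses_select_star(sql: str) -> bool:
    """Flag queries that select every column.

    On columnar storage like Snowflake's, selecting all columns scans
    more data (and burns more credits) than naming only the columns
    the query actually needs.
    """
    return re.search(r"\bselect\s+\*", sql, re.IGNORECASE) is not None
```

Note that `SELECT COUNT(*)` is not flagged, since the asterisk there is inside an aggregate rather than a column list.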
Accessing Snowflake
To access Snowflake directly for querying raw data or running analyses, contact the engineering infrastructure team or the data team. However, most end users should access data through Looker rather than querying Snowflake directly. Direct Snowflake access is typically needed for:
- Data engineering and dbt development
- Advanced analytics requiring raw data access
- Troubleshooting data quality issues
- Ad-hoc analysis not available in Looker