Snowflake

Overview

Snowflake is our cloud-based data warehouse platform that serves as the central repository for all analytics data at Pair Team. It stores and processes data from all our sources (Arc, Elation, Zus, Regal) and provides the foundation for our analytics infrastructure, including dbt transformations and Looker visualizations.

Link to Snowflake application: https://app.snowflake.com/a5104770839961/rbb60871

What is Snowflake?

Snowflake is a fully managed data warehouse built for the cloud. Unlike traditional data warehouses, Snowflake separates storage and compute, allowing organizations to scale each independently. This architecture provides flexibility in managing costs and performance.

Key Features:

  • Cloud-native architecture (runs on AWS, Azure, or GCP)
  • Automatic scaling and optimization
  • Support for structured and semi-structured data (JSON, Parquet, Avro, etc.)
  • Zero maintenance - no infrastructure management required
  • Advanced security and governance features
  • Native support for data sharing across organizations
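As a concrete illustration of the semi-structured data support above, Snowflake can store raw JSON in a VARIANT column and query it with colon-path notation and `::` casts. The table and field names below are hypothetical, not from our actual schemas:

```sql
-- Hypothetical example: extracting fields from a JSON payload
-- stored in a VARIANT column named raw_payload.
SELECT
    raw_payload:patient:id::STRING AS patient_id,
    raw_payload:visit:date::DATE   AS visit_date
FROM raw_events
WHERE raw_payload:source::STRING = 'elation';
```

This lets raw JSON land in the warehouse as-is and be shaped later (for example, in dbt models) without a separate parsing step.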

Snowflake in the Data Warehouse Ecosystem

Snowflake competes with other major cloud data warehouse platforms:

Major Competitors:

  • Amazon Redshift - AWS's native data warehouse, tightly integrated with AWS services
  • Google BigQuery - GCP's serverless data warehouse with SQL query capabilities
  • Databricks - Lakehouse platform combining data warehousing and data lakes
  • Azure Synapse Analytics - Microsoft's analytics service combining data warehousing and big data
  • Traditional platforms - Oracle, Teradata, IBM Db2 (on-premises legacy systems)

Snowflake's Differentiators:

  • Cloud-agnostic (works across AWS, Azure, GCP)
  • Separation of storage and compute for independent scaling
  • Near-zero maintenance overhead
  • Native support for semi-structured data
  • Secure data sharing without data movement

Our Snowflake Setup

Infrastructure

Our Snowflake instance runs on AWS infrastructure. This means:

  • Data is stored in AWS S3 buckets behind the scenes
  • Compute resources run on AWS EC2 instances
  • We benefit from AWS's global infrastructure and reliability
  • Integration with other AWS services (if needed) is straightforward

Cost Model

Snowflake uses a credit-based pricing model:

How Credits Work:

  • Credits are consumed by queries and compute operations
  • The more queries we run, the more credits we use
  • Heavier, more complex queries consume more credits
  • Storage is billed separately from compute

What Affects Credit Consumption:

  • Query complexity and execution time
  • Amount of data scanned
  • Size of the virtual warehouse running the query
  • Number of concurrent queries
Warning: It's critical to monitor Snowflake usage and costs continuously. Unoptimized queries or oversized warehouses can quickly drive up costs.
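One way to monitor this is Snowflake's built-in ACCOUNT_USAGE views, which record credit consumption per warehouse (note they can lag real time by a few hours). For example, to see which warehouses consumed the most credits over the past week:

```sql
-- Credits consumed per warehouse over the last 7 days.
SELECT
    warehouse_name,
    SUM(credits_used) AS credits_last_7_days
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_7_days DESC;
```

Running this periodically makes it easy to spot a warehouse whose consumption is trending upward before it becomes a billing surprise.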

Virtual Warehouses

What are Virtual Warehouses?

Virtual warehouses are clusters of compute resources that execute queries in Snowflake. Think of them as the "engines" that power your queries. Each warehouse can be sized independently and scaled up or down based on workload needs.

Warehouse Sizes

Warehouses come in different sizes, each providing more compute power at higher cost:

| Size    | Relative Performance | Relative Cost | Credit Consumption |
| ------- | -------------------- | ------------- | ------------------ |
| X-Small | 1x                   | 1x            | 1 credit/hour      |
| Small   | 2x                   | 2x            | 2 credits/hour     |
| Medium  | 4x                   | 4x            | 4 credits/hour     |
| Large   | 8x                   | 8x            | 8 credits/hour     |
| X-Large | 16x                  | 16x           | 16 credits/hour    |

Note: Each size up doubles both performance and cost. Warehouses can also scale horizontally with multi-cluster configurations for handling concurrent users.
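Resizing is a single statement and takes effect immediately for new queries, which is what makes temporary scale-ups practical. A sketch (the warehouse name is hypothetical):

```sql
-- Temporarily scale a warehouse up for a heavy workload, then back down.
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'LARGE';

-- ... run the heavy backfill or transformation here ...

ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'XSMALL';
```

Because each size up doubles the credit rate, remembering the second statement is as important as the first.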

Tip: Virtually all of our warehouses are sized at X-Small because:

  • Adequate performance for typical analytics queries
  • Minimizes credit consumption
  • Can auto-suspend when not in use to save costs
  • Easy to scale up temporarily if needed
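The auto-suspend behavior described above is set when a warehouse is created (or altered). A minimal sketch of such a definition, with a hypothetical warehouse name:

```sql
-- Illustrative DDL matching the X-Small / auto-suspend pattern above.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60           -- suspend after 60 seconds of inactivity
    AUTO_RESUME = TRUE          -- wake automatically when a query arrives
    INITIALLY_SUSPENDED = TRUE; -- don't start consuming credits at creation
```

With AUTO_SUSPEND and AUTO_RESUME set, a warehouse only consumes credits while queries are actually running.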

Cost Management Best Practices

To keep Snowflake costs reasonable and contained:

  1. Monitor Usage Regularly

    • Review warehouse utilization and query performance
    • Identify expensive or inefficient queries
    • Track credit consumption trends
  2. Right-Size Warehouses

    • Start with X-Small and scale up only when needed
    • Use larger warehouses only for specific heavy workloads
    • Scale back down after intensive operations complete
  3. Optimize Query Performance

    • Write efficient SQL to minimize data scanning
    • Use clustering keys for large tables
    • Leverage materialized views where appropriate
    • Avoid SELECT * queries
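Because Snowflake stores data in columnar format, selecting only the columns you need directly reduces the data scanned and therefore the credits consumed. A before/after sketch (table and column names are illustrative):

```sql
-- Instead of scanning every column:
--   SELECT * FROM encounters;

-- Select only what's needed and filter early:
SELECT encounter_id, patient_id, encounter_date
FROM encounters
WHERE encounter_date >= DATEADD('month', -3, CURRENT_DATE());
```

The same principle applies inside dbt models: narrow, filtered SELECTs keep both runtimes and credit consumption down.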

Accessing Snowflake

To access Snowflake directly for querying raw data or running analyses, contact the engineering infrastructure team or the data team. However, most end users should access data through Looker rather than querying Snowflake directly. Direct Snowflake access is typically needed for:

  • Data engineering and dbt development
  • Advanced analytics requiring raw data access
  • Troubleshooting data quality issues
  • Ad-hoc analysis not available in Looker