
Basic Concepts

Architecture

Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the data warehouse. But similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the entire dataset locally. This approach offers the data management simplicity of a shared-disk architecture with the performance and scale-out benefits of a shared-nothing architecture.


Editions

There are four different editions:

  • Standard Edition
  • Enterprise Edition
  • Business Critical Edition
  • Virtual Private Snowflake (VPS)

Snowflake does not offer an on-premises edition at this time.

Session

A Snowflake session can have only one virtual warehouse specified at a time, though you can change the session’s current virtual warehouse at any point.
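For example, switching the session’s current warehouse is a single statement; a minimal sketch (the warehouse name is hypothetical):

-- switch the active virtual warehouse for the current session
USE WAREHOUSE my_wh;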

Virtual Warehouse

You can increase or decrease the size of a Snowflake virtual warehouse as your user base and query demand change.

Snowflake will not start executing SQL queries on a new virtual warehouse until all of its servers are provisioned. If provisioning fails for any reason, Snowflake will attempt to repair the failed servers, and SQL execution will begin once 50% or more of the servers are provisioned.
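As a sketch (warehouse name and sizes are hypothetical choices), resizing is a single statement and applies to subsequently submitted queries:

-- create a warehouse, then scale it up when query demand grows
CREATE WAREHOUSE IF NOT EXISTS my_wh WITH WAREHOUSE_SIZE = 'SMALL';
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';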

Data Type

Snowflake supports all of the following semi-structured data formats (see the sketch after this list):

  • AVRO
  • Parquet
  • ORC
  • JSON
  • XML
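Semi-structured data is typically loaded into a VARIANT column and queried with path notation. A minimal sketch, with hypothetical table and field names:

-- store raw JSON in a VARIANT column
CREATE TABLE events (payload VARIANT);

-- query nested JSON fields with path notation and cast to SQL types
SELECT payload:customer.id::STRING AS customer_id,
       payload:event_type::STRING  AS event_type
FROM events;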

Data Sharing

In Snowflake Data Sharing, no data is copied; only metadata enables the sharing. Since no data is copied, the target Snowflake account (the consumer) is not charged for any storage.

Each Snowflake share can consist of (see the sketch after this list):

  1. privileges that grant access to the database and the schema from which objects will be shared
  2. privileges that grant access to the specific objects (tables, secure views) being shared
  3. the consumer accounts with which the database and its objects are to be shared
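A sketch of setting up a share along these lines (all object names are hypothetical; replace consumer_account with the consumer’s account identifier):

-- create a share and grant access to a database, schema, and table
CREATE SHARE my_share;
GRANT USAGE ON DATABASE my_db TO SHARE my_share;
GRANT USAGE ON SCHEMA my_db.public TO SHARE my_share;
GRANT SELECT ON TABLE my_db.public.my_table TO SHARE my_share;

-- add the consumer account(s)
ALTER SHARE my_share ADD ACCOUNTS = consumer_account;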

In data sharing scenarios, which virtual warehouse executes queries depends on the consumer:

  • If the consumer is a full Snowflake account, the user will use their own virtual warehouse to execute queries
  • If a non-Snowflake customer is consuming the share (via a reader account), the customer will use the producer’s virtual warehouse resources to execute queries
    • it is better to configure a separate warehouse for this purpose

In Snowflake, sharing to or from Virtual Private Snowflake (VPS) accounts is not supported.

Cloning

Cloning, also referred to as zero-copy cloning, creates a copy of a database, schema, or table. A snapshot of the data present in the source object is taken when the clone is created and is made available to the cloned object.
The cloned object is writable and independent of the clone source: changes made to either the source object or the cloned object are not reflected in the other.
Cloning a database clones all the schemas and tables within that database.
Cloning a schema clones all the tables within that schema.

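A minimal sketch of zero-copy cloning at each level (object names are hypothetical):

-- clone a whole database, a schema, or a single table
CREATE DATABASE my_db_dev CLONE my_db;
CREATE SCHEMA my_db.staging_clone CLONE my_db.staging;
CREATE TABLE my_table_clone CLONE my_table;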

Table Cloning

Table Cloning will copy

  • table structure
  • data
  • certain attributes

Table cloning does not copy load metadata, so files that have already been loaded into the source table can be loaded again into the cloned table.

Query Performance Optimisation

Clustering Key

A table is likely to benefit from a clustering key when (see the sketch after this list):

  • query performance has slowed down over time
  • the table is multiple terabytes in size
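A clustering key can be added to an existing table; a sketch with hypothetical table and column names:

-- define a clustering key on a large table
ALTER TABLE big_events CLUSTER BY (event_date, region);

-- check how well the table is clustered on those columns
SELECT SYSTEM$CLUSTERING_INFORMATION('big_events', '(event_date, region)');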

Service Maintenance

Snowflake uses a staged release process for new releases: updates are first applied to early-access accounts, then to Standard and Premier accounts, and finally to Enterprise accounts.

Data Ingestion

Load Metadata

Snowflake maintains file load metadata, through which it tracks which files have already been loaded into a table. By default, load metadata expires after 64 days.
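Because of load metadata, re-running the same COPY picks up only new files; a sketch with hypothetical table and stage names:

-- files already recorded in load metadata are silently skipped
COPY INTO my_table FROM @my_stage;

-- FORCE = TRUE ignores load metadata and reloads every file (may duplicate rows)
COPY INTO my_table FROM @my_stage FORCE = TRUE;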

Useful Commands

  • Show warehouses matching a filter
SHOW WAREHOUSES LIKE '%testing%';
  • Show the most recent query ID
SELECT last_query_id();

If you want to retrieve the second most recent query ID, you can pass -2 as the argument:

SELECT last_query_id(-2);

Reference

General Introduction

Key customers


7 Best Practices for Building Data Applications


Isolating workloads with Snowflake


Key features

  • Zero-Copy Clone
  • Time Travel
    • query historical data with the AT / BEFORE clauses (see the sketch after this list)
  • Semi-structured Support
    • add timestamp column
  • Materialized Views
    • Snowflake materialized views differ from those on other platforms

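For example, Time Travel lets you query a table as it existed in the past; a minimal sketch (table name, offset, and timestamp are hypothetical):

-- query the table as of 5 minutes ago
SELECT * FROM my_table AT(OFFSET => -60*5);

-- query the table as of a specific timestamp
SELECT * FROM my_table AT(TIMESTAMP => '2020-01-01 00:00:00'::TIMESTAMP_LTZ);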

Code in your Language

  • Snowflake connection diagnostics
    • Snowpipe

Use the Ecosystem

Manage Cost

  • Manage compute (see the sketch below)
    • configure Auto Suspend
    • configure Auto Resume
    • configure the suspend window after an idle period

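A sketch of these settings (warehouse name and the 5-minute idle window are hypothetical choices):

-- suspend after 300 seconds of idle time; resume automatically on the next query
ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;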

  • Monitor resources (see the sketch below)
    • all data apps should utilise resource monitors

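A sketch of a resource monitor (quota, thresholds, and names are hypothetical):

-- cap spend at 100 credits per month; notify at 80%, suspend at 100%
CREATE RESOURCE MONITOR my_monitor
  WITH CREDIT_QUOTA = 100
  TRIGGERS ON 80 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

-- attach the monitor to a warehouse
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = my_monitor;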

FAQ

Q: Is Snowflake a transactional database?

A: Snowflake supports ACID transactions, but it doesn’t enforce strong constraints like typical OLTP databases: primary and foreign keys are informational only, so referential integrity is not enforced by the engine.

Q: What is Snowpipe?

A: It’s a data ingestion tool from Snowflake (see the sketch after this list).

  • a serverless data-ingestion (ETL) tool
  • Snowflake will charge you Snowflake credits for the compute it uses
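A sketch of a Snowpipe definition (pipe, table, and stage names are hypothetical; AUTO_INGEST relies on cloud storage event notifications):

-- continuously load new files from an external stage as they arrive
CREATE PIPE my_pipe AUTO_INGEST = TRUE AS
  COPY INTO my_table FROM @my_stage;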

Q: How does Snowflake work with CI/CD pipelines?

A: With Snowflake, copying data or provisioning an environment is just a metadata update (zero-copy cloning), which makes things much easier for developers.

Q: How does Snowflake handle incremental data?

A: Snowflake’s storage is append-only: tables are stored as immutable micro-partitions, so incremental changes write new partitions rather than modifying data in place.

Q: How does Snowflake remove the operational burden?

A: Snowflake is a managed service, and most of the operational work is handled by Snowflake, e.g. backups, upgrades, and load balancing for queries.


Q: How does auto-scaling work in Snowflake?

A: Snowflake offers a Standard scaling policy and an Economy scaling policy; which to use depends on the customer’s preference for performance versus conserving credits.
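A sketch of a multi-cluster warehouse with a scaling policy (all names and values are hypothetical; multi-cluster warehouses require Enterprise edition or higher):

-- scale out to up to 3 clusters under concurrency; ECONOMY favours conserving credits
CREATE WAREHOUSE my_wh WITH
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY = 'ECONOMY';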

Q: What businesses are using Snowflake?

A:

  • financial institutions, with user-behaviour logs
  • marketing campaigns
    • customers provide data via API or the data sharing feature
    • data exchange between providers and customers
  • security analytics services
  • IoT data, e.g. Trimble
  • typically as a BI application, though nowadays Snowflake acts like a database + BI application.

Q: Do you offer additional security components to address healthcare HIPAA compliance? Please share details.

A: Snowflake is a HIPAA-compliant platform. We have documentation that can be shared. Please reach out to the account team, which can liaise with the security team and provide the requested certifications under NDA.

Q: Do you have AI/ML components to address data scientists’ needs on top of Snowflake?

A: Yes. Data science workloads are among the most common workloads that run on Snowflake. Snowflake has connectors for Spark, Python, and R that help data scientists capture and process data as part of data prep and perform feature-engineering operations. Snowflake then integrates with ML partners to deliver model building and deployment.

Q: I am a Python coder. When the deck shows Python “SQLAlchemy”, does it mean we can only connect to Snowflake via this ORM library, or can we directly issue raw queries to Snowflake and get results within Python code?

A: You can use the Snowflake Python connector or its pandas integration to connect to Snowflake and execute pushdown data processing on Snowflake. More details here: https://docs.snowflake.com/en/user-guide/python-connector.html#snowflake-connector-for-python

Q: Do you plan to offer vertical scaling of a warehouse from Medium to Large, or Large to XL? Please let us know whether there is any pattern-based recommendation of warehouse size based on workloads.

A: Snowflake has on-demand vertical scaling of warehouses today. You can instantly scale up to any size warehouse based on your workload requirements. Today, we do not have a recommendation engine running in the background that would suggest the size of the warehouse based on usage.

Q: What are the main factors to consider in deciding what size to set our warehouse for data ingestion? Any quick tips/guidance?

A: Snowflake gives you complete control over cost versus performance. You can execute your workload on one warehouse size, then scale up or down, get linear SLA scaling, and decide on the right size based on your SLA requirements.

Q: Can Snowflake serve as a data hub for other applications to extract data from via API calls, essentially hosting a ‘system API’ in Snowflake?

A: Snowflake as a data hub is one of the most common ways to access data from Snowflake, especially in builder use cases. You will have to have an API server / gateway that can send requests to Snowflake for data. Snowflake’s auto scale-out and ability to handle concurrent connections can support such a framework.

Q: Is Snowpipe recommended only for continuous data loads?

A: It is typically built for that use case but of course not restricted to it.

Going back to Snowpipe vs virtual warehouse: are you saying that for ingestion it really doesn’t matter which one you choose?

It does. Both have their use cases, benefits, and advantages over each other. Typically, you will use Snowpipe for near-real-time ingestion, whereas COPY INTO through a virtual warehouse is for batch ingestion.

Q: Does the Snowflake-Spark connector support aggregations on a Spark data stream? Our requirement is, after reading data using a read stream, we need to find the latest records.

A: If you load the data into Snowflake from your Spark stream, you can use a Snowflake Stream to identify new / latest records. More details here: https://docs.snowflake.com/en/user-guide/streams.html#change-tracking-using-table-streams
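A sketch of change tracking with a table stream (table and stream names are hypothetical):

-- create a stream to capture changes on the table
CREATE STREAM my_stream ON TABLE my_table;

-- read only the rows inserted since the stream was last consumed
SELECT * FROM my_stream WHERE METADATA$ACTION = 'INSERT';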

Q: Is Snowflake zero-copy cloning similar to data virtualization? Is there any relationship between the two?

A: No, they are not the same. Snowflake zero-copy cloning is similar to the clone operation in traditional warehouses for setting up a parallel environment. The cloned object is an independent object that can be used to perform DML operations without impacting the primary object. The beauty of the Snowflake architecture is that no duplication of data is involved in the cloning process.

“You will have to have an API server / gateway that can send requests to Snowflake for data.” ==> Do you have a user guide on how to set up such an API server/gateway?

You can either use API gateways as a service from cloud providers or bring your own; Snowflake usage and integration is no different for either. There are some reference architectures that you can refer to for your use cases. Link: https://www.snowflake.com/developers/

SnowSQL

For macOS, you can install SnowSQL from the command line with Homebrew:

brew install --cask snowflake-snowsql

JDBC Driver

Snowflake JDBC maven repo: https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/

wget https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.9.2/snowflake-jdbc-3.9.2.jar

Python SDK

pip install snowflake-connector-python

Spark SDK

Snowflake Spark Connector repo: https://github.com/snowflakedb/spark-snowflake