Snowflake is an up and coming cloud database platform. Learn some of its nuances and why you may want to use it.
Cloud data platform Snowflake is growing quickly and getting a lot of press. So, what is it exactly, and is there more to it than a quirky name? How does it compare with data products from Microsoft, Amazon and Google? Should I use it, and what for?
Here are a few things you should know.What Is It?
At heart, Snowflake is a database platform – think SQL Server or Oracle – purpose-built from scratch for the cloud. Its developers kept familiar concepts (tables, views, SQL queries) but threw out all assumptions about how databases traditionally work and embraced everything cloud computing offers. Being cloud-native opens all kinds of interesting doors:- Completely separate storage and compute
- Massively parallel processing (MPP)
- Pay-as-you-go usage with scaling on demand
- Durable, distributed and effectively infinite storage
Storage
Snowflake breaks data into small, immutable files stored as multiple copies in cheap and plentiful object storage (AWS S3, Azure Blob Storage, Google Buckets) designed to reliably hold and distribute enormous amounts of data (think Netflix). The largest single database run by a Snowflake customer is four petabytes and growing. It keeps storage completely separate from the compute power used to query it – unlike a server or a virtual machine, you don't need to pay for more storage space to get a more powerful engine, or vice-versa. Snowflake calls its compute engines "warehouses," to reinforce the idea that each department (or each user) can have "their own data warehouse." When Snowflake refers to a warehouse, however, think only about the workers and forklifts – not the stuff. Any warehouse (compute engine) can process any data, and multiple warehouses can access the same data at the same time without interfering with each other at all.Massively Parallel Processing
Under the covers, Snowflake uses an MPP architecture, similar to Hadoop. Picture the fiction section of a library, where they arrange all the books by author. Now imagine you want to find all the books with "dog" in the title. There might be a few dozen, but they'd be scattered all over the shelves, and you'd have to look at each title in turn to find them all. With MPP, when you say, "SELECT * FROM fiction WHERE title LIKE '%dog%'," a whole bunch of librarians fan out, each checking one shelf and bringing the results back to you. Unlike Hadoop, however, you don't need to learn a new way of asking – Snowflake is fully ANSI-SQL compliant, so you can write the standard SQL queries you're used to.Performance and Trade-Offs
Query performance is, in a word, impressive. For example, one of our clients had a query that took three-plus hours in Oracle (when it ran at all), and now it runs in just about three minutes. There's a bit of a catch, though: Snowflake optimized their platform for storing and querying large amounts of data, not for performing one-off transactions. Because it distributes the data in immutable files, updating individual records is relatively slow. You would never use it as the database for an interactive application. Snowflake is your database for analysis, dashboards, data science, and more.Other Benefits
While they may take a little longer to update, immutable files give you a couple of other interesting benefits. Because the original file never changes during an update:- There's no conflict with simultaneous reads and updates – a reader will simply get the previous version of the data if the update hasn't finished yet.
- You can make a quick virtual copy of a table or an entire database, called a "zero-copy clone," without using up any additional storage. The "clone" is simply a pointer to the original files. This step happens nearly instantly and only starts using additional storage when you make a change to one of the copies (and only the amount necessary to store the change).