Snowflake is an up-and-coming cloud database platform. Learn some of its nuances and why you may want to use it.
Cloud data platform Snowflake is growing quickly and getting a lot of press. So, what is it exactly, and is there more to it than a quirky name? How does it compare with data products from Microsoft, Amazon and Google? Should I use it, and what for?
Here are a few things you should know.
What Is It?
At heart, Snowflake is a database platform – think SQL Server or Oracle – purpose-built from scratch for the cloud. Its developers kept familiar concepts (tables, views, SQL queries) but threw out all assumptions about how databases traditionally work and embraced everything cloud computing offers.
Being cloud-native opens all kinds of interesting doors:
- Completely separate storage and compute
- Massively parallel processing (MPP)
- Pay-as-you-go usage with scaling on demand
- Durable, distributed and effectively infinite storage
Let’s look behind each of these doors.
Separate Storage and Compute
Snowflake breaks data into small, immutable files stored as multiple copies in cheap and plentiful object storage (AWS S3, Azure Blob Storage, Google Cloud Storage) designed to reliably hold and distribute enormous amounts of data (think Netflix). The largest single database run by a Snowflake customer is four petabytes and growing.
It keeps storage completely separate from the compute power used to query it – unlike a server or a virtual machine, you don’t need to pay for more storage space to get a more powerful engine, or vice versa. Snowflake calls its compute engines “warehouses,” to reinforce the idea that each department (or each user) can have “their own data warehouse.”
When Snowflake refers to a warehouse, however, think only about the workers and forklifts – not the stuff. Any warehouse (compute engine) can process any data, and multiple warehouses can access the same data at the same time without interfering with each other at all.
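As a sketch of what this looks like in practice, the statements below create two independent warehouses that can query the same data side by side. The names and sizes are illustrative, not from the article:

```sql
-- Create an independent compute engine; storage is unaffected.
CREATE WAREHOUSE analytics_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'  -- compute size, independent of data volume
       AUTO_SUSPEND   = 60        -- shut down after 60 seconds idle
       AUTO_RESUME    = TRUE;     -- start again when a query arrives

-- A second warehouse can read the same tables at the same time
-- without contending with the first.
CREATE WAREHOUSE finance_wh
  WITH WAREHOUSE_SIZE = 'SMALL';
```

Because a warehouse is only workers and forklifts, dropping one later deletes no data at all.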
Massively Parallel Processing
Under the covers, Snowflake uses an MPP architecture, similar to Hadoop. Picture the fiction section of a library, where they arrange all the books by author. Now imagine you want to find all the books with “dog” in the title. There might be a few dozen, but they’d be scattered all over the shelves, and you’d have to look at each title in turn to find them all. With MPP, when you say, “SELECT * FROM fiction WHERE title LIKE '%dog%'”, a whole bunch of librarians fan out, each checking one shelf and bringing the results back to you. Unlike Hadoop, however, you don’t need to learn a new way of asking – Snowflake is fully ANSI-SQL compliant, so you can write the standard SQL queries you’re used to.
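The query from the library analogy, written out, is plain ANSI SQL – nothing Snowflake-specific is required (the table and column come from the analogy, not a real schema):

```sql
-- Standard SQL; Snowflake fans the scan out across its compute
-- nodes automatically, one "librarian" per slice of the data.
SELECT *
FROM fiction
WHERE title LIKE '%dog%';
```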
Performance and Trade-Offs
Query performance is, in a word, impressive. For example, one of our clients had a query that took three-plus hours in Oracle (when it ran at all), and now it runs in just about three minutes. There’s a bit of a catch, though: Snowflake optimized its platform for storing and querying large amounts of data, not for performing one-off transactions. Because it distributes the data in immutable files, updating individual records is relatively slow. You would never use it as the database for an interactive application. Snowflake is your database for analysis, dashboards, data science, and more.
While they may take a little longer to update, immutable files give you a couple of other interesting benefits. Because the original file never changes during an update:
- There’s no conflict with simultaneous reads and updates – a reader will simply get the previous version of the data if the update hasn’t finished yet.
- You can make a quick virtual copy of a table or an entire database, called a “zero-copy clone,” without using up any additional storage. The “clone” is simply a pointer to the original files. This step happens nearly instantly and only starts using additional storage when you make a change to one of the copies (and only the amount necessary to store the change).
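A zero-copy clone is a single statement. This is a hedged sketch with illustrative object names (`production`, `dev_sandbox`, `orders`):

```sql
-- Clone an entire database without copying any data files;
-- the clone is just pointers to the original immutable files.
CREATE DATABASE dev_sandbox CLONE production;

-- The same works at table level. Clones are writable; extra
-- storage is consumed only for the files you change afterward.
CREATE TABLE orders_test CLONE orders;
```

This is why cloning a multi-terabyte database for testing finishes in seconds and costs nothing until you start modifying it.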
If you enable “time travel,” Snowflake will hang onto old versions of the files for a while, so you can query the data precisely as it was an hour, a day, or a week ago. This feature also means that you can “un-drop” a table after accidentally deleting it.
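Both uses of time travel look like this in SQL (table name illustrative; the retention window must cover the point in time you ask for):

```sql
-- Query the table exactly as it was one hour ago
-- (the offset is given in seconds).
SELECT *
FROM orders AT (OFFSET => -3600);

-- Recover a table dropped by mistake, within the retention window.
UNDROP TABLE orders;
```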
With compute separate from storage, you need to make some careful comparisons based on how you’ll use it. The storage is cheap. Snowflake passes through the cost of the underlying provider (currently $23 per TB per month for Amazon S3). The compute appears relatively expensive on a per-hour basis, but it doesn’t run 24 hours a day like a traditional database server. Think of it more like an automatic faucet. It starts when someone wants to use it and shuts down when they’re finished.
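Because compute scales independently, turning up the faucet is a one-line change that leaves the data and its storage cost untouched. A hedged example, with an illustrative warehouse name:

```sql
-- Scale compute up for a heavy job, then back down when finished;
-- no data moves and storage billing is unchanged.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
-- ... run the heavy queries ...
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XSMALL';
```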
In the cloud data warehousing space, Snowflake most directly compares to AWS Redshift, Azure Synapse Analytics and Google BigQuery. If you already have a successful Redshift, Synapse or BigQuery environment, you may be happy to stick with it. However, in our experience, Snowflake is easier to set up and maintain than any of those, as there’s effectively no maintenance other than managing usage and security. Snowflake’s built-in logic takes care of indexing and query optimization, so there isn’t any traditional performance-tuning work.
For connecting to your existing cloud environments and tools, Snowflake offers subscriptions running on AWS, Azure and Google Cloud. The choice makes little difference to your user experience, but picking the cloud provider you already use makes connectivity simpler and cheaper.
Snowflake takes rigorous security measures. It encrypts all data in motion and at rest, and there are options certified for HIPAA and government work. Everything defaults to least privilege. Unlike traditional databases, you can’t give someone access to the entire database in a single step, even if you want to.
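Granting access therefore happens step by step, through roles. A minimal sketch, with illustrative role, database and user names:

```sql
-- There is no single "grant everything" switch; access is built
-- up explicitly through a role.
CREATE ROLE analyst;
GRANT USAGE  ON DATABASE sales               TO ROLE analyst;
GRANT USAGE  ON SCHEMA   sales.public        TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales.public TO ROLE analyst;

-- Finally, give the role to a user.
GRANT ROLE analyst TO USER jane;
```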
Snowflake is getting a lot of attention because it’s a cloud database that’s fast and easy to use, and it integrates smoothly with your preferred cloud platform without locking you into any of them. It’s not the solution to every problem, but if you want to explore and analyze a lot of structured data in a familiar SQL environment, Snowflake fits the bill.