Snowflake is getting a lot of attention because it’s a fast and easy-to-use cloud database, but organizations using the platform still need to focus on managing data access and privacy.
Storing your analytical data in Snowflake opens a new world of possibilities for information access and security.
Snowflake is a database platform like SQL Server or Oracle, but purpose-built from scratch for the cloud. Its developers kept familiar concepts (tables, views, SQL queries) but threw out all assumptions about how databases traditionally work and embraced everything cloud computing offers. You can read more about the basics of the platform here.
Snowflake’s underlying architecture makes it easy to provide high-performance data access to any number of internal and external users with far more efficiency than traditional databases. Just as important, you can fine-tune that access so each consumer sees exactly what you want them to see, right down to the row and column level. You no longer have to make copies of specific data sets or slam the door shut on whole areas of information just because some of it is sensitive.
Why Do Snowflake Security and Data Privacy Matter?
Information is simultaneously valuable and dangerous. Organizations can extract enormous value from the information they collect if they can keep it organized and accessible. However, we all know the names of organizations that have allowed sensitive information to escape. New privacy and reporting regulations continue to emerge at state, federal and international levels. Moreover, losing the trust of consumers and business partners carries an enormous reputational risk.
To get the most out of the platform, modern organizations need an implementation plan that:
- Keeps sensitive data secure not only from unauthorized external access (i.e., hacking) but also from any internal access – even accidental – that does not have a specific, authorized, legitimate business purpose.
- Isolates data owned by different entities so that you can correlate it for legitimate business purposes while making it effectively impossible to combine this data inappropriately or inadvertently. Maybe you want to share trends or benchmarks across customers but ensure they can never access each other’s data.
- Identifies data elements in support of these goals, providing a clear understanding and consistent meaning when slicing and dicing information.
- Tracks the sources and movements of data elements in support of:
While it is ultimately easier to meet all these goals in a Snowflake environment than in traditional databases, it also requires a bit more planning to be effective, consistent, and efficient. This six-part series will propose a set of technology, architecture and process standards to support these goals while balancing cost, maintainability and performance.
Identify, Organize and Isolate Data in Snowflake
Core to any data privacy and management exercise is the ability to identify the data at hand across several dimensions. To meet our goals, we need to know which data is sensitive and private (and where such data is stored), where it came from, who owns it, and frankly, as much as possible about how we can apply it to different purposes – some of which we don’t know yet.
At the beginning of each entry in this series, we’ll go over a few important concepts and recent technological developments that will help us achieve our privacy goals. If you have a lot of database experience, you may know these concepts well, but we often need to think about them a little differently in the context of a Snowflake environment.
Physical and Logical Model
Data is typically stored in a nested series of structures, each smaller (in terms of volume and complexity) than the last, each giving more detail and specificity. In a traditional database, this might look like:
Server -> Database -> Schema -> Table -> Column / Row (which together define a single element).
This provides a rudimentary path to information security (if you do not have access to the server, you cannot access any of the tables or data within) but generally leads to an all-or-nothing approach unsuitable to today’s goals.
This all-or-nothing approach was largely forced upon users in a traditional database environment:
- A single database was dependent on its hardware (server) and could only hold and process so much information.
- Databases were therefore separated by subject area, and querying across them was very slow, if not impossible.
- Granting access to an entire database and all its contents was easy. Granting access to specific subsets of data only was hard.
- A lack of metadata made this hard enough to be unfeasible in most cases – we could restrict access to columns holding private data (e.g., SSN), but we had to find and restrict each one on a case-by-case basis. It was much safer and easier to simply restrict access to the entire database.
Cloud-native tools like Snowflake eliminate the server paradigm and treat databases as logical containers, so you can choose an organizational structure that better suits your security and usage patterns. Snowflake also enforces a security-first model, which makes it effectively impossible to grant access to an entire database and all its contents in a single step unless you have specifically and intentionally designed your model to support that from the beginning. More on this in a later blog when we discuss role-based access control.
The first and strongest line of defense for keeping data private is isolation – there’s a reason we keep our valuables all together in a bank and not lying around the house. However, isolation comes at a cost – you can’t easily use your jewelry if it’s in a safe-deposit box. Snowflake is a highly secure environment providing multiple layers of isolation we can leverage for appropriate data access, much like a bank has a drive-up, an ATM lobby, a main lobby, a vault and so forth.
Organization
No data is stored at the organization level. This concept simply provides control and administration of multiple accounts. A single organizational administrator can manage all of your separate accounts in a single location, including usage and budgeting. An organization does not need to be set up at all if you’ll only be using a single account, but it’s good future-proofing practice if you may need to manage multiple accounts later.
Account
Snowflake stores all the data for a single account across distributed commodity storage in a single public-cloud provider region, such as “Azure/North Central US.” This storage is effectively unlimited and doesn’t depend on any one server or hard drive. It’s the same kind of cloud storage that underpins services like Netflix.
You can copy data from a single account across regions to other accounts within your organization, but you cannot query seamlessly between them. Each account is tied to a cloud region, making this the ideal level of separation to comply with regulations such as GDPR and India’s on-soil requirements. You can choose to physically store all data relating to customers from a given country in a single account hosted in that country.
If you have a legitimate need to share some of that data across borders, you can easily make backup copies into other accounts or set up internal data shares to expose tightly controlled subsets of data between accounts.
Database
Snowflake organizes and optimizes data for querying during the loading process, and you’ll get better performance if related data is in a single database within your account. Snowflake has seen individual customer databases run over four petabytes, so if you organize your data well, you have minimal limitations. Unlike traditional database servers, however, you can seamlessly query across databases within an account with nearly the same performance as in a single database.
You can leverage this fact as part of your organizational structure, supporting isolation and Snowflake security while also allowing limited cross-cutting operations. For example, you can segregate private data into one database while keeping the generally accessible halves of those records in another. Most people can only access the public database, but you can allow certain people to retrieve combined public/private data with a query that joins tables across the two databases.
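As a rough illustration of that kind of cross-database join, the Python sketch below uses SQLite as a stand-in for Snowflake: each attached in-memory database plays the role of one Snowflake database. The database, table and column names (public_db, private_db, employee, employee_private, ssn) are hypothetical and chosen only for this example. In Snowflake itself, the equivalent query would refer to each table by its fully qualified database.schema.table name.

```python
import sqlite3

# SQLite stands in for Snowflake here. Each attached in-memory database
# plays the role of one Snowflake database. All names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS public_db")
conn.execute("ATTACH DATABASE ':memory:' AS private_db")

# Generally accessible half of each employee record.
conn.execute(
    "CREATE TABLE public_db.employee (employee_id INTEGER PRIMARY KEY, department TEXT)"
)
# Sensitive half of each record, kept in a separate database.
conn.execute(
    "CREATE TABLE private_db.employee_private (employee_id INTEGER PRIMARY KEY, ssn TEXT)"
)

conn.executemany(
    "INSERT INTO public_db.employee VALUES (?, ?)",
    [(1, "Finance"), (2, "Sales")],
)
conn.executemany(
    "INSERT INTO private_db.employee_private VALUES (?, ?)",
    [(1, "000-00-0001"), (2, "000-00-0002")],
)

# Join the public and private halves on their shared key.
COMBINED_QUERY = """
SELECT pub.employee_id, pub.department, priv.ssn
FROM public_db.employee AS pub
JOIN private_db.employee_private AS priv
  ON priv.employee_id = pub.employee_id
ORDER BY pub.employee_id
"""
combined_rows = conn.execute(COMBINED_QUERY).fetchall()

for row in combined_rows:
    print(row)
```

SQLite has no role model, so this sketch shows only the shape of the join. In a Snowflake environment, the same query would succeed only for people who have been granted access to both databases.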
Schema
A schema is a virtual structure within a database used for further organization – essentially, a folder for related tables. At this level, performance doesn’t vary, so you can use this however you like to keep things organized and secure. A tagging or security schema, for example, can centralize all the tags and Snowflake security rules you want to use throughout your database, allowing only administrators to maintain them. This can also be useful if you want to give someone the ability to create their own tables in a segregated schema while using – but not editing – the shared tables in the main schema.
Table
Like any traditional database, Snowflake stores the actual data for a specific topic at a single grain (e.g., employee) in a table in rows and columns. This is the lowest level for standard Snowflake security – users receive access to a database, to a schema within the database, and to specific tables within the schema. Unlike traditional database tools, granting a user access to a database or schema does not grant them access to any of the tables within – you must be specific. However, without taking further steps, once a user receives access to a table, they can see all the data in that table.
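To make the three levels of access concrete, here is a minimal Python sketch that builds the Snowflake GRANT statements needed before one table can be read. It assumes privileges are granted to a role rather than directly to a user, and the helper name grants_for_table_read and the object names (ANALYST, HR_DB, CORE, EMPLOYEE) are hypothetical. Treat the generated SQL as a starting point to check against the current Snowflake documentation, not as a definitive setup.

```python
def grants_for_table_read(role: str, database: str, schema: str, table: str) -> list[str]:
    """Build the GRANT statements a role needs before it can SELECT from one table.

    Identifiers are interpolated as-is, so only pass trusted, already-valid names.
    """
    return [
        # Access to the database container itself; exposes no tables yet.
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role};",
        # Access to the schema inside that database; still exposes no tables.
        f"GRANT USAGE ON SCHEMA {database}.{schema} TO ROLE {role};",
        # Only this statement lets the role read rows from the specific table.
        f"GRANT SELECT ON TABLE {database}.{schema}.{table} TO ROLE {role};",
    ]


# Hypothetical role and object names, for illustration only.
statements = grants_for_table_read("ANALYST", "HR_DB", "CORE", "EMPLOYEE")
for statement in statements:
    print(statement)
```

Dropping any one of the three statements leaves the role unable to query the table, which is the "you must be specific" behavior described above.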
Data Shares
Data shares are one of the unique benefits of Snowflake’s underlying storage model. Without getting too far into the weeds, Snowflake’s data is stored and versioned completely separately from how it is accessed (separation of storage and compute). This means you can create a “share” that appears as a read-only database to another Snowflake user inside or outside your organization.
You can grant access to as much or as little data as you’d like, with just as much control and filtering as any other role, without giving that user any access to your Snowflake environment or copying any data. (Snowflake extends this idea to support a data marketplace in which you can sell access to valuable information you collect, but that is beyond the scope of this series.) What’s important to note here is that all of the features discussed in this series apply equally to data shares as they do to any other database or table.
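For a sense of what setting up a share might involve, the Python sketch below builds the statements a data provider could run to expose a single table to one consumer account. The helper name share_statements and every object name (BENCHMARK_SHARE, ANALYTICS_DB, PUBLISHED, CUSTOMER_BENCHMARKS, PARTNER_ORG.PARTNER_ACCOUNT) are hypothetical. The statements reflect Snowflake’s share syntax as best understood here, so verify them against the current documentation before use.

```python
def share_statements(
    share: str, database: str, schema: str, table: str, consumer_account: str
) -> list[str]:
    """Build the statements a provider would run to expose one table through a share.

    Identifiers are interpolated as-is, so only pass trusted, already-valid names.
    """
    qualified_schema = f"{database}.{schema}"
    return [
        # Create the empty share object.
        f"CREATE SHARE {share};",
        # The share needs USAGE on the containers, just as a role would.
        f"GRANT USAGE ON DATABASE {database} TO SHARE {share};",
        f"GRANT USAGE ON SCHEMA {qualified_schema} TO SHARE {share};",
        # Expose only the specific table you intend to share.
        f"GRANT SELECT ON TABLE {qualified_schema}.{table} TO SHARE {share};",
        # Make the share visible to the consumer's account.
        f"ALTER SHARE {share} ADD ACCOUNTS = {consumer_account};",
    ]


# Hypothetical names, for illustration only.
share_sql = share_statements(
    "BENCHMARK_SHARE",
    "ANALYTICS_DB",
    "PUBLISHED",
    "CUSTOMER_BENCHMARKS",
    "PARTNER_ORG.PARTNER_ACCOUNT",
)
for statement in share_sql:
    print(statement)
```

The grants mirror the database, schema and table pattern used for roles, which is consistent with the point that the same controls apply to shares as to any other access path.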
What’s Next for Maintaining Snowflake Security and Data Privacy?
In the next five entries in this series, we’ll discuss options and features that allow you to be careful and confident in the security of your data in Snowflake while also allowing you to share and make use of information far more easily and efficiently than with traditional tools.