Snowflake tags are another tool you can use to boost your data’s security. In this blog, we explain what object tagging is and how you can use it for data governance.
In part one of our Snowflake security blog series, we discussed how to think about storing and organizing your data, from the organization level all the way down to individual tables.
In part two, we discussed ways to think about data access and control at a granular level. In this blog, we’ll look at a newer Snowflake feature, object tagging, which gives us another tool in our toolbox to identify which data needs granular security.
First, let’s review a key concept relevant to Snowflake tags: data vs. metadata.
Data is the information itself – date, amount, product, customer, and so on. Metadata is information about the data. For example, we know that this nine-digit number (our data) is a Social Security number (our metadata). Historically, we had to rely on a combination of naming conventions and human judgment to glean the metadata about our data: we hoped the table was called something like “employee” and the relevant column had a name like “social_security_number,” or we looked for patterns like “123-45-6789.” Now, we can tag our tables and columns directly, making it easy to keep track of important fields and control access to them.
What Are Tags in Snowflake?
Typical databases, feeds and files hold only data with minimal metadata (e.g., column names). However, Snowflake provides the opportunity to store much more metadata in the form of tags. Snowflake offers the ability to create and apply tags to databases, schemas, tables, views and columns (and also to users and roles, which we will discuss in a future post).
Tags are made of key-value pairs. For example, “cost_center = finance” or “protection_level = PII” or “PII_type = email.” Equally important, you can set up multiple tags and apply them to the same piece of data: “personally-identifiable: true; sensitive: true; type: SSN; category: employee; owner: HR” and so on. Tag names and values can be anything we like, so they require careful management to be useful.
Further, tags are inherited based on where you apply them – so if you tag a table “protection_level = private,” every column in that table will also be tagged as “protection_level = private.” This applies the same way at higher levels: if you tag an entire database as “protection_level = private,” then every schema, every table, every view and every column in that database will be tagged as “protection_level = private.” This again argues for careful management but is ultimately very useful. If you write a query that brings back columns from multiple places, each column pulled from anywhere in the private database will carry the “protection_level = private” tag in Snowflake’s tagging repository.
You can also override or add to tags. A specific table in the “cost_center = finance” schema may have its tag overridden to “cost_center = finance_north_america.” All the columns in a table with “protection_level = PII” will have the same “protection_level = PII” tag but can also have a specific tag such as “PII_type = email” appended, so both pieces of information are returned when you query information about that column.
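The inheritance and override behavior described above can be modeled in a few lines of plain Python. This is only a sketch of the resolution rule (the nearest level that sets a tag wins), not Snowflake's implementation; the object names and the `DIRECT_TAGS` dictionary are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical hierarchy: each object path maps to the tags set directly on it.
DIRECT_TAGS: Dict[str, Dict[str, str]] = {
    "hr_db": {"protection_level": "private"},
    "hr_db.finance": {"cost_center": "finance"},
    "hr_db.finance.invoices_na": {"cost_center": "finance_north_america"},
    "hr_db.finance.invoices_na.contact_email": {"PII_type": "email"},
}


def ancestors(path: str) -> List[str]:
    """Return the object and its parents, nearest first (column, table, schema, database)."""
    parts = path.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def effective_tags(path: str) -> Dict[str, Tuple[str, str]]:
    """Resolve each tag to (value, object it was set on); the nearest level wins."""
    resolved: Dict[str, Tuple[str, str]] = {}
    for obj in ancestors(path):
        for tag, value in DIRECT_TAGS.get(obj, {}).items():
            # setdefault keeps the value from the nearest object and ignores
            # values for the same tag found further up the hierarchy.
            resolved.setdefault(tag, (value, obj))
    return resolved


if __name__ == "__main__":
    for tag, (value, source) in effective_tags(
        "hr_db.finance.invoices_na.contact_email"
    ).items():
        print(f"{tag} = {value} (set on {source})")
```

In this model the column ends up with three tags: its own “PII_type,” the table-level override of “cost_center,” and the database-level “protection_level.” Recording where each value came from mirrors the lineage idea discussed later in this post.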
Important note: Many different tags can be applied to the same table or column, but you cannot set multiple values for the same tag on the same column. We need to plan and organize our tags carefully in situations where multiple pieces of similar information may apply. For example, you cannot combine “protection_level=PII” and “protection_level=GDPR”, but you can combine “PII=true” and “GDPR=true”.
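To make the one-value-per-tag rule concrete, here is a toy Python model. The `TagStore` class is hypothetical and exists only for illustration. It assumes that assigning a tag a second time on the same object replaces the earlier value, which is why a single “protection_level” tag cannot carry both PII and GDPR while two boolean tags can.

```python
from typing import Dict


class TagStore:
    """Toy model: each object holds at most one value per tag name."""

    def __init__(self) -> None:
        self._tags: Dict[str, Dict[str, str]] = {}

    def set_tag(self, obj: str, tag: str, value: str) -> None:
        # Assumption in this model: a second assignment replaces the first.
        self._tags.setdefault(obj, {})[tag] = value

    def tags_on(self, obj: str) -> Dict[str, str]:
        """Return a copy of the tags set on the given object."""
        return dict(self._tags.get(obj, {}))


if __name__ == "__main__":
    store = TagStore()

    # One tag, two candidate values: only the last one survives.
    store.set_tag("employee.email", "protection_level", "PII")
    store.set_tag("employee.email", "protection_level", "GDPR")

    # Two separate boolean tags: both facts are kept.
    store.set_tag("employee.ssn", "PII", "true")
    store.set_tag("employee.ssn", "GDPR", "true")

    print(store.tags_on("employee.email"))
    print(store.tags_on("employee.ssn"))
```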
Combining Tagging and Masking
In the previous entry in this series, we discussed the Dynamic Data Masking feature that lets us hide the contents of sensitive fields from people without access. Now, we can use tagging to support and improve our masking efforts. Snowflake has not yet implemented a fully dynamic combination of tagging and masking: you must set Dynamic Data Masking explicitly on each column, and a masking rule cannot automatically pick up whatever tags happen to be on the column it protects. However, we can still use tag information to drive our masking implementation and audit it for completeness. We can query Snowflake to find all tagged columns that should be masked per our rules, query to find which of them are or are not masked, and apply masking where it’s missing. We can also query Snowflake’s history tables to see who accessed tagged tables or columns and when.
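The completeness audit described above boils down to a set comparison: columns whose tags say they should be masked, minus columns that already have a masking policy. The Python sketch below assumes you have already pulled both lists out of Snowflake. The function name, the input shapes, and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def unmasked_sensitive_columns(
    column_tags: Dict[str, Dict[str, str]],
    masked_columns: Set[str],
    sensitive: Dict[str, Set[str]],
) -> List[str]:
    """Return columns that carry a sensitive tag value but have no masking policy.

    column_tags    maps a column name to the tags that apply to it.
    masked_columns is the set of columns that already have a masking policy.
    sensitive      maps a tag name to the values that require masking.
    """
    gaps = []
    for column, tags in column_tags.items():
        needs_mask = any(
            value in sensitive.get(tag, set()) for tag, value in tags.items()
        )
        if needs_mask and column not in masked_columns:
            gaps.append(column)
    return sorted(gaps)


if __name__ == "__main__":
    tags = {
        "employee.ssn": {"protection_level": "PII", "PII_type": "ssn"},
        "employee.work_email": {"protection_level": "PII", "PII_type": "work_email"},
        "employee.hire_date": {"cost_center": "finance"},
    }
    already_masked = {"employee.ssn"}
    rules = {"protection_level": {"PII"}}
    print(unmasked_sensitive_columns(tags, already_masked, rules))
```

Running a check like this on a schedule turns the tags into a simple control: any column it reports is tagged as sensitive but still unprotected.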
Combining tagging capability with Dynamic Data Masking, we can create and enforce a hierarchy of permissions:
- Database Tag: “owner = HR”
- Schema Tag: “category = employee_data”
- Table Tag: “protection_level = PII”
- Column Tags: “PII_type = firstname,” “PII_type = lastname,” “PII_type = work_email,” “PII_type = work_phone,” “PII_type = personal_phone,” “PII_type = dob,” and “PII_type = ssn”
You can keep this simple by using tags to inform your masking requirements, or you can write a slightly more complicated masking policy that calls the SYSTEM$GET_TAG function to enforce masking based directly on the tags in place. This isn’t completely dynamic – you need to code for the specific tag and column combinations you want to check – but it does make your code more self-documenting and secure. (Since processing this lookup logic will take some computing resources, you’ll want to do a proof-of-concept to make sure your specific implementation still performs well if you go this route.)
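The decision such a tag-aware masking policy makes can be sketched in plain Python. This is a model of the logic only, not Snowflake masking-policy syntax. The `RESTRICTED_TAGS` set, the idea of a role holding tag "clearances", and the mask string are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

# Assumption: only these tag names gate access; other tags are informational.
RESTRICTED_TAGS = {"protection_level", "PII_type"}
MASK = "*****"


def apply_mask(
    value: str,
    column_tags: Dict[str, str],
    clearances: Set[Tuple[str, str]],
) -> str:
    """Return the value only if the role is cleared for every restricted tag on the column."""
    for tag, tag_value in column_tags.items():
        if tag in RESTRICTED_TAGS and (tag, tag_value) not in clearances:
            return MASK
    return value


if __name__ == "__main__":
    # Hypothetical HR role: cleared for PII in general and for work emails only.
    hr_role = {("protection_level", "PII"), ("PII_type", "work_email")}

    email_tags = {"protection_level": "PII", "PII_type": "work_email"}
    ssn_tags = {"protection_level": "PII", "PII_type": "ssn"}

    print(apply_mask("pat@example.com", email_tags, hr_role))
    print(apply_mask("123-45-6789", ssn_tags, hr_role))
    print(apply_mask("pat@example.com", email_tags, set()))
```

Here the hypothetical HR role sees the work email, while the Social Security number stays masked because the role lacks the “PII_type = ssn” clearance. A role with no clearances sees only masked values.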
Given a masking implementation using the example tags above, if any user in your company who has not been granted access to the “category = employee_data” tag happens to find the Employee table and tries to query it, they’ll get:
An HR user within your company who has been granted access to the “protection_level = PII” tag and to some specific PII_type tags will get:
Tag Management
Because tags are so flexible, we must guard against proliferation and inconsistency. The best practice is to create a separate security database that holds all security-related information, and within that, a “Tag_Library” schema where you can define and manage all tags in a central location. Specific roles are recommended for:
- Tag_Administrator – A person who is allowed to create brand-new tags (such as “owner” or “PII_type”)
- Tag_Steward – A person who is allowed to add new values for an existing tag (such as “work_email” or “personal_phone” as new PII_types)
- Tag_Manager – A person (or program) who can apply tags to databases, schemas, and other objects.
Snowflake offers some automatic tagging (currently in early preview). During the process of loading data into a new Snowflake table, Snowflake can look for patterns like “(###) ###-####” and apply best-guess tags to columns. This is a nascent capability, however, so we need to have our own approach to review and augment any automatic tags.
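To show the general idea behind pattern-based tag suggestions, and why they need human review, here is a small Python sketch. It is not Snowflake's classification logic. The patterns, the `suggest_pii_type` function, and the 80% threshold are illustrative assumptions.

```python
import re
from typing import List, Optional

# Illustrative patterns only; real data needs broader and more careful rules.
PATTERNS = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "phone": re.compile(r"\(\d{3}\) \d{3}-\d{4}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def suggest_pii_type(samples: List[str], threshold: float = 0.8) -> Optional[str]:
    """Suggest a PII_type when enough non-empty sample values match one pattern."""
    values = [s for s in samples if s]
    if not values:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= threshold:
            return label
    return None


if __name__ == "__main__":
    print(suggest_pii_type(["123-45-6789", "987-65-4321"]))
    print(suggest_pii_type(["(555) 123-4567", "(555) 765-4321", "n/a"]))
    print(suggest_pii_type(["pat@example.com", "lee@example.org"]))
```

The second call returns no suggestion because only two of the three samples match the phone pattern, which falls below the threshold. Borderline cases like this are exactly where a Tag_Steward's review matters.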
Snowflake also offers the ability to monitor and audit tag usage:
- The snowflake.account_usage.tags view shows all tags that have been created.
- The snowflake.account_usage.tag_references view shows all the places each tag has been applied (each database, schema, view, table, column). You can query it with a filter to zero in on a specific object as desired.
- The snowflake.account_usage.tag_references_with_lineage function includes not only what tags exist on an object but how they got there (e.g., column-level tags inherited from table, which are inherited from database).
Some data cataloging and lineage tools can make use of these tags. The Alation and OneTrust data catalogs are the first tools explicitly supporting Snowflake tags, but many others are expected. Using a tool like this, you can pull descriptions and locations of all your tagged data into a data catalog, making it easy to see where all instances of PII (for example) are stored and how people are using them.
Tag Limitations
Tags are meant for big-picture information about an entire table or column – metadata that describes the contents and can be used for limited enforcement. However, you cannot apply tags to specific data inside a table itself (e.g., we cannot tag certain rows in a table as belonging to Customer 1 or Customer 2). Not to worry – we can handle this level of filtering using row access control, discussed in the next entry of our series.
Why Do Snowflake Tags Matter?
Remember, the first step in keeping our data private and secure is identifying which data needs protection. Traditional data catalogs and governance processes have to make do with educated guesses about your data and are hard to keep in sync with reality. With built-in data tags, we can keep track of important information right at the source, provide it to all our data consumers, and use it to ensure we’re properly protecting our most sensitive assets.