Dynamic Data Warehousing of Semi-Structured Data with Snowflake

Here at Full 360 we consider ourselves “data centric… technology agnostic”. When faced with a data problem we try to identify the software solution that will best handle our client’s needs, as opposed to a “one hammer fits all” approach of using only tools that we know. That being said… over time we have developed preferences for certain products and providers. These preferences have grown from our experience doing BI for 15+ years, and understanding the problems that emerge over the lifetime of the application. Our primary preference for reporting on “big data” is the use of columnar databases. We already have what we consider to be the two strongest columnar database options under our banner of expertise… and have found that most of the other options in the marketplace aren’t worth pursuing because they don’t bring anything new to the table… or are simply columnar options for enterprise RDBMS offerings.

Earlier in 2014, Snowflake Computing contacted us about a sneak preview of their SaaS columnar data warehouse offering. After seeing the demo I was immediately excited and took them up on their offer for a free account so that I could give it a spin. Now that the product has become publicly available I am free to share my excitement with you!

Snowflake is a columnar, SQL database designed from the ground up to run on Amazon Web Services. Snowflake is not based on Postgres or MySQL like other columnar offerings, but is 100% new code written by the Snowflake engineers. The architectural features of Snowflake take advantage of S3 persistence and EC2 dynamic provisioning. Before we dive deep… let me provide an overview of how Snowflake is structured.

Architecture

Snowflake has three layers to it’s architecture which are shown in the graphic below:

architecture

  • Cloud Services - This is where you find the SaaS endpoint. You use this endpoint to connect to your data warehouse via JDBC, or ODBC. You can also use the rich web UI which allows you access to all aspects of your environment, including an in-browser SQL editor. In addition to providing the endpoints, the Cloud Services layer is where all of your metadata is defined and stored, such as databases, schemas, tables, sequences, etc. You can log in via the web UI and immediately begin defining your database structure.

  • Compute - The secret sauce! Snowflake allows you to define “warehouses” which are compute clusters built from EC2 instances. You can not execute a SQL DML statement without a running warehouse. Warehouses can be dynamically resized at any time by way of the UI or SQL statements. The sweetest feature of warehouses is that you can run as many separate clusters as you want… in parallel to each other… at the same time.

  • Storage - While the metadata is managed and stored in the Cloud Services layer, the actual database data files are stored in Amazon S3. This means that the moment your data is loaded into Snowflake, it is backed up across multiple physical data centers.

Warehouse Cluster Use Cases

In order to understand the power of this architecture, let’s consider a few common BI scenarios that occur frequently in business organizations and how Snowflake warehouses address the issue.

End of Month Reporting

Finance organizations usually do most of their heavy analysis one week out of the month. The servers are typically sized to either handle the peak usage, which wastes expensive computing resources during the other 3 weeks of the month… or the servers are sized too small… causing user frustration durning month end.

Snowflake allows you to run a small cluster during the month, with the ability to bump it up to a large cluster for the week of month end reporting. This keeps the system available all month long while matching the resource costs to the performance requirements.

Intensive Load Process

Many systems have complex transformations that take place during continuous (or intraday) load processes. These transformations may eat up a lot of CPU and memory while they are taking place, which steals resources from the reporting users… resulting in slow or unpredictable query times.

With Snowflake you can create a warehouse cluster that is sized to the needs of reporting users, and run a parallel cluster that is sized to the need of the load process. The users and load process are never using the same CPU and memory, as they are running on physically separate EC2 servers. This allows for a consistent user reporting experience. The subtlety here is that Snowflake caches data files (stored in S3) to the compute servers in order to speed up subsequent queries on the same data. The Cloud Services layer keeps all warehouse clusters up to date with the latest transaction information, thus preventing queries delivering outdated data from the cache. A database with data that is continuously churning due to frequent loads will still remain transactionally consistent across all clusters.

Demanding Executive Dashboard Users

Big data BI systems often have a few executive users that simply refresh a dashboard to see high level metrics as they develop throughout the day. These users don’t want to hear about how their reports are slow due to load processes and super user queries. All they want is for their dashboards to refresh quickly and consistently. This can be difficult to guarantee when the load process, super users, and executive users are all drawing from the same CPU and memory resources.

Again… having a separate warehouse cluster allows you to provide consistent performance for your most demanding users, without having to sacrifice performance in other areas.


Every aspect of Snowflake can be manipulated via SQL statements, which provides immense flexibility for managing the warehouses. It is very easy to schedule a SQL script to increase/decrease warehouse sizes based on projected needs. Another option for warehouse management is to use the AUTO_SUSPEND and AUTO_RESUME attributes of the warehouse definition, which will automatically shut down the warehouse after a defined period of inactivity, and bring it back up from suspension when a user connects again. Bringing it back up may take a couple minutes, but for infrequently used systems you will be able to significantly trim resource costs.

Data Loading

Snowflake provides the ability to load data in three ways:

  • SQL INSERT statements
  • COPY statement to bulk load from S3 buckets
  • COPY statement to bulk load from your local filesystem

In order to perform a bulk load, you need to define a file format object which is stored with the database metadata. This is simply a definition of a source data file, such as delimiters, encoding, data type (csv, JSON, XML). For S3 loading you will also need to define a bucket location with the associated API key credentials. Once these objects are defined they can be reused easily by any user with the appropriate permissions.

Technically Snowflake only bulk loads from S3 sources. In order to support loading from your local filesystem Snowflake provides anonymous, table specific S3 staging areas to which you can upload your files by way of a PUT statement. Once staged you can use the COPY statement to perform the actual data load.

Semi-Structured Data

These days it is not uncommon for your source data to be in a JSON (or other semi-structured) format. RDBMS systems are beginning to provide native functionality for handling JSON document parsing from within the database, as opposed to handling the parsing in an external ETL tool such as Talend or Hadoop… or… even greasier… custom coding with a language such as Ruby, Java, or Python.

In Snowflake, parsing JSON is made possible by way of a VARIANT data type. You can load your JSON documents into a table, and query the JSON source directly using the colon based syntax shown below:

select
    column_name:attribute1:attribute2 as attribute2,
    column_name:attribute1:attribute3 as attribute3
from
    table_containing_json;

While this is very powerful for data exploration by developers, it doesn’t directly help end users who wish to query the data using an interactive BI tool. In order to make semi-structured data useful to end users you will likely have to parse the data into a relational model consisting of multiple tables that are joined at query time. Snowflake provides the functions that allow you to perform this parsing entirely with SQL statements. Most notable are the TABLE(FLATTEN()) functions which return a relation created from arrays nested in your JSON structures.

Below is a SQL script that I have created which allows for an end to end build of a Snowflake data warehouse. This script is intended to be executed from a local JDBC client such as RazorSQL, SQL/WorkbenchJ, or the Henplus based Snowflake command line client. To summarize, the script performs the following steps:

  1. Define, provision, and build the warehouse, database, and schema.
  2. Create the necessary objects to load a JSON file from your local machine.
  3. Create reporting tables from the JSON source data, making use of the JSON parsing features.
  4. Run a query that depends upon the data being in a relational model.
--create a hardware cluster to use for DML and data loading
CREATE WAREHOUSE snowflake_blog_warehouse WITH WAREHOUSE_SIZE = 'SMALL' WAREHOUSE_TYPE = 'STANDARD' AUTO_SUSPEND = 3600 AUTO_RESUME = FALSE;
USE WAREHOUSE snowflake_blog_warehouse;

--create database and schema definitions
CREATE DATABASE snowflake_blog;
USE DATABASE snowflake_blog;
CREATE SCHEMA people_data;

--create a table to store raw JSON data
--variant data type allows for flexible querying using special functions
CREATE TABLE people_data.people_json_src
(
    src_json variant
);

--named file formats allow for predefined handling at load time... such as delimiters.. encoding.. etc.
CREATE FILE FORMAT people_data.json TYPE = 'JSON' COMPRESSION = 'AUTO' ENABLE_OCTAL = FALSE ALLOW_DUPLICATE = FALSE STRIP_OUTER_ARRAY = FALSE IGNORE_UTF8_ERRORS = FALSE;

--when running from local machine... you can upload a file directly into an anonymous staging area in your snowflake S3 account
--be sure to change the file path to match the location on your local machine!
PUT file:///coding/snowflake_blog/sample.json @people_data.people_json_src;

--idempotently loads all files which have been staged for this table
COPY into people_data.people_json_src file_format='people_data.json';

--snowflake supports transactions
COMMIT;

--create a table directly from a query that parses the JSON data
--this is not good practice for production... but it works just fine for this demo
CREATE TABLE people_data.people
as
SELECT
	src_json:"_id"::varchar as id
    ,src_json:guid::varchar as guid
    ,src_json:isActive::varchar as isActive
    --requires some fancy parsing to handling $1,234.00 format
    ,regexp_replace(src_json:balance,'\\$|\,','')::decimal(15,2) as balance
    ,src_json:picture::varchar as picture
    ,src_json:age::number as age
    ,src_json:eyeColor::varchar as eyeColor
    ,src_json:name.first::varchar as first_name
    ,src_json:name.last::varchar as last_name
    ,src_json:company::varchar as company
    ,src_json:email::varchar as email
    ,src_json:phone::varchar as phone
    ,src_json:address::varchar as address
    ,src_json:about::varchar as about
    ,src_json:registered::varchar as registered
    ,src_json:latitude::decimal(15,6) as latitude
    ,src_json:longitude::decimal(15,6) as longitude
    ,src_json:greeting::varchar as greeting
    ,src_json:favoriteFruit::varchar as favoriteFruit
FROM
	people_data.people_json_src
WHERE
  --will validate that the row has good JSON structure before processing
	check_json(src_json) is null;

--create a tags table
CREATE TABLE people_data.tags
as
SELECT
	src_json:_id::varchar as id
	,tags.value::varchar as tag
FROM
	people_data.people_json_src p
	,TABLE(FLATTEN(p.src_json:tags)) tags;

--create a friends table
CREATE TABLE people_data.friends
as
SELECT
	src_json:_id::varchar as id
	,friends.value:id::number as friend_id
	,friends.value:name::varchar as friend_name
FROM
	people_data.people_json_src p
	,TABLE(FLATTEN(p.src_json:friends)) friends;

--here is a query that depends on the join
select
	*
from
	people_data.people
where
	id in (select id from people_data.friends where friend_name='Parks Ray');

Here is a link to the JSON data I used in the script:

sample.json

Web Based User Interface

All Snowflake functionality is accessible via the UI such as the ability to create objects, monitor query performance stats, and even edit and execute SQL to view the results.

sql_editor

All of the object creation screens in the UI allow you to view the SQL equivalent of the command. This is very helpful for learning the syntax for their custom functionality. Below is a screen shot of the screen for creating a warehouse, followed by the results of the “Show SQL” button in the bottom left corner.

create_warehouse

create_warehouse_sql

Further use cases, limitations, and conclusions…

The easy provisioning and SaaS nature of Snowflake are likely to make it a good fit for customers who want as little infrastructure to manage as possible. For example… one could use Snowflake with Tableau Online and have SaaS BI visualizations running against a SaaS data warehouse… with zero infrastructure to manage. Of course there needs to be a process somewhere to load data into the warehouse. This is typically the most challenging part of any BI implementation. I am still holding out hope for a good SaaS ETL tool.

There are currently a few limitations that I am hoping the Snowflake engineers will overcome in subsequent releases, such as:

  • No support for running inside an Amazon VPC. The SaaS nature of Snowflake currently prevents this, and it will likely be a deal breaker for many enterprise clients. All traffic to and from Snowflake is SSL encrypted, but it still may not be enough to persuade security architects.

  • No support for client side S3 encrypted source data.

  • S3 buckets are constrained to a specific naming convention. We tend to use prod.datatype.client_name.com as bucket names… but Snowflake doesn’t like the dots… only dashes such as prod-datatype-client-name-com. This isn’t a problem for new projects… but it could be annoying when doing a POC with a large, existing S3 data set.

Overall I am quite impressed with the functionality of the first release of Snowflake. I applaud true innovation and these guys have definitely created something to be proud of. I expect that future releases will work out the current shortcomings and make the product an excellent tool in our BI arsenal!


Comments

comments powered by Disqus