In my last article I summarised the top 20 requirements of a modern data warehouse. In this article I'll explain how Snowflake, the only data warehouse built for the cloud, delivers on the wish list, and even exceeds the requirements with some remarkably innovative solutions.
Independently Sized Workload Separation
Let's say you've joined a multi-terabyte data warehouse project as the solution architect, only to find the system is consistently failing to meet the demanding performance requirements. The batch ELT processes consistently run for more than two hours, and impact business intelligence reports. How quickly could you stand up a larger server to process the workload? Two months? It could take weeks to get the justification for the capital expenditure alone.
How about 208 milliseconds?
Using the following SQL statement on Snowflake, I managed to stand up a four-node (MEDIUM) virtual compute cluster, fully installed and ready to accept queries against an existing multi-terabyte data warehouse.
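The original statement was captured as a screenshot, so here is a representative version; the warehouse name is illustrative, not from the original system:

```sql
-- Create a 4-node (MEDIUM) virtual warehouse, available for queries on creation
CREATE WAREHOUSE prod_elt_wh
    WAREHOUSE_SIZE      = 'MEDIUM'
    INITIALLY_SUSPENDED = FALSE;
```

No hardware is provisioned by the user; Snowflake allocates the compute behind the scenes, which is why the statement returns in milliseconds.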
Even if you’re lucky enough to be running a relatively modern multi-node MPP system, it could take days to install an additional node, and even in the cloud it can take anything from a few hours to several days of downtime to rebalance the data.
An incredible benefit of the Snowflake solution is that every workload can be executed against its own separate virtual compute cluster (called a virtual warehouse), which avoids the usual tug of war between batch loading ELT processes and low latency dashboards. In addition to being independent, each virtual warehouse can be individually resized to fit the workload and budget available.
Independently Sized Elasticity
Having started a new dedicated ELT compute cluster, you then discover there’s been an unexpectedly large increase in data volume, and the batch ELT processes will not complete on time. You now need to adjust the resources to complete the workload faster.
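The resize referred to below is a single statement; a representative version (the warehouse name is illustrative) looks like this:

```sql
-- Resize the dedicated ELT warehouse from MEDIUM (4 nodes) to X2LARGE (32 nodes)
ALTER WAREHOUSE prod_elt_wh SET WAREHOUSE_SIZE = 'X2LARGE';
```

The resize applies to queued and new statements; work already executing completes on the original cluster.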
The above SQL statement took just 118 milliseconds on the production system to resize the dedicated ELT warehouse. Without any interruption to the running load process, all queued and newly submitted SQL statements automatically started on the larger 32-node cluster with eight times the compute capacity.
To keep it simple, Snowflake clusters come in a range of T-shirt sizes, starting at X-Small (one node), through Small, Medium and Large, up to a massive 128-node 4X-Large cluster. Each increment is designed to deliver double the capacity of the previous size.
In this case, as it's not clear how long the ELT process will take to complete, it's possible to set the cluster to automatically suspend after five minutes of inactivity, and automatically restart when the next batch job is executed. This happens transparently to the application and, like other operations, typically executes within milliseconds.
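Suspension and resumption are controlled by two warehouse parameters; a hedged sketch (warehouse name illustrative):

```sql
-- Suspend after 5 minutes of inactivity; resume transparently on the next query
ALTER WAREHOUSE prod_elt_wh SET
    AUTO_SUSPEND = 300      -- seconds of inactivity before suspending
    AUTO_RESUME  = TRUE;    -- restart automatically when a query arrives
```

While suspended, the warehouse accrues no compute charges; only storage is billed.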
The diagram above illustrates the new situation. Business intelligence users who were previously experiencing poor performance during batch ELT loading now have dedicated hardware with zero contention with other users. Meanwhile, regardless of when it starts or finishes, the batch ELT process can maximise throughput, running as many parallel processes as needed and taking 100% of the resources to complete tasks as quickly as possible.
The diagram above illustrates the situation on Snowflake whereby compute-intensive data science operations run independently of dashboard queries that need sub-second response times. Development users can even execute test queries against the production database without impacting anyone else, as they have their own independent compute cluster. This compares well to other "query services" such as Google BigQuery, as Snowflake provides clear per-second pricing for the compute, which allows the user to match costs to budget.
Double Performance for the Same Cost
One feature of Snowflake often overlooked is the way in which it’s possible to exploit the incremental size of each cluster to deliver a given workload in half the time for the same cost. This is possible because each incremental T-shirt size delivers roughly double the compute performance, at exactly double the cost, which means a large task can be executed in half the time.
To put this another way, the cost of running a SMALL cluster for four hours is the same as a MEDIUM cluster for two.
To demonstrate this feature, I executed a simple CTAS (Create Table As Select) operation on a 1.4 terabyte table holding 29 Billion rows, each time increasing the cluster size. This is a relatively simple operation for which tuning is not an option as it's a pure brute force task, similar to many ELT operations.
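The test statement was of the following general form; the table names here are my own illustration, not those of the original benchmark:

```sql
-- Brute-force CTAS: a pure scan-and-write task with no tuning options,
-- so throughput scales almost linearly with the node count
CREATE TABLE sales_copy AS
SELECT *
FROM   sales;   -- ~29 billion rows, ~1.4 terabytes
```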
As the above test demonstrates, performance doubled at each step in line with the node count, while the credit cost remained about the same. Eventually we'd expect the gains to tail off, but the test demonstrates a principle which can be successfully exploited when performing large-volume data processing tasks. This operation was both cheaper and faster on the XX-Large cluster, completing in just over seven minutes, than on the X-Small cluster, which ran for five hours.
When the process was complete, I simply suspended the cluster avoiding further compute charges.
Clearly this is of most benefit when executing known large-volume or complex workloads; a large data load, ELT or complex analytics task would benefit most from this feature. For ad-hoc business intelligence, it's sensible to size the compute resources to the workload demands. However, where an increasing number of users leads to performance issues, there's an alternative approach to maximise concurrency.
If contention for limited compute resources is the primary challenge for solution architects, then being able to deliver consistent performance to growing numbers of users comes a close second. The diagram below illustrates a common problem on many warehouse platforms: the challenge of maintaining good performance as the number of concurrent users varies throughout the day. The difficulty is that, as the number of users increases, the hardware eventually hits its physical limits, and either performance degrades or queries are queued.
This situation was experienced at Deliveroo in the UK, who in 2017 ran over 7,000 queries per hour over 17 terabytes of data, and with a query growth rate of over 650% per year.
The challenge in this case was to quickly react to a sudden increase in the number of online users, which is subtly different to the ELT load challenge which needs a larger server to match the workload size and complexity.
Snowflake delivers a unique solution to this problem, by automatically scaling out (and back in) the compute clusters to match the demands. The SQL statement below creates a new Virtual Warehouse, completely independent of the ELT warehouse above, and includes the option to automatically start up to 9 additional clusters as needed.
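A representative version of that statement (warehouse name illustrative) might be:

```sql
-- Multi-cluster warehouse: scales out to up to 10 same-size clusters under load
CREATE WAREHOUSE bi_wh
    WAREHOUSE_SIZE    = 'MEDIUM'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 10        -- 1 base cluster plus up to 9 additional
    SCALING_POLICY    = 'STANDARD'
    AUTO_SUSPEND      = 300
    AUTO_RESUME       = TRUE;
```

Note this scales *out* (more clusters of the same size) to handle concurrency, as distinct from the resize above, which scales *up* for bigger individual workloads.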
The diagram below illustrates how same-size compute clusters are automatically added to (and removed from) the warehouse as the number of concurrent queries rises and falls. Using a SCALING_POLICY of STANDARD means Snowflake automatically starts additional clusters when it detects queries are being queued. This is designed to maximise query responsiveness and concurrency.
The typical alternative to the above solution involves sizing the hardware, perhaps for the next five years based upon the biggest workload on the worst day of the year. This means you pay for the capacity whether you use it or not.
With Snowflake multi-cluster warehouse technology, the costs (represented by the red line), are entirely in line with demand. When the servers are quiet, and no queries are being executed, the cluster is suspended, and yet instantly available when needed, with capacity charged dependent upon warehouse size, and billed by the second.
“What we like about Snowflake is that it elastically scales up and down, so if we have a peak for three hours on a particular day, we can just ramp up the number of clusters, but we don’t have to pay for that reserved instance during the rest of the week” –Head of Business Intelligence (Deliveroo).
Fast Time to Value
With an increasingly global 24x7 economy, batch data loading is no longer sufficient, and businesses need data arriving in near real-time, and almost immediately available for analysis. To support this, Snowflake built Snowpipe, a continuous data ingestion service.
The diagram below illustrates the key components of the Snowpipe architecture which provides a REST endpoint to identify files to be loaded from AWS S3 or Microsoft Azure blob storage into the Snowflake database.
Snowpipe is a serverless process with no need to write application code; just a few simple DDL statements are needed to configure the solution. This facility is provided as a completely hands-off service which, in common with other Snowflake charging plans, is billed by the second.
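As a hedged sketch of those DDL statements (the stage, pipe and table names are my own, and a real stage would also need credentials or a storage integration):

```sql
-- External stage pointing at the cloud storage location where files arrive
CREATE STAGE sales_stage
    URL = 's3://my-bucket/sales/'
    FILE_FORMAT = (TYPE = 'JSON');

-- The pipe defines the COPY that Snowpipe runs for each file
-- identified via the REST endpoint
CREATE PIPE sales_pipe AS
    COPY INTO raw_sales
    FROM @sales_stage;
```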
For bulk data transfer, or situations where data is delivered as a batch operation, Snowflake provides a bulk loading tool in which data is encrypted end-to-end, and has demonstrated benchmark performance of around 15 Terabytes per hour of uncompressed data.
The rapid growth in popularity of NoSQL solutions including MongoDB and Couchbase was largely driven by the need to handle semi-structured data in JSON format. A typical modern data warehouse can expect to ingest data from multiple sources, and in addition to each using a different structure, JSON structured data may change over time as new attributes are added. This is challenging for traditional data warehouses where the data is held in relatively inflexible fixed-format rows and columns. In these cases an ETL process is normally deployed to unpack the data, which means it's initially stored and analysed in its raw format in a NoSQL database, with subsets extracted and loaded to the warehouse as needed.
The diagram above illustrates a typical JSON streaming data pipeline in which data is extracted, transformed and stored in a Hadoop cluster, with Apache Drill used to analyse the results. As needed, a small subset of the data is extracted to a data warehouse to be combined with existing structured data, creating a complex data processing pipeline. This means that, although JSON data is immediately available for analysis, changes to the JSON structure require corresponding changes to the data extraction process.
Snowflake have resolved this challenge by delivering a native schema-on-read data type called VARIANT which can store structured or semi-structured data. As data is loaded into Snowflake it’s automatically parsed, and the necessary attributes extracted and stored in columnar format.
However, unlike typical schema-on-read solutions where the developer needs to later define the schema structure, in this case, the schema is simply defined in the SQL statement itself. One possible solution is to simply create a view over the JSON structure which isolates business intelligence queries from JSON structure changes.
Using this method, the user can execute simple SQL statements to query the data in place with no complex data transformation required. Unlike NoSQL databases, which create yet another potential data silo, queries on Snowflake can combine both JSON and structured data, with no indexes to build nor tuning required.
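As a hedged sketch (the table and attribute names are assumptions for illustration), landing raw JSON in a VARIANT column and exposing it through a view looks like this:

```sql
-- Land raw JSON documents in a single VARIANT column
CREATE TABLE raw_orders (v VARIANT);

-- The view defines the schema-on-read; BI tools query it like any table,
-- insulated from changes to the underlying JSON structure
CREATE VIEW orders AS
SELECT v:order_id::NUMBER         AS order_id,
       v:customer.name::STRING    AS customer_name,
       v:amount::DECIMAL(10,2)    AS amount
FROM   raw_orders;
```

The `:` notation navigates the JSON path and `::` casts the result to a SQL type, so existing tools see ordinary typed columns.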
"This is incredible technology. This ability to deliver a better solution in a way that uses basic SQL syntax to work with JSON is really fun." – Holt Calder (InterWorks)
The founders of Snowflake have a proven track record in delivery of high-performance database solutions. In building Snowflake from the ground up, they’ve introduced some remarkable innovations which were simply not possible on legacy architectures.
In addition to a columnar data format which is designed specifically to maximise analytic query performance, and a patented micro-partition data structure which requires no indexes to maintain, they designed a multi-level caching system.
The diagram below illustrates the key components, which include parallel data fetches from remote storage to an independently sized cluster of machines backed by fast SSD storage. Finally, all results returned to users are stored in a result cache, which means queries that took minutes to complete can be re-executed in milliseconds.
Using a series of benchmark tests against 1.5 billion rows of data returned the following results: -
- Run from cold: The query took around 20 seconds to scan 12 GB of compressed data.
- Run from SSD: The same query took 1.2 seconds when served from the local SSD cache.
- Run from result cache: The query took 2 milliseconds, and didn't even need a virtual warehouse running, as it's served entirely from the result cache.
Inexpensive, A Low Entry Point and Massive Scalability
Just a few years ago, even a small departmental data mart might cost hundreds of thousands of dollars in up-front hardware and database licence costs before even starting to load data. However on Snowflake, it's possible to create a database and start loading data within seconds. Unlike other cloud-based options where you pay for the hardware deployed, even when you’re not using it, with Snowflake you can simply shut it down, and are billed by the second for compute resources.
Charges are very competitive: storage is billed per terabyte per month, and compute by the second. You can try it for free for 30 days with $400 of credits. It even comes with a set of TPC (Transaction Processing Performance Council) data sets to help evaluate query performance.
Unlike almost every other data warehouse, Snowflake can scale to tables with petabytes of data, and a total of 1,280 servers in a single virtual warehouse. Of course, you can stand up multiple virtual warehouses over a shared database of virtually unlimited size, which means it supports practically infinite scale.
Finally! A Warehouse for the Enterprise
The diagram below illustrates the traditional Enterprise Data Warehouse Architecture used for the past 30 years. It shows data being extracted from multiple source systems, which is transformed and integrated into a single Enterprise Data Warehouse (EDW), with data sub-sets extracted to multiple data-marts, including aggregation where needed.
The challenges in the above architecture include the additional hardware needed to support the ETL process to restructure and integrate the data. Once loaded into the central enterprise warehouse, it's then extracted into multiple data marts for analysis, and typically each mart is deployed on separate hardware to minimise contention.
In addition to the challenge of ensuring consistency between marts (as each is loaded independently and potentially delivered at a different time), this tends to primarily be a batch process. This means the latency between an event happening on a source system, and the data being available for analysis is often hours or even up to a day.
The diagram below illustrates the new architecture made possible by Snowflake. In this solution both structured and semi-structured data can be loaded either using an Extract Load and Transform (ELT) process or streamed using the Snowpipe data streaming service.
Transforming the data in the database can be executed on an independently sized virtual warehouse to avoid contention with end users, and each user group can be assigned an appropriate virtual warehouse to run queries.
Unlike the traditional architecture, users are free to query raw transaction level data or summaries as required, and all users run against the exact same data. Because additional compute resources can be started within seconds, and data storage is virtually unlimited, this architecture provides a true enterprise solution.
Finally, unlike the traditional on-premise solution which will need to be replaced every few years, the Snowflake solution can be scaled up on demand, and advances in server hardware are transparently deployed without interruption to service.
Using Snowflake, it’s finally possible to consolidate data from across the entire enterprise. This can include multiple disparate databases, data silos and data marts in a single, centrally accessible platform. There’s no longer any need to extract and transfer data between systems, as it’s all accessible for query in one place.
Almost Zero Management
The internal database architecture deployed by Oracle in the 1980s has largely remained unchanged to the current day, and the same principles were deployed by IBM and Microsoft. Although the Massively Parallel Processing (MPP) architecture developed by Teradata and Netezza presented a challenge, there has never been, until now, any serious attempt to deliver a solution with almost zero DBA involvement. (For the record, having used Oracle for 30 years I'm highly skeptical of the Oracle Autonomous Database).
To put the complexity in context, Oracle supports 14 types of index, with multiple options to partition the data, and over 450 independently adjustable parameters to tune the system. Most systems need a team of highly experienced database administrators to manage them, in addition to skilled developers to extract the necessary performance.
While upgrading the hardware can be a huge challenge, even performing an Oracle database software upgrade is not a task for the faint hearted, and can easily take a weekend of down-time.
By comparison, Snowflake is entirely deployed as a service in the cloud. There’s no hardware to commission, configure and deploy. There’s no operating system software to install nor patches to apply. There’s no need to upgrade the database systems software, and hardware upgrades are transparently applied.
Unlike Oracle, Snowflake has no indexes to maintain, but deploys a patented variable length micro-partition data structure, is columnar in format, and automatically partitions the data on every column. There are no statistics to capture after the data is loaded, and the data is automatically encrypted using AES 256 bit encryption as standard.
New database software features are rolled out weekly, silently deployed with zero downtime and no DBA involvement. There is no need to configure and maintain backups, and the system includes automatic cross data-centre failover which can sustain the concurrent loss of two independently managed data centres.
"Everything should be made as simple as possible, but not simpler". - Albert Einstein
Accessible for Data Sharing
It is estimated that data processing adds more than $1.7 trillion of value to the G7 economies, and yet the UK extracts just 58% of the potential value, slightly ahead of Germany at 55%. Clearly there is significant room for growth, and the ability to share data with customers, partners and subsidiaries is a key component of the solution.
"The evolving data economy is creating significant opportunities for businesses all over the world” - A. William Stein (Digital Realty CEO)
One of the main challenges in securely sharing business-to-business data is the antiquated methods used, which typically involve one of two approaches: -
- A Web based dashboard: Whereby the producer identifies a predefined set of metrics, and builds a fixed dashboard which is exposed to consumers.
- Extract, Transform and Load: Whereby the producer delivers a set of data files which need to be transformed, loaded and integrated on the consumer platform before being made available for analysis. Then, when the data is updated, the system needs to execute change data capture, and process the changes.
Snowflake have addressed this problem with the Data Sharehouse™, a unique and innovative solution only possible from a centralised cloud-based data warehouse architecture.
The diagram above illustrates the Snowflake architecture in which a Producer creates one or more Data Shares to securely expose selected data to external clients. The entire process takes a few minutes, and may subsequently be accessed in real time, in the following ways: -
- Existing Snowflake customers: Have the option of creating a read-only database which links directly to the Data Share. Using this method, consumers can directly execute queries against the shared data, and either enrich it in real time with other data sources, or analyse and query as required.
- Other customers: Can be provided with a Reader Account, complete with the ability to connect the consumer's preferred data analysis tool to the producer's data share. In this case, all compute resources are provided by and billed to the producer account, and customers are free to query the data as needed.
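The producer-side setup is just a handful of DDL statements; a hedged sketch with illustrative object and account names:

```sql
-- Producer side: create a share and grant access to selected objects
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db               TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.public        TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = consumer_account;

-- Consumer side: mount the share as a read-only database
CREATE DATABASE shared_sales FROM SHARE producer_account.sales_share;
```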
Because the data is never copied, it's never out of date, and changes are immediately reflected on consumer dashboards in real time. Finally, each customer can write their own bespoke analysis and reporting solution, or combine the data with other data sources as needed. Effectively this removes the barriers to monetising data assets, and it's been successfully exploited by some of the world's fastest growing start-ups, including bike-sharing start-up Lime.
"Before Snowflake, it was a massive endeavour to package up data and send it over the public internet, which isn't always secure. Snowflake eliminated all of this. It's the holy grail of data analytics" – Robert Fehmann, Principal Architect (Snagajob)
A SQL Database
A few years ago, the Big Data vendors were promoting Hadoop and NoSQL databases as the replacement for the data warehouse. Subsequent experience has shown, the technical complexity and skills needed to maintain these systems makes them inaccessible to all but the largest, most sophisticated IT departments, and even then at great cost.
Unlike NoSQL databases, which are typically accessible only to Java programmers and use a limited API with eventual consistency and no transaction support, Snowflake is a fully ACID-compliant, ANSI-standard SQL database. While this may at first seem an unimportant point, it does mean Snowflake is accessible from hundreds of existing data integration, business intelligence and advanced analytics tools, and the diagram below illustrates a few.
Innovative Data Processing Solutions
The ability to time travel through data in addition to delivery of zero copy clones delivers two remarkably powerful and flexible data handling tools.
- Time Travel: Is the ability to query data in the past. This gives the ability to re-execute any query on data as it was, at any point in time, over the past 90 days. Furthermore it provides the ability to recover from a mistake and un-drop a table or even an entire database within seconds.
- Zero Copy Clones: Provides the ability to clone a table, all tables in a schema, or even an entire database within seconds at zero additional cost. To put this in context, a test clone of a 1.3 TB table took just 5.3 seconds, as no physical data was copied. While holding the clone online will eventually add to the total storage cost, it does mean you can secure a transactionally consistent 10 terabyte database backup in under a minute, and once created, a clone exists until dropped.
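Both features are exposed through ordinary SQL; hedged examples with illustrative names:

```sql
-- Time Travel: query the table as it was one hour (3,600 seconds) ago
SELECT COUNT(*)
FROM   orders AT(OFFSET => -3600);

-- Zero copy clone: a metadata-only operation, regardless of table size
CREATE TABLE orders_backup CLONE orders;
```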
Combining both of these features provides a powerful but flexible data management toolkit which means: -
- Database Backups: Are effectively eliminated. The ability to produce a transactionally consistent clone of an entire database at a given point in time, means backup copies to external storage are unnecessary.
- Database Restores: Are no longer needed. On a legacy database, re-running a report against archived data typically involves restoring the data from a backup medium to an existing server. On Snowflake, this simply involves creating a virtual warehouse and querying the data from the clone. Unlike the legacy solution, which can take days to complete, this can be completed within minutes, with zero disruption to existing users.
- Agile DevOps: Are a reality. Instead of copying the production database or restoring a backup for testing, it’s possible to take a full clone of the database at a point in time, then simply reload the data using the new code version. The flexibility to resize the virtual warehouse means a full size, full volume performance testing environment can be built within minutes, and then removed when no longer needed.
- Database Recovery: Is a breeze. Imagine a situation where the weekend batch data load routine has corrupted hundreds of tables in the production database, and it's only been detected on a Monday morning. It's possible to quickly identify the initial SQL statement, and clone the entire database at the instant beforehand. Executing a simple swap statement replaces the corrupt database with the clean copy.
- Table Recovery: Is almost instant. It’s easy to accidentally drop a table, a schema or even an entire database, and Snowflake includes an undrop command for each.
To demonstrate just how easy it is to recover an entire multi-terabyte database to a point in time, the following SQL statements take a clone copy, and then swap it with the existing database, effectively restoring it to the situation an hour ago.
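The statements were originally shown as a screenshot; a representative version (database names are illustrative) looks like this:

```sql
-- Clone the database as it was one hour (3,600 seconds) ago
CREATE DATABASE prod_restore CLONE prod AT(OFFSET => -3600);

-- Swap the clone with the live database; the corrupt copy
-- remains available under the old name for investigation
ALTER DATABASE prod SWAP WITH prod_restore;
```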
The entire operation above was completed in 1.3 seconds as it is a metadata operation, where no actual business data was copied.
Simplicity at its very core
As an experienced Solution Architect, I’ve learned that however rewarding it is to produce an elegant solution, simplicity is an even more important objective, and one of the best compliments I’ve heard about Snowflake is “It just works”.
Simplicity is the ultimate sophistication. - Leonardo da Vinci
Having worked in IT for over 30 years, and in data warehouse design and business intelligence since the last century, I find it takes a lot to get excited by technology. However, Snowflake has consistently proven itself to be a remarkably innovative solution, delivering incredible results while remaining completely simple.
It’s been running on Amazon AWS for years, and is now available on Microsoft Azure, and will no doubt also support Google Cloud Platform in time. For data warehousing and analytics, it’s a technology worth watching.
Disclaimer: The opinions expressed in my articles are my own and will not necessarily reflect those of my employer (past or present) or indeed any client I have worked with.