The Definition of a Cloud Data Warehouse
What Is a Cloud Data Warehouse?
A cloud data warehouse is a database delivered in a public cloud as a service managed by the provider. It is optimized for analytics, scale, and ease of use.
Massively parallel processing (MPP) data warehouses emerged in the 1990s, when relational databases struggled to handle the size and complexity of analytic workloads. Being able to load data into a data service and query it with a standard language (SQL) was revolutionary. Roughly 20 years later, around 2010, Hadoop, an open-source project that grew out of Yahoo!, ushered in a second data management revolution. The ability to capture and query raw, unstructured data was a major leap: it made it possible to store, process, and retrieve far more data at a significantly lower cost.
With the advent of cloud-based data warehouses, we are now witnessing the third wave of innovation in data warehousing technology. Enterprises are moving to the cloud and abandoning legacy technologies such as Hadoop in favor of cloud data warehouses. This shift in data management has profound implications for businesses.
Advantages of a Cloud Data Warehouse
Companies can focus on their business instead of managing a large number of servers. Cloud-based data warehouses help them deliver faster and more accurate insights.
- Data Access: Companies can give their analysts immediate access to data from multiple sources, which lets them run better analyses more quickly.
- Scalability: A cloud data warehouse is easier and cheaper to scale than an on-premises system. It does not require buying new hardware, which avoids over- or under-provisioning. Scaling can also happen automatically as required.
- Performance: Cloud data warehouses can run queries faster than traditional on-premises warehouses, and at a lower cost.
Cloud Data Warehouse Capabilities
Each major cloud vendor offers its own cloud data warehouse service: Amazon offers Redshift, Google offers BigQuery, and Microsoft offers Azure SQL Data Warehouse. You can also get the same capabilities from independent offerings such as Snowflake, which runs on the public cloud as a managed service. The following capabilities are available "out of the box" from each of these cloud vendors and data warehouse providers:
- Data storage and management: The data is stored in a cloud file system.
- Automated upgrades: Users never deal with a software "version" or perform an upgrade; the provider handles it.
- Capacity management: It’s simple to expand or contract your data footprint.
Factors to Consider when Choosing a Cloud Data Warehouse
The important differences lie in how cloud data warehouse vendors deliver these capabilities and what they charge for them. Let's explore the deployment and pricing models.
There are two major types of cloud data warehouse architecture. Cluster-based deployments are the older of the two; Azure SQL Data Warehouse and Amazon Redshift fall into this category. Clustered cloud data warehouses are essentially established clustered databases (Redshift, for example, is a PostgreSQL derivative) that were ported to run as cloud services. Serverless is the more modern flavor, with Snowflake and Google BigQuery as examples. Serverless cloud data warehouses hide the database cluster from clients or share it across multiple clients.
Cloud Data Pricing
Pricing is another important difference between cloud data warehousing options. In all cases you pay a nominal fee to store the data. Pricing for compute, however, differs.
Google BigQuery and Snowflake offer on-demand pricing based on the amount of data scanned or the compute time used. Amazon Redshift and Azure SQL Data Warehouse offer resource-based pricing tied to the number of nodes in a cluster. Both pricing models have their pros and cons. On-demand pricing charges only for what you use. This can make budgeting more difficult, because it is hard to predict how many users will use the service and how large their queries will be. In one case, a customer's user mistakenly ran a query that cost more than $1,000, and the customer was billed for it.
With node-based models (e.g. Amazon Redshift and Azure SQL Data Warehouse), you pay per server and by server type. Although this pricing model is more predictable, it is also "always-on", which means that you pay a flat rate regardless of how much you use.
Pricing is an important consideration. Determining the best fit for your company requires extensive use case and workload modeling.
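As a rough illustration of the workload modeling described above, the following Python sketch compares a usage-based bill with a flat, node-based bill and finds the query volume at which they cost the same. It is a minimal sketch, not any vendor's billing formula: the function names, the `Workload` fields, and every rate used are hypothetical placeholders that you would replace with figures from your own vendor quotes.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """A simplified monthly workload profile (hypothetical fields)."""
    queries_per_month: int
    avg_tb_scanned_per_query: float


def on_demand_monthly_cost(workload: Workload, price_per_tb_scanned: float) -> float:
    """Usage-based model: the bill grows with every terabyte scanned."""
    total_tb = workload.queries_per_month * workload.avg_tb_scanned_per_query
    return total_tb * price_per_tb_scanned


def node_based_monthly_cost(node_count: int, price_per_node_hour: float,
                            hours_per_month: float = 730.0) -> float:
    """Node-based model: the cluster is billed whether or not queries run."""
    return node_count * price_per_node_hour * hours_per_month


def break_even_queries(avg_tb_scanned_per_query: float,
                       price_per_tb_scanned: float,
                       node_monthly_cost: float) -> float:
    """Queries per month at which both models cost the same."""
    return node_monthly_cost / (avg_tb_scanned_per_query * price_per_tb_scanned)


# Illustrative numbers only; none of these are real vendor prices.
workload = Workload(queries_per_month=1000, avg_tb_scanned_per_query=0.5)
on_demand = on_demand_monthly_cost(workload, price_per_tb_scanned=4.0)
flat = node_based_monthly_cost(node_count=4, price_per_node_hour=1.0)
threshold = break_even_queries(0.5, 4.0, flat)

print(f"on-demand: ${on_demand:,.2f}  node-based: ${flat:,.2f}")
print(f"break-even at about {threshold:,.0f} queries per month")
```

Below the break-even volume the on-demand model is cheaper in this simplified picture; above it, the flat rate wins. A real model would also need to account for storage fees, concurrency, and variation in query size.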
Challenges and Considerations
We've seen many enterprises try to migrate from their on-premises relational databases and/or data lakes to the cloud. Many find that their migrations "stall" after the first pilot project for the following reasons:
- Disruption: Downstream users (data scientists, business analysts) need to change their behavior and re-tool reports and dashboards.
- Performance: The cloud data warehouse is not as performant as legacy, highly tuned on-premises data platforms.
- Sticker shock: Unanticipated or unplanned operating expenses, combined with a lack of cost controls.
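One common way to limit sticker shock under on-demand pricing is a pre-flight cost check that refuses to run a query whose estimated cost exceeds a cap. The Python sketch below illustrates the idea only. It assumes the warehouse can report an estimate of the bytes a query will scan before it runs; the function names, the exception class, the price per terabyte, and the cap are all hypothetical.

```python
BYTES_PER_TB = 10**12  # assumption: billing uses decimal terabytes


class QueryTooExpensive(Exception):
    """Raised when a query's estimated cost exceeds the allowed cap."""


def estimate_query_cost(estimated_bytes_scanned: int, price_per_tb: float) -> float:
    """Convert an estimated scan size into an estimated cost."""
    return estimated_bytes_scanned / BYTES_PER_TB * price_per_tb


def check_query_budget(estimated_bytes_scanned: int, price_per_tb: float,
                       max_cost_per_query: float) -> float:
    """Return the estimated cost, or raise if it exceeds the cap."""
    cost = estimate_query_cost(estimated_bytes_scanned, price_per_tb)
    if cost > max_cost_per_query:
        raise QueryTooExpensive(
            f"estimated cost ${cost:,.2f} exceeds cap ${max_cost_per_query:,.2f}"
        )
    return cost


# Illustrative values only: a 2 TB scan passes, a 250 TB scan is blocked.
small_cost = check_query_budget(2 * BYTES_PER_TB, price_per_tb=5.0,
                                max_cost_per_query=1000.0)
try:
    check_query_budget(250 * BYTES_PER_TB, price_per_tb=5.0,
                       max_cost_per_query=1000.0)
    blocked = False
except QueryTooExpensive:
    blocked = True
```

A guard like this would sit in front of the query-submission path, so that a runaway query is caught before it is billed rather than discovered on the invoice.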