Cloud data warehouses have become the cornerstone of modern data analytics stacks, providing a centralized repository for storing and efficiently querying data from multiple sources. They offer a rich ecosystem of integrated data apps, enabling seamless team collaboration. However, as data analytics has evolved, cloud data warehouses have become expensive and slow. In this post, we’ll explore the changing needs of data analytics and examine how cloud data warehouses impact modern analytics workflows.
Modern Complexities: The Apartment Building Analogy for Cloud Data Warehousing
Imagine an ultra-modern luxury apartment complex right in the city center. From the moment you step inside, everything is taken care of, much like a serverless cloud data warehouse: there’s no need to worry about maintenance or any of the usual hassles of homeownership.
Initially, it’s quite serene around the complex. With just a handful of tenants, the residents have the entire place to themselves. Taking a dip in the pool or spending time on the golf simulator requires no planning or booking; these amenities are always available. This golden period mirrors the early days of data warehousing, when managing data and sources was straightforward and access to resources like processing power and storage was ample, free from today’s competitive pressures.
As the building evolves to accommodate more residents, its layout adapts, adopting a modular, open-plan design to ensure that new tenants can move in swiftly and efficiently. This mirrors the shift towards normalized data sets in data warehousing, where speed is of the essence, reducing the time from data creation to availability while minimizing the need for extensive remodeling—or in data terms, modeling.
With each new tenant comes a new set of furniture and personal effects, adding to the building’s diversity. Similarly, as more data sources are added to the data warehouse, each brings its own format and complexity, necessitating adaptable infrastructure to integrate these new elements seamlessly.
However, the complexity doesn’t end there. As the building expands, the intricacy of its utility networks—electricity, water, gas—grows. This is similar to the increasing complexity of joins in the data warehouse, where more elaborate data modeling is required to stitch together information from these varied sources, ensuring that the building’s lifeblood (its utilities) reaches every unit without a hitch.
Yet, as the building’s amenities and services expand to cater to its residents—ranging from in-house gyms to communal lounges—the demand on resources spikes. Dashboards and reports, with their numerous components, draw on the data warehouse much like residents tapping into the building’s utilities, increasing query load and concurrency. This growth in demand mirrors the real-life strain on an apartment building’s resources as more residents access its facilities simultaneously.
Limitations begin to emerge, much like the challenges faced by such an apartment complex. The building, accessible only through its physical location, reflects the cloud-only access of data warehouses like BigQuery, where each query—each request for service—incurs a cost. Performance can wane under heavy demand; just as the building’s elevators and utilities can falter when every tenant decides to draw on them at once, so too can data warehouse performance suffer from complex, multi-table operations.
In this bustling apartment complex, a significant issue arises from the lack of communication between tenants and management. Residents, unsure of whom to contact, let small issues fester until they become major problems. This mirrors the expensive nature of data exploration in the cloud data warehouse; trends and patterns start emerging within the data, unnoticed until a significant issue breaks the surface, much like undiscovered maintenance issues lead to emergencies in the apartment complex.
Furthermore, the centralized nature of the building’s management can lead to bottlenecks, akin to concurrency issues in data warehousing. A single point of contact for maintenance requests means that during peak times, residents might face delays in getting issues addressed, just as data users experience wait times during high query loads.
In weaving this narrative, the apartment complex, in its perpetual state of flux and facing numerous challenges, serves as an illustrative parallel to the cloud data warehouse. Both are tasked with navigating the intricacies of growth and integration, balancing user demands against the efficiency of their infrastructure, all while aiming to deliver exceptional service levels amid escalating expectations.
Key Trends in Data Analytics
Let’s shift focus to some key trends in data analytics that are straining cloud data warehouses and driving up costs.
Data Analysts Require Real-Time Data
Ideally, a data analyst could use data in reports and dashboards the moment it’s generated. The standard 24-hour delay for data refreshes suits historical analysis well, but developer and support teams need more up-to-date information. These teams operate within real-time workflows, where immediate data access significantly influences decision-making and alarm generation. Business teams often overlook the trade-off between cost and data freshness, expecting real-time updates across all systems, an expectation that, while technically feasible, is prohibitively expensive and impractical for most scenarios. To bridge this gap, innovative data replication technologies have been developed to minimize latency between source systems and data warehouses. Among these, Datastream for BigQuery, a serverless service, stands out as a prominent solution. Estuary, a newcomer to the industry, offers a service that promises even faster and more extensive replication capabilities.
However, this low-latency replication introduces a challenge: the data arrives normalized, which can slow cloud data warehouse performance because of the high volume of data and the complexity of the required joins. In today’s analytical workflows, there’s a need to distinguish between real-time and historical use cases to work around these constraints. Real-time analytics demand that each new piece of data be analyzed immediately for timely alerts, like a fire alarm that activates at the first sign of smoke: you cannot wait 24 hours for a data refresh to decide whether an alert is warranted, and you do not need five years’ worth of smoke readings to decide whether to sound the alarm. Conversely, historical analysis typically requires data modeling and denormalization to improve query performance and data integrity.
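To make the fire-alarm distinction concrete, here is a minimal Python sketch under stated assumptions: the sensor readings, the alert threshold, and the `run_warehouse_query` callable are all hypothetical. The point is that the real-time path inspects only the latest event, while the historical path aggregates a modeled, denormalized table on a much more relaxed schedule.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical threshold for a smoke-sensor reading that should trigger an alarm.
SMOKE_ALERT_THRESHOLD = 0.8


@dataclass
class SensorReading:
    sensor_id: str
    value: float
    recorded_at: datetime


def handle_reading(reading: SensorReading) -> None:
    """Real-time path: evaluate each event the moment it arrives, no warehouse round trip."""
    if reading.value >= SMOKE_ALERT_THRESHOLD:
        print(f"ALERT: sensor {reading.sensor_id} read {reading.value} at {reading.recorded_at}")


def monthly_smoke_summary(run_warehouse_query):
    """Historical path: aggregate a modeled, denormalized table on a relaxed schedule."""
    # run_warehouse_query is a hypothetical callable that executes SQL in the warehouse.
    return run_warehouse_query(
        """
        SELECT DATE_TRUNC('month', recorded_at) AS month,
               AVG(value) AS avg_reading,
               MAX(value) AS peak_reading
        FROM analytics.sensor_readings
        GROUP BY 1
        ORDER BY 1
        """
    )
```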
Expanding Data Sources
Organizations are increasingly incorporating more data sources, largely due to adopting third-party tools designed to improve business operations. Salesforce, Zendesk, and Hubspot are prime examples, deeply embedded in the routines of business users. Beyond their primary functions, these tools produce valuable data. When this data is joined with data from other sources, it significantly boosts the depth of analysis possible.
Extracting data from these diverse sources varies in complexity. Services like Salesforce provide comprehensive APIs and a variety of connectors, easing the integration process. However, integrating less common tools, which also offer APIs, poses a challenge that organizations must navigate. This integration is complex due to the unique combination of technologies, processes, and data strategies each organization employs. Successfully leveraging the vast amount of available information requires both technical skill and strategic planning, ensuring efficient and effective use of data.
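As a rough illustration of what such a custom integration often looks like, here is a hedged Python sketch that pages through a hypothetical REST endpoint with a cursor and stages the records as newline-delimited JSON, ready for loading into the warehouse. The endpoint, field names, and authentication scheme are assumptions; real tools vary widely.

```python
import json
import requests

API_URL = "https://api.example-tool.com/v1/tickets"  # hypothetical endpoint
API_TOKEN = "..."  # supplied via a secret manager in practice


def extract_tickets(updated_since: str, out_path: str) -> int:
    """Page through the API with a cursor and stage records as newline-delimited JSON."""
    headers = {"Authorization": f"Bearer {API_TOKEN}"}
    params = {"updated_since": updated_since, "page_size": 100}
    total = 0
    with open(out_path, "w") as out:
        while True:
            resp = requests.get(API_URL, headers=headers, params=params, timeout=30)
            resp.raise_for_status()
            payload = resp.json()
            for record in payload.get("results", []):
                out.write(json.dumps(record) + "\n")
                total += 1
            cursor = payload.get("next_cursor")
            if not cursor:
                break
            params["cursor"] = cursor
    return total
```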
Increasing Complexity in Data Warehouse Queries
The demand for real-time data access (which leaves data sets in normalized form), coupled with the proliferation of data sources, has significantly increased the complexity of data warehouse queries. Queries designed for application databases, which typically perform swiftly, tend to slow down considerably when executed in a data warehouse environment. The best performance is observed in queries involving a single table; as query complexity increases, queries that previously executed in seconds may take a minute or more. This slowdown is exacerbated by the need to scan larger volumes of data, which directly impacts costs, a concern particularly relevant for platforms like BigQuery.
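To see why this matters for cost on a platform like BigQuery, the sketch below uses the google-cloud-bigquery client’s dry-run mode to compare the bytes a single-table query would scan against a multi-join query over normalized tables; on-demand pricing is driven largely by bytes scanned. The project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

SINGLE_TABLE_SQL = """
SELECT status, COUNT(*) AS n
FROM `my_project.analytics.orders`             -- hypothetical denormalized table
WHERE order_date >= '2024-01-01'
GROUP BY status
"""

MULTI_JOIN_SQL = """
SELECT c.region, p.category, COUNT(*) AS n
FROM `my_project.raw.orders` o                 -- hypothetical normalized tables
JOIN `my_project.raw.customers` c ON c.customer_id = o.customer_id
JOIN `my_project.raw.order_items` i ON i.order_id = o.order_id
JOIN `my_project.raw.products` p ON p.product_id = i.product_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.region, p.category
"""


def dry_run_bytes(sql: str) -> int:
    """Estimate bytes scanned without running (or paying for) the query."""
    job = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
    return job.total_bytes_processed


for name, sql in [("single table", SINGLE_TABLE_SQL), ("multi join", MULTI_JOIN_SQL)]:
    print(f"{name}: ~{dry_run_bytes(sql) / 1e9:.2f} GB scanned")
```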
Dashboards: Increasing Complexity, More Components, and Broader Access
Dashboards have become increasingly sophisticated, incorporating more components and serving a broader user base. Tools such as Tableau, Looker, and PowerBI have simplified the process of accessing data stored in warehouses, positioning themselves as indispensable resources for data analysts. As the volume of collected data grows and originates from a wider array of sources, dashboards are being tasked with displaying more charts and handling more queries. Concurrently, an increasing number of users rely on these dashboards to inform their decision-making processes, leading to a surge in data warehouse queries. This uptick in demand can strain data warehouse performance and, more critically, lead to significant increases in operational costs.
Why I Wrote This Post
I’m not writing this to pitch a new product or service. Rather, my intention is to shed light on some of the more pressing issues facing our field today, provide insights into the evolving landscape, and invite dialogue. It’s an unfortunate truth that searching for ways to lower our data warehouse bills often leads us down a rabbit hole with no clear exit, reflecting both the deepening challenges and the opportunities for innovation in the space. This piece explores the murkier areas of data engineering, areas often shrouded in ambiguity and ripe for speculation in the absence of clear guidance. It’s essential to recognize the motivations of cloud providers, whose business strategies are designed to foster dependency and increased consumption of their services. Understanding this dynamic is crucial as we navigate the intricate terrain of data management and strive for efficiency amid the push toward greater platform reliance.
Additionally, my growing frustration with the escalating costs of cloud services cannot be overstated. The typical advice for reducing these expenses circles back to adopting more advanced techniques or integrating additional services. However well-intentioned, this advice leads to increased dependency on cloud providers. It not only complicates our tech stacks but also, more often than not, increases the very costs we’re trying to cut. It’s a cycle in which the solution to cloud service issues seems to be even more cloud services, a path that benefits the provider more than the user.
When it comes to cloud data warehouses, a significant gap exists in their support for straightforward data exploration or proactive trend monitoring. The default solution? Use a BI tool, which typically requires the user to manually create charts.
On a brighter note, I’m genuinely enthusiastic about the developments with DuckDB and MotherDuck. These projects are making strides against the prevailing trends in data analytics by enabling analytics to be run locally. This approach not only simplifies the analytical process but also presents a more cost-effective alternative to the cloud-centric models that dominate our current landscape. For those seeking relief from the constraints of cloud dependencies and the high costs they entail, DuckDB and MotherDuck offer a compelling avenue to explore further.
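As a small taste of that local-first workflow, here is a minimal sketch using DuckDB’s Python API to aggregate Parquet exports directly on a laptop; the file path and columns are placeholders. MotherDuck builds a hosted service on the same engine, so scaling up does not require rewriting the query.

```python
import duckdb

# An in-memory database; pass a file path to connect() to persist it locally.
con = duckdb.connect()

# Query local Parquet exports directly: no load step and no warehouse round trip.
rows = con.execute(
    """
    SELECT status, COUNT(*) AS n, AVG(amount) AS avg_amount
    FROM read_parquet('exports/orders_*.parquet')  -- placeholder path
    GROUP BY status
    ORDER BY n DESC
    """
).fetchall()

for status, n, avg_amount in rows:
    print(status, n, round(avg_amount, 2))
```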
Thanks for reading!