An endless supply of tools

The Modern Data Stack: Still Too Complicated

In the quest to make data-driven decisions, what seems like a straightforward process of moving data from source systems to a central analytical workspace often explodes in complexity and overhead. This post explores why the modern data stack remains too complicated and how various tools and services attempt to address these challenges today.

Data Driven Decision Making

Analytics teams exist because organizations want to make decisions using data. This can take the form of reports, dashboards, or sophisticated data science projects. However, as companies grow, consistently using data across an organization becomes really difficult due to technical and organizational hurdles.

Technical Hurdles:

  1. Large Data Volumes: As data volumes grow, primary application databases struggle to keep up with analytical queries on top of production traffic.
  2. Data Silos: Data is spread across multiple systems, making it hard to analyze all information in one place.
  3. Complex Business Logic: Implementing and maintaining complex business logic can be challenging.

Organizational Constraints:

  1. Tight Budgets: Budgets are often tight, limiting the ability to invest in needed tools and resources.
  2. Limited Knowledge: There is often limited knowledge of available data technologies and tooling.
  3. Competing Priorities: Other organizational priorities can divert focus from data initiatives.

These organizational hurdles, combined with technical challenges, make it difficult to complete data projects even with the latest technology.

Persistent Challenges in Modern Data Analytics

Data operations are still siloed and overly complex even with modern data tooling. To understand the current landscape, I want to walk through a few key milestones in data technology that illustrate the challenges:

  1. Cloud providers offer various tools, but scaling remains complex.
  2. Data companies have emerged to simplify architecture, leading to the “data stack.”
  3. Fragmented teams and managerial overhead persist.
  4. Batch ETL processes are too slow to meet current analytical demands.
  5. Managing multiple vendors and processing steps is costly.
  6. New technologies from cloud vendors aim to streamline workflows.

Cloud Providers Offer Various Tools, But Scaling Remains Complex

Cloud providers like AWS, GCP, and Azure offer many essential tools for data engineering, such as storage, computing, and logging services. While these tools provide the components needed to build a data stack, using and integrating them is far from straightforward.

The complexity starts with the tools themselves. AWS offers Glue, GCP provides Data Fusion, and Azure has Data Factory (ADF). These tools are complicated to deploy and configure, and they are often too complex for business users. As a result, they are primarily accessible to software engineers and cloud architects. These tools manage to be both rudimentary and over-engineered for what should be a simple process.
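To make this concrete, here is a minimal sketch (in Python with boto3) of just triggering and monitoring a single pre-existing AWS Glue job. The job name and region are hypothetical, and the job definition, IAM roles, connections, and Glue script all have to be set up separately, which is where most of the real effort and complexity lives.

    # Minimal sketch: triggering one pre-existing AWS Glue job from Python.
    # The job name "nightly_orders_etl" and the region are placeholders; the
    # job itself, its IAM role, and its script must already exist in Glue.
    import time

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Start the job run; Glue returns an identifier we can poll.
    run = glue.start_job_run(JobName="nightly_orders_etl")
    run_id = run["JobRunId"]

    # Poll until the run reaches a terminal state.
    while True:
        status = glue.get_job_run(JobName="nightly_orders_etl", RunId=run_id)
        state = status["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            print(f"Run {run_id} finished with state {state}")
            break
        time.sleep(30)

Even this "simple" snippet assumes someone has already done the genuinely hard parts: writing the job script, wiring up permissions, and configuring connections to the source systems.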

The complexity multiplies when you need to use multiple components for your data pipelines. Each new pipeline introduces another potential breaking point, making it challenging to identify and fix issues. Teams often struggle to choose the right tools, sometimes opting for relational databases instead of those optimized for analytics, due to lack of experience.

Furthermore, integrating these tools involves significant management overhead. Each tool may have its own configuration requirements, monitoring systems, and update cycles. Ensuring these disparate systems work together in harmony requires specialized skills and ongoing maintenance. Additionally, managing data governance and security is challenging due to the lack of data lineage and multiple data storage locations.

Although cloud providers offer many useful tools, scaling remains a significant challenge due to complex integrations, the expertise required to manage them, and the additional management overhead. This complexity can slow down development and create bottlenecks, affecting the overall efficiency of data operations.

Recognizing these gaps gives a more realistic picture of what it actually takes to scale on cloud providers’ tools alone.

Data Companies Have Emerged to Simplify Architecture, Leading to the “Data Stack”

To solve these challenges, many companies have stepped up and stitched together these services in a more scalable way. This has made it easier to create and manage hundreds or even thousands of pipelines. However, few companies handle the entire end-to-end data lifecycle.

This has led to the rise of the “data stack,” where various tools are stitched together to provide analytics. An example of this is the Fivetran > BigQuery > Looker stack. It offers a way to deploy production pipelines and reports using a proven system, so you don’t have to build it all from scratch.
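As a rough sketch of the middle of that stack: once Fivetran has synced a raw table into BigQuery, the transformation step often amounts to running SQL against it, for example with Google’s Python client. The project, dataset, and table names below are placeholders, not a prescribed setup.

    # Sketch of the BigQuery step in a Fivetran > BigQuery > Looker stack.
    # Assumes Fivetran has already landed a raw table (the hypothetical
    # raw_shop.orders) and we build a reporting table for Looker to model.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-analytics-project")

    sql = """
    CREATE OR REPLACE TABLE reporting.daily_revenue AS
    SELECT
      DATE(created_at) AS order_date,
      SUM(amount)      AS revenue
    FROM raw_shop.orders
    GROUP BY 1
    """

    # Run the transformation; Looker dashboards would then point at
    # reporting.daily_revenue.
    client.query(sql).result()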

While these tools simplify the process of setting up architecture, they can be complicated to use individually. They are still independent tools, requiring customization and expertise to ensure they work well together. Coordination among these tools is necessary but challenging, especially when dealing with different vendors and keeping up with updates or changes in each tool.

Moreover, the “data stack” approach can introduce its own set of complexities, including managing data consistency, monitoring performance, and ensuring security across multiple platforms. So, even though these companies have made some aspects easier, the overall process remains quite complex.

Fragmented Teams and Managerial Overhead Persist

Now that the stack has well-defined categories—data ingestion, data warehousing, and dashboards—teams are formed around this structure with managers and individual contributors at each level. Additionally, at larger organizations, you may see roles that oversee these three teams, such as data management and data governance.

Vendor tools have simplified the process compared to using off-the-shelf cloud resources, but getting from source data to dashboards still involves numerous steps. A typical process includes data extraction, transformation, loading, storage, querying, and finally, visualization. Each of these steps requires specific tools and expertise, and coordinating them can be labor-intensive.
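As a toy illustration of how many distinct hand-offs that involves, here is a deliberately simplified pipeline in plain Python. The source and warehouse are just hypothetical SQLite files, but each numbered step maps to a team, a tool, or a ticket in a real organization.

    # Toy sketch of the steps listed above: extract, transform, load, query.
    # "app.db" and "warehouse.db" are hypothetical stand-ins for a production
    # database and an analytical warehouse.
    import sqlite3

    import pandas as pd

    # 1. Extract from the source system.
    source = sqlite3.connect("app.db")
    orders = pd.read_sql_query("SELECT id, amount, created_at FROM orders", source)

    # 2. Transform: apply business logic.
    orders["created_date"] = pd.to_datetime(orders["created_at"]).dt.date
    daily = orders.groupby("created_date", as_index=False)["amount"].sum()

    # 3. Load into the analytical store.
    warehouse = sqlite3.connect("warehouse.db")
    daily.to_sql("daily_revenue", warehouse, if_exists="replace", index=False)

    # 4. Query for the dashboard / visualization layer.
    print(pd.read_sql_query(
        "SELECT * FROM daily_revenue ORDER BY created_date", warehouse
    ))

In practice every one of these steps is owned by a different tool and often a different team, which is exactly where the coordination cost comes from.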

When you want to make a change, you often have to go through parts of this process again. As data demands from an organization increase, teams can get backlogged, and even simple tasks like adding a column can take months to complete. The bottleneck usually lands with the data engineering team, which may struggle with a lack of automation or ongoing maintenance tasks that prevent them from focusing on new initiatives.

This bottleneck can lead data analysts to bypass the standard processes, connecting directly to source systems to get the data they need. While this might solve immediate needs, it creates inconsistencies and can lead to data quality issues and security concerns.

Large teams can compound the complexity, introducing more handoffs and compartmentalization. This often results in over-engineered solutions, as each team focuses on optimizing their part of the process without considering the end-to-end workflow.

In summary, while modern tools have structured the data pipeline into clear categories, the number of steps and the management overhead required to coordinate them remain significant challenges.

Batch ETL Processes Are Too Slow to Meet Current Analytical Demands

Batch ETL processes have long been the standard for moving data from source systems into data warehouses or data lakes. Typically, this involves nightly updates where data is extracted, transformed, and loaded in bulk. While this method is proven and cost-effective, it has significant limitations in the context of modern analytical demands.
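A minimal sketch of such a nightly incremental extract is shown below, assuming a hypothetical orders table and a stored high-watermark from the last successful run. Anything created after tonight’s run will not show up downstream until tomorrow’s, which is the freshness gap discussed next.

    # Sketch of a nightly incremental batch extract with a high-watermark.
    # The orders table, file names, and scheduling are all hypothetical.
    import json
    import sqlite3
    from datetime import datetime, timezone
    from pathlib import Path

    STATE_FILE = Path("watermark.json")

    def load_watermark() -> str:
        # The watermark is the timestamp of the last successful extract.
        if STATE_FILE.exists():
            return json.loads(STATE_FILE.read_text())["last_extracted_at"]
        return "1970-01-01T00:00:00"

    def nightly_extract() -> None:
        watermark = load_watermark()
        conn = sqlite3.connect("app.db")
        rows = conn.execute(
            "SELECT id, amount, created_at FROM orders WHERE created_at > ?",
            (watermark,),
        ).fetchall()
        # ... transform and load `rows` into the warehouse here ...
        STATE_FILE.write_text(json.dumps(
            {"last_extracted_at": datetime.now(timezone.utc).isoformat()}
        ))

    if __name__ == "__main__":
        nightly_extract()  # typically triggered once per night by cron or a scheduler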

Many analytics use cases now require up-to-date data to make timely decisions. For instance, customer service teams need access to recent data to troubleshoot ongoing issues. Waiting for the next batch update means that teams either have to rely on outdated data or go with their gut feeling, neither of which is ideal. This delay also often forces analysts to directly query source systems, circumventing the established ETL processes and investments.

Batch ETL’s inherent slowness makes it insufficient for real-time or near-real-time analytics, causing organizations to struggle with meeting the fast-paced demands of today’s data-driven applications. This lag can be particularly problematic in dynamic environments where timely insights are critical for operational decision-making.

Furthermore, frequent changes in data sources and structures can exacerbate the inefficiencies of batch ETL. Each change might necessitate an update or a reconfiguration of the ETL processes, leading to delays and potential disruptions in data availability. These complications increase the complexity and overhead involved in maintaining the data pipeline.

In summary, while batch ETL processes have served their purpose, they are too slow to meet the real-time analytical needs of modern organizations. This necessitates looking into more advanced, real-time data processing solutions that can keep up with current demands.

Managing Multiple Vendors and Processing Steps Is Costly

The complexity of the modern data stack often requires organizations to use tools and services from multiple vendors. Each vendor specializes in a specific part of the data pipeline, such as data ingestion, storage, transformation, or visualization. While this specialization can provide best-in-class functionality for each step, it also introduces several challenges:

Managing multiple vendors and their associated tools involves significant costs. Licensing fees, support contracts, and training expenses can quickly add up. Additionally, each tool has its own maintenance requirements, updates, and configuration settings, increasing the administrative overhead.

Integrating these disparate tools and ensuring they work seamlessly together is another challenge. Different tools may have varying data formats, APIs, and compatibility issues. Custom solutions or middleware are often needed to bridge gaps between these tools, adding to the complexity and cost.
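A typical piece of that glue code might look like the hypothetical bridge below: one vendor tool exports records as JSON, the next tool only ingests CSV, so a small custom script sits in between. Scripts like this accumulate quickly, and each one has to be maintained, monitored, and updated alongside the tools it connects.

    # Hypothetical bridge script between two vendor tools: convert a JSON
    # export (a list of flat records) into the CSV format another tool expects.
    import csv
    import json
    from pathlib import Path

    def json_export_to_csv(src: Path, dest: Path) -> None:
        records = json.loads(src.read_text())
        if not records:
            return
        with dest.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(records[0]))
            writer.writeheader()
            writer.writerows(records)

    # File names are placeholders for whatever the two tools actually produce
    # and consume.
    json_export_to_csv(Path("vendor_a_export.json"), Path("vendor_b_import.csv"))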

Coordinating updates across multiple systems can also be a logistical nightmare. An update to one tool might necessitate changes to others, creating a domino effect that requires careful planning and testing. This can lead to downtime or performance issues if not managed properly.

Moreover, ensuring consistent data quality and security across multiple platforms is challenging. Each tool might have its own data validation rules and security protocols, requiring a unified approach to maintain consistency and compliance.

In summary, while using multiple specialized tools can enhance functionality, it also brings significant expenses and complexity. Managing these costs and integrations effectively is crucial for maintaining an efficient and secure data pipeline.

To fully appreciate the number of steps and vendors in the space, check out a16z’s overview of emerging data infrastructure architectures: https://a16z.com/emerging-architectures-for-modern-data-infrastructure/

New Technologies From Cloud Vendors Aim to Streamline Workflows

To address the complexities of the modern data stack, cloud providers have introduced new technologies designed to streamline and consolidate workflows. These advancements aim to reduce the number of disparate tools and simplify the overall data management process.

For example, Microsoft has developed Microsoft Fabric, which integrates various data services into a single platform. Similar to what Databricks has done, Microsoft Fabric includes Power BI and offers seamless integration with the broader Microsoft ecosystem. This approach aims to provide all the necessary tools for data engineering, storage, and analytics in one cohesive system.

Google has also been making strides in this area with its BigQuery platform. BigQuery consolidates multiple data processing and storage capabilities into a unified service, simplifying the process of managing and analyzing large datasets.
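As a rough illustration of that consolidation, the sketch below (with placeholder bucket, dataset, and table names) loads a file into BigQuery-managed storage and then queries it with the same client, with no separate processing engine to operate.

    # Sketch of BigQuery handling both storage and querying in one service.
    # The bucket, dataset, and table names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-analytics-project")

    # Load a file from Cloud Storage straight into a managed table.
    load_job = client.load_table_from_uri(
        "gs://my-bucket/exports/orders.csv",
        "my-analytics-project.analytics.orders",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
        ),
    )
    load_job.result()  # wait for the load to finish

    # Query the same table with the same client.
    rows = client.query("SELECT COUNT(*) AS n FROM analytics.orders").result()
    print(list(rows)[0]["n"])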

Final Thoughts

The modern data stack, while powerful, remains complex and challenging to manage. Technical hurdles, such as huge data volumes, data silos, and intricate business logic, are compounded by organizational constraints like tight budgets, limited knowledge, and competing priorities. Despite the emergence of specialized tools and cloud providers’ efforts to streamline workflows, scaling and integrating these services continue to require significant expertise and management overhead. To truly simplify data operations, organizations must strategically navigate these complexities, adopting advanced, real-time processing solutions and leveraging new technologies that consolidate workflows. By doing so, they can enhance their data-driven decision-making and ultimately drive better business outcomes.

