For years, data modeling has been the foundation of structured reporting, ensuring performance, consistency, and efficiency. But today, the landscape has changed. With cheap storage, powerful processing, and modern BI tools that enable flexible, real-time analysis, is data modeling still necessary, or has it become just one of many options? Many organizations, especially startups, are exploring alternatives to the traditional star schema, opting for approaches that allow them to move faster—replicating raw data and giving analysts the flexibility to shape it as needed. In this post, I’ll examine why data modeling may not be as critical as it once was and explore when it still provides value.
An Analogy – Mise en place
Imagine you’re running a busy kitchen. On one hand, you have the farm-to-table approach, where every dish starts with a trip to the garden. You pick fresh ingredients for each order, creating meals that are vibrant and unique. This works wonderfully when you’re catering to just a few guests or crafting simple dishes. But as the orders pile up, this approach starts to strain: you’re repeatedly running back and forth to the garden, some ingredients might go unused and spoil, and your dishes might vary slightly each time.
Now, contrast this with mise en place, a French culinary term meaning “everything in its place.” This involves preparing and organizing ingredients in advance, so they’re ready to use when cooking begins. When orders come in, you can assemble dishes quickly and consistently because everything is pre-measured, chopped, and within reach. Mise en place shines when serving a large crowd—it’s efficient, cost-effective, and ensures every plate meets expectations. However, it’s not without challenges. The upfront effort is significant, and if the menu changes or the crowd thins out, some of that prep may go to waste. And let’s not forget the rigidity: pivoting to a new recipe mid-service isn’t easy when everything’s already prepped for something else.
In data terms, farm-to-table represents working directly with raw, unstructured data for each analysis. It’s flexible but can be inefficient and inconsistent at scale. Mise en place, on the other hand, mirrors data modeling—a structured approach like the star schema—where pre-organized data enables fast, repeatable, and reliable insights. Each has its place, and the choice depends on the size, complexity, and consistency demands of your data “kitchen.”
Why Does Data Modeling Exist?
Data modeling rose to prominence in the 1990s, driven by the constraints of database technology at the time. Ralph Kimball's book, The Data Warehouse Toolkit, laid the foundation for the practice, addressing the reporting challenges of that era:
- Storage was expensive relative to processing.
- Large, complex analytical queries were often unbearably slow against application databases.
These limitations incentivized moving data out of slow application databases into dedicated data warehouses, where storage costs were kept down by storing as little data as possible. During this transition, data underwent a series of transformations that reshaped it into a star schema and summarized raw records into business metrics. This ETL (Extract, Transform, Load) process spent upfront processing to minimize storage, producing a structured, optimized dataset that improved query performance while balancing normalization and accessibility.
An added benefit of this approach was centralizing business calculations in one place. This ensured downstream reports consistently pulled the same numbers. For example, you no longer had one report claiming 100 new customers while another reported 98. By processing the data once, organizations achieved consistent and reliable reporting.
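To make that concrete, here's a rough sketch of a centralized transform step written in pandas; the table, column, and metric names are hypothetical, not taken from any particular warehouse:

```python
import pandas as pd

# Hypothetical raw extract from the application database
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, 10, 12],
    "order_date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"]),
    "amount": [120.0, 80.0, 45.0, 200.0],
})

# Transform: compute the business metrics once, in one place
daily_metrics = (
    orders.groupby(orders["order_date"].dt.date)
          .agg(total_revenue=("amount", "sum"),
               order_count=("order_id", "count"),
               unique_customers=("customer_id", "nunique"))
          .reset_index()
)

# Load: downstream reports read this summary table instead of the raw rows
print(daily_metrics)
```

Because every report reads the same precomputed table rather than re-deriving the numbers, there's no room for one report to say 100 and another to say 98.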
Today, the main argument for using a star schema or data modeling revolves around its ability to align with business processes and deliver consistent reporting. While this is a great outcome, it comes with significant costs. Importantly, modern technologies and approaches can now achieve similar consistency without relying on a complex ETL process.

Storage Is Now Cheap
Fast forward to today, and storage is dramatically cheaper than processing power. This affordability allows organizations to store massive amounts of data—including multiple copies of databases—without exceeding budgets. For instance, services like AWS S3 offer storage for roughly 2 cents per gigabyte per month, making it feasible to scale storage with minimal cost concerns.
This shift has fundamentally altered the ETL landscape. Organizations can now replicate their entire databases “as is” into data warehousing environments without requiring extensive transformations. The benefits of this modern approach include:
- Real-time access to operational data: Data replication provides near real-time insights, enabling businesses to make timely decisions.
- Enhanced query performance: Data warehouses are optimized for large-scale analytical queries, making them more efficient than application databases when working with replicated datasets.
- Simplified ETL processes: By replicating data directly, organizations reduce the complexity and resources required to manage ETL pipelines, freeing up engineering teams to focus on other priorities.
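As a minimal sketch of this replicate-as-is pattern, here pandas and SQLAlchemy stand in for whatever replication tooling you actually use, and the connection strings and table names are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connections; swap in your real application database and warehouse
source = create_engine("postgresql://user:pass@app-db:5432/app")
warehouse = create_engine("postgresql://user:pass@warehouse:5432/analytics")

# Copy each table "as is": no star schema, no reshaping, just extract and load
for table in ["customers", "orders", "order_items"]:
    df = pd.read_sql_table(table, source)
    df.to_sql(table, warehouse, schema="raw", if_exists="replace", index=False)
```

Most teams would rely on a managed replication service rather than a hand-rolled loop like this, but the principle is the same: land the raw tables unchanged and let the warehouse do the analytical heavy lifting.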
The affordability of storage has removed many traditional data management constraints, opening the door for scalable and flexible strategies.
Denormalized Schemas
Another result of cheap storage is the ability to rethink how we model data. Denormalizing, or flattening, involves joining related data into a single table. While traditional data normalization focuses on reducing redundancy, in a data warehouse environment redundancy is less of a concern. With storage costs so low, repeating data no longer carries the same penalties it once did.
The primary benefit of denormalization is the reduction of complex joins, which can significantly improve query performance. Complex joins tend to increase data warehouse costs and slow down report generation. In contrast, denormalized schemas allow data warehouses to perform optimally, especially when queries rely on only a few large, pre-joined tables.
Denormalization is particularly advantageous when you need to run the same type of complex query multiple times. Instead of processing the query repeatedly, you can precompute the results and use the flattened table for subsequent reports. This not only saves processing time but also ensures faster and more consistent reporting performance.
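For illustration, here is a small sketch of precomputing a flattened table with pandas; the tables and columns are hypothetical, and downstream reports would query the one wide table instead of repeating the join:

```python
import pandas as pd

# Hypothetical normalized tables replicated from the application database
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 11, 10], "amount": [120.0, 80.0, 45.0]})
customers = pd.DataFrame({"customer_id": [10, 11], "customer_name": ["Acme", "Globex"], "region": ["EMEA", "AMER"]})

# Precompute the join once; customer attributes repeat on every order row,
# which is acceptable now that storage is cheap
orders_flat = orders.merge(customers, on="customer_id", how="left")

# Persist the flattened table for every report to reuse (to_parquet assumes pyarrow is installed)
orders_flat.to_parquet("orders_flat.parquet", index=False)
```

Reports then select from the flattened table directly, paying the join cost once at build time rather than on every query.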

Data Modeling in Practice
The application of data modeling is most commonly seen in enterprise BI tools, each with its own approach to structuring and managing data. Broadly speaking, there are two main strategies in the BI landscape: the governed data model and the workbook-style approach.
Governed Data Model
This approach prioritizes consistency and control across all reports. By enforcing a structured data model, organizations can ensure that metrics align and that different reports pull from a unified source of truth. However, this comes with significant upfront costs in terms of time, effort, and maintenance. Building and maintaining a governed data model requires investment in both development and governance, which can slow down initial implementation and make changes more cumbersome.
Workbook-Style BI
On the other hand, workbook-style BI tools favor flexibility and rapid development. Analysts can pull in raw data, perform ad hoc joins, and create calculations directly within the report. This method is highly accessible and allows teams to iterate quickly without waiting for a predefined model. However, the downside is that reports can quickly become inconsistent, leading to duplicated logic and multiple versions of key business metrics, which can create confusion across the organization.
Both approaches have their place, and choosing between them depends on an organization’s size, complexity, and need for either strict governance or agile data exploration.
Evaluating BI Tools: Strengths, Weaknesses, and Trade-offs
Looker, now owned by Google, is the poster child for the governed data model. While it promises consistency and control, the reality is that getting started can take weeks, months, or even years before reports are up and running. For enterprises juggling multiple data sources and complex business logic, Looker provides a strong analytical foundation—but at a steep cost.
The massive upfront time commitment alone makes Looker a tough sell for agile teams. If the data model isn’t fully fleshed out, reporting can be severely limited, making it difficult to iterate quickly. Modifying the data model later is just as painful, requiring lengthy coordination and approvals, which often frustrates business teams that need to track changes in real time. Imagine wanting to measure the adoption of a new feature, only to be told it’ll take months before that data is available. That kind of delay can make Looker feel more like a roadblock than an enabler.
To add insult to injury, Looker isn’t cheap. Its pricing can be 2-3x more expensive than comparable tools, making it hard to justify for companies that prioritize speed over rigid governance. While it’s a powerful tool for organizations that need a strictly controlled data environment, for many, the trade-offs in agility, cost, and development time make it far from the best option.
Power BI follows the traditional star schema approach, making it a structured and reliable option for enterprise analytics. However, this approach comes with trade-offs—complex joins must be handled upstream, often requiring denormalization or explicit modeling before reports can be built. While this works well for traditional BI workflows, it can be a blocker for real-time reporting, particularly when data changes frequently and requires near-instant analysis.
Now that Power BI is integrated into the Fabric ecosystem, upstream transformations have become more seamless, making it easier to manage data before it reaches your reports. Additionally, its extensive range of connectors minimizes the need for custom code. With Power BI Pro available for just $10 per user per month—significantly less than many other enterprise tools—and considering that comparable BI solutions can cost well over $100K per year, Power BI’s inclusion in enterprise licenses offers a substantial competitive advantage.
That said, Power BI has a learning curve. Its methodology is different from traditional BI tools, and users often need to adapt to the “Power BI way” of handling data. But for companies willing to make that investment, it offers a full suite of enterprise-grade analytics features at a fraction of the cost of other tools.
Sigma Computing leans towards the workbook-style BI model, offering flexibility with a spreadsheet-like interface that makes it easy for business users to interact with data. Sigma stands out for its ease of use and rapid deployment—I’ve never been able to go from raw data to a usable report faster than with this tool.
However, this flexibility has its downsides. Without a centralized data model, organizations may find themselves with multiple reports showing the same calculation but yielding different results. Managing report sprawl can become an issue, though for many, the trade-off is worth it—fast reporting deployment outweighs the maintenance overhead of a governed data model.
Looker Studio (formerly Google Data Studio) is a widely used option primarily because it’s free, whereas enterprise BI tools can come with significant licensing costs. For teams already within the Google ecosystem, particularly those leveraging BigQuery, it serves as an accessible choice for creating simple reports.
However, Data Studio has significant limitations. It only supports flat tables, meaning that if you need to join multiple datasets, you must first create a view or a materialized results set before reporting. While it does provide a centralized layer for calculations, its functionality is limited compared to more sophisticated BI platforms.
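For example, the flattening can be pushed into the warehouse as a view before Data Studio ever touches the data. The sketch below assumes BigQuery and the google-cloud-bigquery client; the project, dataset, and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes Google Cloud credentials are already configured

# Flatten the join into a view so the report only ever sees a single table
flatten_sql = """
CREATE OR REPLACE VIEW `my_project.reporting.orders_flat` AS
SELECT o.order_id, o.order_date, o.amount, c.customer_name, c.region
FROM `my_project.raw.orders` AS o
LEFT JOIN `my_project.raw.customers` AS c USING (customer_id)
"""
client.query(flatten_sql).result()  # wait for the DDL statement to finish
```

Data Studio can then connect to the view as if it were a single flat table.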
The tool offers an intuitive canvas-style reporting interface with essential reporting features, making it easy for users to quickly build visualizations. However, its lack of built-in data modeling presents challenges for more complex use cases.
In response, many teams resort to elaborate workarounds to replicate enterprise features. Some examples include:
- Manually implementing row-level security, since there is no built-in governance model.
- Creating excessively long SQL queries to flatten data before use.
- Embedding reports using inefficient methods, as Data Studio does not support non-public embedding natively.
While the appeal of a free tool is strong, organizations must weigh the hidden costs associated with these workarounds against the price of investing in a more robust BI solution that provides built-in governance, scalability, and flexibility.
Streamlit is a framework for building BI dashboards entirely in Python. It’s a great choice for Python developers looking to streamline the process from raw data to an interactive dashboard.
Since it’s Python-based, Streamlit enables highly dynamic dashboard creation. Unlike tools such as Sigma or Looker Studio, which rely on static layouts of components and widgets, a coding framework like Streamlit provides more flexibility by allowing conditional logic and programmatic design adjustments.
However, because Streamlit is an open-source tool, hosting it requires additional effort, which can be a challenge for organizations without a strong Python infrastructure. If it is integrated alongside an existing non-Python codebase, proper authentication and security measures must be considered. Additionally, performance depends heavily on the backend, whether that’s a database or Pandas-based manipulation, whereas traditional BI tools are often optimized for analytical engines by default.
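To give a feel for that code-first style, here is a minimal sketch of a Streamlit dashboard; the data is made up and would normally come from a warehouse query:

```python
# app.py -- run with: streamlit run app.py
import pandas as pd
import streamlit as st

# Made-up data standing in for a warehouse query
sales = pd.DataFrame({
    "region": ["EMEA", "AMER", "APAC", "EMEA", "AMER", "APAC"],
    "month": ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"],
    "revenue": [120, 200, 90, 150, 210, 110],
})

st.title("Revenue by Region")

# Widgets and conditional layout are just Python, unlike a static drag-and-drop canvas
region = st.selectbox("Region", sorted(sales["region"].unique()))
filtered = sales[sales["region"] == region]

if filtered["revenue"].sum() > 0:
    st.bar_chart(filtered, x="month", y="revenue")
else:
    st.write("No revenue recorded for this region yet.")
```

Because filters, charts, and layout are ordinary Python, the dashboard can adapt its own structure based on the data it receives.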
Conclusion
Data modeling has long been considered a cornerstone of data engineering, but in today’s landscape, its necessity is far from absolute. While structured approaches like the star schema ensure consistency and governance, modern advancements in data storage, processing power, and self-service analytics have given organizations more options than ever before. The reality is, many businesses can now move faster and deliver insights without rigid modeling, reducing upfront complexity and maintenance overhead. Instead of blindly adhering to traditional methods, organizations should question whether data modeling serves their needs or simply slows them down. The key is finding the right balance between structure and agility—whether that means adopting full-scale modeling, leveraging more flexible BI tools, or embracing a hybrid approach that keeps them competitive in an ever-changing data environment.
Thanks for reading!