WHAT WE THINK

Is Life Better at the Data Lakehouse?

Data Strategy
1-Aug-22

Modern enterprise data has to be the intersection of platform and architecture. In the past, enterprise data solutions relied on one or the other, to the detriment of both. Those solutions quickly became unwieldy and unmanageable, forcing users (and the market) to find ways to adjust and overcome those limitations. Today, platform and architecture are converging and giving rise to the future of enterprise data solutions – the concept of the Data Lakehouse.

Data warehouses and data lakes both served their purpose in the past, but they have in many ways devolved into data swamps – where analysis goes to die (cue the funeral dirge music). The costs to organizations are high: users either can’t find consistent sources of enriched data for analysis (current manifestations of data lakes), or they are told they can have access to that data in 3 months once all the structures have been updated, data sources mapped, and data loaded (current data warehouses). Luckily, there’s a new way to manage your data that delivers the best of both worlds and allows us to address data reliability and usability again: the Data Lakehouse.

THE HEYDAY OF DATA WAREHOUSING AND ITS SUBSEQUENT IMPLOSION

While data warehousing wasn’t new in the late 1990s and early 2000s, this is when it became mainstream in the IT/enterprise data world. Everybody who was anybody adopted a centralized data solution. However, a big problem emerged: these warehouses ballooned to unsustainable proportions. The list of reasons is long, but two stand out: antiquated relational database platforms and data proliferation (e.g., the rise of mobile phones and cellular networks).

Suffice it to say that the technology platforms were overwhelmed. IT/data organizations kept trying to wrangle more and more data sources, scale their processing power and storage capacity, and model their way out of it. They relied on relational databases to ease the burden of relating disparate pieces of data across an organization so that information could be queried and reported on consistently. But at the core of those platforms sat the one thing that both enabled and constrained the ability to relate data: indexes… lots and lots of ever-larger indexes, which created even more scalability problems.

Additionally, the already high demand for quick turnarounds on analytical requests rose even higher. IT/enterprise data teams were slow to respond – and getting slower. Data warehouses had become almost unmanageable, with increasingly complicated changes to the data models, the data integration routines (ETL or ELT), and the dashboards and reports.

THE RISE OF BIG DATA PLATFORMS AND DATA VISUALIZATION TOOLS

The world of enterprise data knew it would never be able to solve these problems within the limitations of its relational database platforms. So organizations migrated to newer database platforms designed to handle higher volumes, velocities, and varieties of data – thus creating big data platforms.

While these platforms were engineered to deal with the challenges of expansive data, they had inherent limitations. One such limitation was their architectural approach: removing the structure that went along with relational data warehouses (denormalizing it).

A new platform meant a newer approach to data modeling and indexing. Storing both structured and unstructured data became possible. However, the new big data platforms were undoing some of the positive things relational data warehouses had done, such as limiting the number of places where a particular data point was stored or calculated and enabling more consistent queries against that data. And although mastering the entire enterprise data model was now a prerequisite, it was something business users were rarely able to achieve. Thus, duplication of data ran rampant in big data.

Furthermore, while data was more readily available than before, these new platforms were so unique and different that there weren’t many tools giving business users access to this data.

In the meantime, business users were still getting asked questions about the business – questions that required data, often new data that had not been modeled, transformed, loaded, transformed again, and so on. So they searched for new ways to respond to these never-ending questions and found a new wave of visualization platforms that let them do some rudimentary modeling, transform and enrich their data, and produce vibrant visualizations.

Data visualization tools were a hit. After all, what executive doesn’t like to see more (and prettier) charts and graphs? (Cue the Rodney Dangerfield scene in Back to School.) This new class of tools allowed business users to recreate some of the same issues described above – the same problems enterprise data practitioners were now grappling with on big data platforms – duplication of data (source data and calculated metrics).

BIG DATA LAKES DEVOLVE INTO DATA SWAMPS

While big data and data visualization were intended to solve the issues plaguing data warehouses (exponential growth, unreliability), they’re not much better. That’s because the data lives in platforms designed to store and transform it but is architected in ways that:

  • Strip the value of conformed master data
  • Strip the consistency of metric definitions and business logic
  • No longer allow for anything reasonably close to a “single version of the truth.”

The term “single version of the truth” has been used (and abused) prolifically over the years. And a single version of the truth is a well-intentioned aspiration that, for most, is fool’s gold.

Why? Because while data platforms may allow for a “single version of the truth,” the platforms and architectures needed to deliver it are often too enormous to ever converge. The business can’t wait for it, and business users don’t have the patience or the knowledge/expertise to achieve it. Enterprises are stuck not knowing where to find the data scattered throughout multiple lakes and unsure whether the data they do find is what they’re looking for. They’re stuck in a data swamp.

DATA “LAKEHOUSE” TO THE RESCUE

Thankfully, enterprise data platforms continue to evolve – and there is hope for us yet to extract ourselves from the “swamp.” The answer lies in the newest data storage architecture that combines the cost-efficiency and flexibility of data lakes with data warehouses’ reliability and consistency. It’s called the Data “Lakehouse,” and the cloud enables it.

The emergence of the cloud has given database platforms the ability to address the scaling issues that plagued data warehouses and big data databases. Platforms like Amazon Redshift, Google BigQuery, Microsoft Azure, Snowflake, Databricks, etc., can almost instantaneously extend an architecture’s processing power so that it is not constrained by the time it takes to process data or the space/capacity available in an on-premises data center.

Without the scale complications, we can now focus on addressing data quality, reliability, and usability again.

THERE ARE THREE MAIN LAYERS OF DATA LAKEHOUSE ARCHITECTURE:

  1. Raw Staging Layer: Source-conformed data that persists and serves as the source of transaction data for all downstream consumers
  2. Business Logic Layer: Enhanced data layer that adds baseline enhancements to the data (data quality, data management)
  3. Analytics Information Layer: Transformed data models, designed for business consumption (applied metric definitions/business rules, subject area mart-like structures, advanced data functions such as Machine Learning/Artificial Intelligence)

RAW STAGING LAYER (RSL)

This architecture layer focuses on storing source-system and transactional data in raw, ingestion-friendly big data structures. Schemas and tables are modeled in a source-conformed manner (i.e., tables and columns organized in ways that closely mimic their transaction sources). Data typically arrives via an extract-then-load pattern with minimal transformation (ELT); a simplified sketch follows the list below. This persistent staging area has several advantages:

  • It allows subsequent layers in the architecture to be rebuilt after any issue without re-querying the transaction sources, thereby minimizing the impact on the systems generating those transactions.
  • It allows for data validation in source systems without impacting those systems.
  • High concurrency and high performance (for both data ingestion and queries), because data does not need to be modeled across the enterprise; those “business rules” are deferred to subsequent layers.
  • Ingestion of structured, semi-structured, and unstructured data (e.g., relational databases, NoSQL databases, IoT sensor data, social media data, etc.)
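
To make the extract-then-load pattern concrete, here is a minimal sketch of a source-conformed landing step, using pandas purely for illustration. The source file, the orders naming, and the raw-zone paths are hypothetical assumptions; a production lakehouse would more likely use a distributed engine and an open table format (e.g., Delta Lake or Iceberg) than single-node pandas.

```python
import pandas as pd
from pathlib import Path
from datetime import datetime, timezone

def land_raw_orders(source_csv: str, raw_zone_dir: str) -> str:
    """Extract-then-load (ELT): persist source data as-is in the Raw Staging Layer.

    No business rules are applied here; columns mirror the source system so
    downstream layers can be rebuilt without re-querying the source.
    """
    # Extract: read the source export exactly as delivered, keeping every
    # column as text so nothing is recast or "fixed" at this stage.
    orders = pd.read_csv(source_csv, dtype=str)

    # Add load metadata so each batch is traceable and replayable.
    load_ts = datetime.now(timezone.utc)
    orders["_load_timestamp"] = load_ts.isoformat()
    orders["_source_system"] = "orders_erp"  # hypothetical source identifier

    # Load: append to a date-partitioned path in the raw zone, unmodified.
    out_dir = Path(raw_zone_dir) / "orders" / f"load_date={load_ts:%Y-%m-%d}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "orders.parquet"
    orders.to_parquet(out_path, index=False)
    return str(out_path)
```

Because nothing downstream modifies this raw copy, any later layer can be rebuilt from it at any time.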

BUSINESS LOGIC LAYER (BLL)

This architecture layer takes data from the raw (persistent) staging layer and creates enhanced staging environments. The enhancements made at this stage are primarily centered around the following (there are others, but these are the most common, in my experience); a simplified sketch follows the list:

DATA QUALITY ENRICHMENT
  • Data type mismatching
  • Data deduplication
DATA MANAGEMENT ENRICHMENT
  • Consistent naming/values for key-defined Master Data Domains (e.g., date/time fields are made compatible).
CREATING ENHANCED STAGING MODELS (A.K.A. DATA LAKES – FLATTENING DATA STRUCTURES)
  • Each model closely resembles the source data (that is now coming from the Raw Staging Layer).
  • Same granularity as the source data, but columns have been renamed, recast, or reformatted into a consistent format (NULL or empty-string fields are made consistent) with any uniform business rules applied.
  • Some primary keys have been created and applied.
  • Some joins have been denormalized, and additional fields added for context or enrichment.
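
Here is a minimal sketch of what those enhancements might look like, continuing the hypothetical orders example from the Raw Staging Layer sketch above (including its _source_system column). The column names (source_order_id, cust_no, order_amount, ship_region) and the specific rules are illustrative assumptions, not a prescribed standard; the point is that type fixes, deduplication, standardized dates and NULLs, surrogate keys, and light denormalization all live in this layer rather than in downstream reports.

```python
import pandas as pd

def enhance_orders(raw_orders: pd.DataFrame, raw_customers: pd.DataFrame) -> pd.DataFrame:
    """Business Logic Layer: apply baseline data quality and data management rules."""
    df = raw_orders.copy()

    # Data quality: fix type mismatches and drop duplicates from re-delivered batches.
    df["order_amount"] = pd.to_numeric(df["order_amount"], errors="coerce")
    df = df.drop_duplicates(subset=["source_order_id"], keep="last")

    # Data management: standardize date/time values and NULL vs. empty-string handling.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce", utc=True)
    df["ship_region"] = df["ship_region"].replace("", pd.NA)

    # Consistent naming for a mastered domain (hypothetical naming convention).
    df = df.rename(columns={"cust_no": "customer_id"})

    # Create a surrogate primary key for the enhanced staging model.
    df["order_key"] = df["_source_system"] + ":" + df["source_order_id"]

    # Light denormalization: add customer context at the same grain as the source.
    customers = raw_customers.rename(columns={"cust_no": "customer_id"})
    df = df.merge(customers[["customer_id", "customer_segment"]], on="customer_id", how="left")
    return df
```

Note that the output stays at the same grain as the source; it is cleaner and more consistent, but it is not yet shaped for analysis – that is the next layer’s job.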

ANALYTICS INFORMATION LAYER (AIL)

This architecture layer focuses on creating models (data marts) that represent and describe business processes and entities. The goal of this layer is to support any manner of analysis, machine learning, etc. (a simplified sketch follows this list):

  • Data in these marts is abstracted from the data sources it is based on and is typically modeled into fact and dimension structures.
  • Data in these fact tables has been the subject of substantive transformation, including but not limited to metric derivation.
  • Often, these marts are organized by the business units they are associated with – finance, marketing, operations, product management, etc.
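
A minimal sketch of this layer, continuing the hypothetical orders example: it shapes the enhanced data into one dimension table and one daily-grain fact table with derived metrics. The metric definitions (order_count, gross_revenue, avg_order_value) are illustrative assumptions; in practice they would encode the business’s own agreed-upon rules.

```python
import pandas as pd

def build_sales_mart(enhanced_orders: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Analytics Information Layer: shape enhanced data into a fact table and a dimension."""
    # Dimension: one row per customer, carrying descriptive attributes.
    dim_customer = (
        enhanced_orders[["customer_id", "customer_segment"]]
        .drop_duplicates(subset=["customer_id"])
        .reset_index(drop=True)
    )

    # Fact: daily grain with derived metrics; business rules are applied once here,
    # not re-implemented in every dashboard or notebook.
    fact_daily_sales = (
        enhanced_orders
        .assign(order_day=enhanced_orders["order_date"].dt.date)
        .groupby(["order_day", "customer_id"], as_index=False)
        .agg(order_count=("order_key", "nunique"),
             gross_revenue=("order_amount", "sum"))
    )
    fact_daily_sales["avg_order_value"] = (
        fact_daily_sales["gross_revenue"] / fact_daily_sales["order_count"]
    )
    return fact_daily_sales, dim_customer
```

Because the metric logic is applied once in this layer, every report, dashboard, or ML feature built on the mart inherits the same definitions – much closer to the “single version of the truth” the earlier architectures struggled to deliver.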

WHAT’S NEXT ON THE HORIZON FOR DATA?

I believe the Data Lakehouse, in principle, is here to stay. This relatively new approach allows for both freedom and flexibility of data while providing reliable structure and definition. I am sure we will see additional evolution – and that’s a good thing. The key is to derive more and more business value from data. No matter the industry, every company should view its data as an asset that needs to be invested in and cultivated. When combined with the flexibility and scalability of cloud architectures, concepts like the Data Lakehouse will no doubt help customers begin to realize this goal.

At the risk of sounding like a curmudgeon, it’s also nice to see the new kids on the block – data science, machine learning (ML/MLOps), and artificial intelligence (AI) – showing up to play in this space. Not only do the new platforms and architectures support Data Lakehouses, they are also beginning to natively support end-to-end analytical workflows by directly supporting ML/AI models. But that’s another discussion for another time.

IS DATA AT THE CENTER OF YOUR STRATEGY?

Transformation begins with a solid data strategy foundation. If you lack that foundation, your data strategy can become disorganized, resulting in too many disconnected pockets of data. We have the tools and experts to show you how to build trust in your analytics and reporting. Schedule a half-day workshop with our data experts for advice on leveraging data for your company’s transformation.


ABOUT THE AUTHOR

Kevin Parker, Principal and Data Lakehouse expert, is an experienced professional who specializes in helping businesses leverage their information and data assets. Kevin guides our clients to maximize their business performance, offering significant breadth and depth of expertise in building, optimizing, and managing business information and data management capabilities. He is an effective data leader who works successfully with C-level executives and managers across all functional business areas, helping them feel confident in their business decisions.
Connect with Kevin on LinkedIn
