Is the well-established enterprise data warehouse dead?

Find out whether the enterprise data warehouse is finally dead with the propagation of data mesh in this latest episode of Data Automation Debates.

Introduction

For a long time, consolidating analytical data in a central warehouse, the enterprise data warehouse (EDWH), has been the go-to approach for achieving a single point of truth. However, because of its limitations, the opposite, decentralized approach has been pursued in recent years, for instance with data mesh. In this first conversation in the Data Automation Debates series, we cover the following topics:

Is the enterprise data warehouse finally dead with the propagation of data mesh?
What are the reasons to go from a centralized to a completely decentralized approach?
Is a decentralized approach suitable for every company?
What should I do if I still have an enterprise data warehouse in place?
Hints and tips

Speakers

Mike Ferguson

CEO of Intelligent Business Strategies, an independent I.T. analyst, consulting, research, and education company specializing in data management and analytics
Chairman of Big Data London

Gregor Zeiler

CEO of biGENIUS

Conversation

Mike Ferguson (referred to as "M" below)

Gregor Zeiler (referred to as "G" below)

G: So Mike, the first question for you is: Is the enterprise data warehouse finally dead with the propagation of data mesh?

M: Absolutely not, at least not in my opinion. I think data mesh is what I would describe as upstream from a data warehouse, in that you can build data products, which can then be used to construct a data warehouse. So I don't see data mesh replacing the data warehouse. I see it, hopefully, accelerating the development of data warehouses and other analytical systems, as well as machine learning models and things like that.

G: Okay, but you agree that it's a completely different approach than the enterprise or centralized data warehouse.

M: Absolutely. In the past, we relied on I.T. to build everything: to design the data model for your data warehouse and to build all of the ETL pipelines. And now that's not the case. We are trying to decentralize the data engineering effort with data mesh in order to get more people involved in engineering data, and we need that because there's so much more data. When we first started building data warehouses, we had, let's say, just the mainstream transaction processing systems as data sources. Today, we have hundreds, sometimes even thousands, of data sources coming into the enterprise. So centralized I.T. would be completely overwhelmed if they didn't get other people in the enterprise helping out with this data engineering challenge, now that we have so much more data to look at.

G: What are the reasons to go from a centralized approach to a completely decentralized approach?

M: Well, I think the main reason is I.T. being overwhelmed. Of course, we have highly skilled people in I.T. The problem is that with so much more data entering the enterprise, we are seeing an increase in the number of data sources, but also an increase in the number of users wanting to analyze that data across the organization. The whole thing is getting to a point where I.T. is being outpaced by new users and more data, so we really need to fix that problem: there's a bottleneck, and a backlog to clear. The other problem, of course, is that centralized analytical systems have emerged beyond [the data] warehouse: we have data lakes, we have graph databases, we have people working on master data management, we have people working on streaming analytics. They've all become silos. And I think another reason for doing this is that we are constantly repeating the cleaning and integration of the same data for multiple analytical use cases, when really we should be able to build it once and reuse it everywhere. For that reason, I think we need to stop reinventing and start reusing. And that's another reason why we are going for a decentralized approach.

G: Is a decentralized approach suitable for every company?

M: It's a great question. I'm not so sure the decentralized approach is the right idea at all. I think the right idea is a federated approach. Because if you just decentralize, and have everybody working on different things around the enterprise, you can lose sight of what's going on, and things just start to become multiple disparate projects. We need some kind of program office, some kind of CDO office, that's going to coordinate all of these projects, even if we are now decentralizing data engineering around the organization. For any company that is experiencing a bottleneck in data engineering, and experiencing higher data engineering costs than it should because it is reinventing instead of reusing, I think it's definitely worth looking at [the decentralized approach]. The intention is really not to destroy the expertise within I.T. It's to upskill other people in the organization to get more people doing this, because a company needs to get out more data, bring it together, and add it to what it already knows. We need more people involved in doing that, so I think I.T. has a very important future here. That future is to enable and help upskill these people around the organization, and also to help them adopt best practices, while trying to hide the complexity and help them do this job more rapidly.

G: You have mentioned that it's more of a federated approach, [...] so what should I do if I have an enterprise data warehouse in place? Should I panic, throw it away, move to data mesh? What do you suggest for that?

M: No, I don't think you need to panic. But I do think you need to look at how you construct your ETL jobs for data warehouses. Rather than building ETL jobs to populate large parts of a data model, I think we could break it apart and say: okay, what are the main data entities within our data warehouse? What are the main dimensions, the main transaction tables or fact tables? And then start to build pipelines that are each associated with an individual dimension, for example the customer dimension or the products dimension, or with a fact table such as orders. If I'm able to construct pipelines that just build customer [dimension], or just build the orders [facts], or just build the products [dimension], then I can assemble these quickly to construct the data warehouses or data marts that the organization wants to analyze. So I think perhaps revisit a data concept model, look to see which dimensions and facts are used within the data warehouse, and then ask: can we redesign some of these ETL jobs to be more modular and less monolithic? That would make it much easier to construct the building blocks for the data warehouse. And that's what I would be focusing on: how do you redesign and modularize the development of your existing data warehouse, rather than throw it away.
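
To illustrate the modular pipelines Mike describes, here is a minimal Python sketch, not a reference implementation. Everything in it is an assumption made for illustration: the sample records, the table names (`dim_customer`, `dim_product`, `fact_orders`), and the functions `build_customer_dimension`, `build_product_dimension`, `build_orders_fact`, and `assemble` are all hypothetical. The idea it tries to show is that each dimension or fact table gets its own small pipeline, and a warehouse or data mart is assembled from whichever blocks it needs.

```python
from typing import Callable

Row = dict[str, object]

# Hypothetical source extracts; a real pipeline would read from source systems.
RAW_CUSTOMERS = [
    {"id": 1, "name": " Alice "},
    {"id": 2, "name": "Bob"},
    {"id": 2, "name": "Bob"},  # duplicate record, cleaned once in one place
]
RAW_PRODUCTS = [{"sku": "A-1", "title": "widget"}]
RAW_ORDERS = [
    {"order_id": 10, "customer_id": 1, "sku": "A-1", "qty": 3},
    {"order_id": 11, "customer_id": 2, "sku": "A-1", "qty": 1},
]


def build_customer_dimension() -> list[Row]:
    """Pipeline that builds only the customer dimension (trim and deduplicate)."""
    by_id: dict[object, Row] = {}
    for rec in RAW_CUSTOMERS:
        by_id[rec["id"]] = {
            "customer_id": rec["id"],
            "name": str(rec["name"]).strip(),
        }
    return list(by_id.values())


def build_product_dimension() -> list[Row]:
    """Pipeline that builds only the products dimension."""
    return [{"sku": r["sku"], "title": str(r["title"]).title()} for r in RAW_PRODUCTS]


def build_orders_fact() -> list[Row]:
    """Pipeline that builds only the orders fact table."""
    return [dict(r) for r in RAW_ORDERS]


# Registry of independent building blocks, one per dimension or fact table.
PIPELINES: dict[str, Callable[[], list[Row]]] = {
    "dim_customer": build_customer_dimension,
    "dim_product": build_product_dimension,
    "fact_orders": build_orders_fact,
}


def assemble(names: list[str]) -> dict[str, list[Row]]:
    """Assemble a warehouse or data mart from the requested building blocks."""
    return {name: PIPELINES[name]() for name in names}


# A small sales mart that reuses two blocks; the full warehouse would use all three.
sales_mart = assemble(["dim_customer", "fact_orders"])
```

Because each block is built by exactly one function, a second data mart that also needs customers would call the same `build_customer_dimension` rather than repeating the cleaning logic, which is the "build it once, reuse it everywhere" point made earlier in the conversation.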

G: Okay, that sounds a little like modernization in the direction of data domains, or so-called data products. I think what you're suggesting is very helpful for most users in this area who still have enterprise data warehouse solutions in place: to think about improving the effectiveness of their systems, and about the domains that they can manage. So thank you very much for that. Do you have additional hints and tips for us in this case? What should I do if I still have that enterprise data warehouse approach?

M: I have got a class on best practices for implementing data mesh, which very much considers how you can do this even if you have existing data warehouses that you have to support. There are also a lot of other good sources of information out there about the move to a decentralized data engineering approach. (Links listed below.)