9 min read

The role of data mesh

Find out what data mesh is and what problem it is trying to address in the latest episode of Data Automation Debates.

Posted on: April 11, 2023

Introduction

Over the last 18 months or so, data mesh has rapidly emerged as a new approach to producing business-ready data, but what is it in general? And what is the role of data management and analytics in enabling a data-driven enterprise? In this second conversation in the Data Automation Debates series, we cover the following topics:

  1. What is data mesh and what problem is it trying to address?
  2. What are data products?
  3. How does data mesh impact the way companies build their analytical systems?
  4. What are the key factors to a successful implementation of data mesh?
  5. Hints and tips

Speakers

Mike Ferguson

Gregor Zeiler

Conversation

Mike Ferguson (referred to as "M" below)

Gregor Zeiler (referred to as "G" below)

G: Last time we talked about whether the enterprise data warehouse is finally dead with the propagation of data mesh, and today we want to step into the role of data mesh. [...] My first question to you is: what is data mesh in general, and what problem is it trying to address?

M: Data mesh is really a decentralized approach to data engineering in order to produce high-quality, reusable, and compliant data products, which are essentially data sets that can be used for multiple analytical workloads. So the idea is that we get more people involved in data engineering around the enterprise. The reason it's come about, I think, is that historically we've had IT do everything on behalf of the business, and now what we're talking about is: can we empower more people around the organization to engineer data, to speed up our ability to create business-ready data that is fit for purpose and that we can immediately start using in analytical workloads? We've been here before, back in 2013, 2014. We saw the emergence of self-service data prep, and that was very chaotic. Everybody kind of grabbed tools and started building, and we lost control of the situation. I think this time is different: we're trying to coordinate development while having multiple data engineers do this across the enterprise.

I think the reason it's emerged, and the problem it's trying to solve, is that as we get more and more data, we've gone way beyond data warehousing. Now we have multiple analytical systems. For example, cloud storage with data science tools to build machine learning models; graph databases supporting a specific area of analytics such as fraud or cybercrime; traditional data warehouses; streaming data analytics such as IoT data, perhaps for manufacturing; or live clickstream data to monitor what people are doing on your website while they're on it. So we've ended up with multiple silos, different analytical systems in addition to the data warehouse, and people are engineering data for each of those central systems. The problem is that we're often re-engineering the same data again and again for different systems. People in these different systems are also using different tools, and we can't see who's building what. We're reinventing, we're not reusing. The cost of engineering is too high, and on top of that we have a bottleneck.

So for me, data mesh is trying to address all of that: get more people building these pipelines using modern techniques like DataOps, break the problem down, and create reusable data products that we build once and share across multiple analytical workloads. Hopefully that will save us a significant amount of time, reduce the cost, and therefore shorten the time to value.

In simple terms, data products are reusable data sets that can be consumed by different analytical systems in support of different analytical workloads.

G: You have mentioned a very interesting topic - data products. Is data now managed as a product? Or what does that mean? Can you tell us something about that?

M: Of course. In simple terms, data products are reusable data sets that can be consumed by different analytical systems in support of different analytical workloads. And so the idea is you build them once, and you reuse them everywhere. If you look at the formal definition in the data mesh papers written by Zhamak Dehghani, she talks about additional capabilities of data products: that they are discoverable, addressable, trustworthy, self-describing, interoperable, and secure. So in that sense, there's more to it than just the data. It involves the pipeline that creates it, the metadata associated with it, the execution engine and the configuration needed to run the pipeline, and of course the data product itself, with an API or an interface to get at that data. So that addresses, if you like, the idea of reusable data products. I think of them as either physical data products, or virtual data products where the data is not persisted but it looks like it is, and even potentially stored SQL queries or pipelines that you can invoke as a service, which, when they execute, will immediately produce the data product on demand. I distinguish that from analytical products, which I think are things like reports and machine learning models that are built using the data from data products. I think data products are reusable physical or virtual data sets that are used for multiple analytical workloads.
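To make that a bit more concrete, here is a minimal sketch in Python of what a self-describing data product interface might look like: metadata for discoverability and ownership, a schema for interoperability, and a single read interface that runs the pipeline on demand (the "virtual" case Mike describes). All names here are hypothetical illustrations of the concept, not part of any particular data mesh implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Illustrative only: a minimal, self-describing data product.
# All names (DataProduct, read, orders_by_customer) are hypothetical.

@dataclass
class DataProduct:
    name: str                           # addressable: a unique, published name
    owner: str                          # the domain team accountable for it
    description: str                    # self-describing: what the data means
    schema: Dict[str, str]              # column name -> type, for interoperability
    pipeline: Callable[[], List[dict]]  # the logic that produces the data

    def read(self) -> List[dict]:
        """The consuming interface: run the pipeline (or return persisted data)."""
        return self.pipeline()

# A "virtual" data product: nothing is persisted, the pipeline runs on demand.
orders_by_customer = DataProduct(
    name="sales.orders_by_customer",
    owner="sales-domain-team",
    description="Total order value per customer, produced on each read",
    schema={"customer_id": "string", "total_order_value": "decimal"},
    pipeline=lambda: [{"customer_id": "C001", "total_order_value": 1250.00}],
)

print(orders_by_customer.read())
```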

G: Managing data as a product sounds like there is a higher level of service-level agreement in terms of quality and availability of that data. [...] So does data mesh impact the way companies are building their analytical systems?

M: Absolutely yes, because data products need to be published somewhere. I think we're seeing the emergence of things like a data marketplace as a kind of place where you can publish these data products. I think when we're building analytical systems, we should be looking at what data products are already built and available, so we can start to consume them and use them to build analytical systems more rapidly. So for me, we're changing - we're going from "let's build everything from scratch" to "let's assemble these ready-made analytical products". And so we're kind of getting into a more industrial development of this. It's like we've got these ready-made components that we can assemble quickly into a new analytical system, whether it's for a graph database, a data warehouse, a data mart, or features for data science. I think all of them can be created using data products.

We're going from "let's build everything from scratch" to "let's assemble these ready-made analytical products" [...] getting into a more industrial development of this.
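As a rough illustration of that assembly idea, the sketch below (Python, with hypothetical names throughout) treats the marketplace as a simple registry of published data products and shows the same product feeding two different analytical workloads, rather than each team re-engineering the data from scratch.

```python
from typing import Callable, Dict, List

# Hypothetical sketch: the marketplace is just a registry of published
# data products; consumers look products up rather than rebuilding pipelines.
marketplace: Dict[str, Callable[[], List[dict]]] = {
    "sales.orders_by_customer": lambda: [
        {"customer_id": "C001", "total_order_value": 1250.00},
        {"customer_id": "C002", "total_order_value": 310.50},
    ],
}

def consume(product_name: str) -> List[dict]:
    """Fetch an already-built data product by its published name."""
    return marketplace[product_name]()

rows = consume("sales.orders_by_customer")

# The same data product serves two different analytical systems:
data_mart_rows = rows                                    # e.g. load into a data mart
ml_features = [r["total_order_value"] for r in rows]     # e.g. a feature for a model

print(len(data_mart_rows), ml_features)
```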

G: What do you think are the key factors to enable a successful implementation of a data mesh?

M: That's a great question. For me, there are a number of key factors. The first key factor has nothing to do with technology - it's to do with organization: how do you organize this, who's building what data products for what purpose. I think this is where you need a centralized program office to coordinate all of the other projects going on around the enterprise, so that you can see what you're trying to achieve as a business, say who's building which data products you need to do that, and then coordinate that activity. So I think organizationally, it's very important to create a program office. I also think data fabric software, to connect to all the different data sources and to allow collaborative development when building these data products, is important. There are also some key areas of technology that are needed. I think a data catalog is very important to understand the data that's out there. Master data management is also important, because some of the data products we build will be master data products like customer data, supplier data, or employee data. I also think a data marketplace is very important, to be able to publish the existing data products so people can easily find what's already made and what's already available - so they don't reinvent it - and who owns it, and how it was built. I think these areas in particular are very important. You also need common data names in a business glossary for your data products, so that you're sharing data with common data names that everyone understands, [and] the tools to quickly understand what raw data you have to build them, and the data marketplace to publish the data products you have created. And of course, I think there should be governance around the sharing of data products, to make sure that we track and monitor who's using these data products around the enterprise, and for what purpose.
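One small, hypothetical illustration of that last point - governing the sharing of data products - is to record every consumption of a published product along with who used it and for what purpose, so usage can be tracked and monitored across the enterprise. The sketch below is a minimal Python example of that idea, not a prescription for any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List

# Hypothetical sketch of governance around data product sharing:
# every access is recorded with the consumer and the stated purpose.

@dataclass
class AccessRecord:
    product: str
    consumer: str
    purpose: str
    accessed_at: datetime

access_log: List[AccessRecord] = []

def consume(product: str, consumer: str, purpose: str) -> None:
    """Record who consumed which data product, and why, before handing over the data."""
    access_log.append(
        AccessRecord(product, consumer, purpose, datetime.now(timezone.utc))
    )
    # ... fetch and return the data product itself here ...

consume("sales.orders_by_customer", "marketing-analytics", "campaign targeting model")
consume("sales.orders_by_customer", "finance-reporting", "quarterly revenue data mart")

for record in access_log:
    print(record.product, record.consumer, record.purpose, record.accessed_at.isoformat())
```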

G: Are there additional resources that can help me dive deeper into the possible steps to follow for that [decentralized] approach?

M: Yes, I have an education class on the practical guidelines for implementing a data mesh, which goes into all of this to help you be as successful as possible with your data mesh implementation. I can also provide links to some of the things I've mentioned here today. And of course, if you want to go and look for Zhamak Dehghani's book on implementing data mesh as well, that's available I think from O'Reilly publishing, and I believe there's also a data mesh Meetup group out there - a community, if you like, where people are discussing these implementations. (Links listed below.)

Full video

Links

Data Automation Debates series 1 with Mike Ferguson

Data Mesh book

Practical Guidelines for implementing a Data Mesh course

Data Mesh Meetup

Future-proof your data with biGENIUS-X today.

Accelerate and automate your analytical data workflow with the comprehensive features that biGENIUS-X offers.