Data Automation Debates (DADs) is a series of conversations between experts in the field of data management and data automation.
Data mesh is about democratizing data engineering: getting more business-domain-oriented data engineers producing data products to share across the enterprise. In this fourth conversation in the Data Automation Debates series, we cover topics surrounding the question - how can data product development be accelerated to shorten time-to-value?
Mike Ferguson (referred to as "M" below)
Gregor Zeiler (referred to as "G" below)
G: So what are the main things, Mike, that you can do to accelerate data product development in a data mesh?
M: I think there are a lot of things you can do here, Gregor, and they're not all to do with technology. We spoke briefly in the last Data Automation Debates about the idea of a centralized program office. This is really a way to coordinate what's going on, so that everyone is aware of who's building what. It stops us from reinventing things, and it also prioritizes development. That's the first, key organizational thing that I think helps to accelerate development. In addition to that, there are a number of things we can do. One is to standardize, so that we don't have people doing common things differently every time; we take away the complexity and avoid it, and get consistent approaches to building data products and to facilitating data sharing. We can have a common approach to organizing storage for data products, so that different teams work in a consistent fashion there. I think common tooling, so that we can share metadata, is important. That allows us to become more productive and to reuse things. You can create templates to get people started more rapidly, with a common approach to development. You can create common libraries of shared services, so that if you need to get data from a file, there's a standard component to do that, or if you need to mask data, there's a standard way to do that. We should not be inventing all of these things again and again across all of these teams; we should just be reusing components. If you want to convert voice to text, because it's unstructured data, you should have a standard service to do that: give it voice, get back text. We don't want to have to completely reinvent these things again and again.
Of course, DevOps or DataOps for version control and automated testing - all of these things help to speed up development. Having a standard approach to publishing data products and setting policies to share them in a data mesh again helps to speed things up and keep things consistent, as does a standard approach to building APIs and making your data products available in a data mesh. All of these are useful things we can do to remove unnecessary complexity and speed people up in overall development, and I think sharing metadata is also extremely important in helping to do that.
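As a sketch of what one of the shared-service components Mike mentions might look like - the function name, salt handling, and email focus here are illustrative assumptions, not a real library - a team-wide data masking utility could be as simple as:

```python
import hashlib

def mask_email(value: str, salt: str = "team-shared-salt") -> str:
    """Deterministically mask an email address while keeping the domain.

    Hypothetical shared-service component: every data product team calls
    this one function instead of writing its own masking logic, so masked
    values stay consistent (and joinable) across teams.
    """
    local, _, domain = value.partition("@")
    digest = hashlib.sha256((salt + local).encode()).hexdigest()[:12]
    return f"{digest}@{domain}"

# The same input always masks to the same value, across every team:
print(mask_email("alice@example.com"))
```

Because the masking is deterministic, two teams masking the same source column independently still produce matching values, which is exactly the consistency a shared component buys you over per-team reinvention.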
G: You have mentioned DevOps and DataOps. How does DataOps contribute to accelerating data product development?
M: This is a great question. It's really about the fact that data engineers, data professionals, and data architects are often not so familiar with the principles of DevOps. DevOps first emerged in software engineering, and application developers in many organizations are very familiar with it; perhaps data scientists are more familiar with it as well, because they've been writing code and checking in code, and things like that. But I think we need to adopt the best practices that came out of there and use them in building data pipelines. That means moving away from monolithic pipeline development; the shift from monolithic to component-based pipelines is similar to the shift from monolithic applications to microservices. We need to go there, I think. Within a team, a developer can create a new branch for a new component they're working on, issue a pull request so that people can test it out, and then merge it - with integration into GitHub, for example, as a common version management system to control the development of pipelines. As soon as you check something in, we can automatically trigger testing, and automatically trigger deployment if the tests pass - managing the movement from development through test and into production in a more automated way. So I think what we're seeing with DataOps is really a way to facilitate collaborative development, so that we don't have just one data engineer building a pipeline, but multiple members of the team building it. We get coordinated and collaborative development, automated testing, automated deployment, and version control, all managed. It is a more rapid way to build pipelines, as we've seen with software: the development cycle used to be very long, and now it's very short, as people release new versions or new microservices.
We want to adopt the same principle with data engineering - to shorten the development time, and make more consistent approaches to being able to automate testing, and automate deployment.
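The check-in-triggers-testing workflow Mike describes depends on pipeline components actually having automated tests for CI to run. A minimal sketch of what such a test might look like - the component, data, and test names here are hypothetical, not from any specific tool:

```python
# Hypothetical pipeline component plus the automated test that a CI system
# (e.g., one triggered by a pull request) would run before merge and deploy.

def transform(rows):
    """The pipeline component under test: normalizes customer records."""
    return [
        {"id": r["id"], "email": r["email"].strip().lower()}
        for r in rows
        if r.get("email")  # drop rows with no email
    ]

def test_transform_normalizes_and_filters():
    raw = [
        {"id": 1, "email": "  Alice@Example.COM "},
        {"id": 2, "email": None},
    ]
    out = transform(raw)
    # Emails are trimmed and lowercased; rows without an email are dropped.
    assert out == [{"id": 1, "email": "alice@example.com"}]

test_transform_normalizes_and_filters()
print("all pipeline tests passed")
```

Because the component is small and independently testable, a failed check-in blocks only that component's merge rather than an entire monolithic pipeline - which is the microservices parallel Mike draws.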
G: You have mentioned not reinventing things every time, and reducing tedious manual work by using standards, patterns, templates, and so on. This leads me to the question - can data automation software be used to help accelerate data product development?
M: That's a fabulous question for me. Absolutely. Think about what was going on in our industry 30 years ago, when we started doing ETL development for data warehouses: we were manually building everything. I still see a lot of people doing that, and yet now we have catalogs that can automatically discover what data is out there; they can generate metadata, tell you about the quality of the data, and tell you where the sensitive data is. So the catalog is holding all of this rich metadata, and then you think: if I know the common data names in a glossary, the sources, and the mappings, can I not generate these pipelines? I think that's super exciting. If we use data automation software to generate the pipelines and to create the data products, we're going to dramatically shorten the time again. So for me this is a big future for automation. Over the last 20 years we've seen data warehouse automation, but this same technology is now able to tap into the catalog and use it to build data products, which are used not just for data warehouses but for other analytical use cases. So I absolutely think that data automation software can be used in the development of data products in a data mesh. In fact, I think this could very easily be the future going forward.
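As an illustration of the metadata-driven generation Mike describes - the catalog mapping schema below is invented for the example, not any real catalog's format - a pipeline's SQL can be generated from source-to-target mappings instead of being hand-written:

```python
# Toy sketch: the catalog supplies glossary names, sources, and mappings,
# and the data product's pipeline code is generated from that metadata.

catalog_mapping = {
    "target": "customer_data_product",
    "source": "crm.customers",
    "columns": [
        {"source": "cust_nm", "target": "customer_name", "transform": "upper"},
        {"source": "cust_id", "target": "customer_id", "transform": None},
    ],
}

def generate_sql(mapping: dict) -> str:
    """Generate a CREATE TABLE ... AS SELECT for the data product."""
    exprs = []
    for col in mapping["columns"]:
        expr = col["source"]
        if col["transform"] == "upper":
            expr = f"UPPER({expr})"
        exprs.append(f"{expr} AS {col['target']}")
    return (
        f"CREATE TABLE {mapping['target']} AS\n"
        f"SELECT {', '.join(exprs)}\n"
        f"FROM {mapping['source']}"
    )

print(generate_sql(catalog_mapping))
```

Real data automation products work at far greater scale, but the principle is the same: once the mappings live in the catalog as metadata, regenerating a pipeline is cheap, which is where the acceleration comes from.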
G: That sounds great, especially for us. Every time I discuss automation with someone, we talk about speeding up development, but what really matters is the time-to-value, not the time to produce. So that leads me to this question - can data automation software be used to help accelerate data product consumption as well, not only development?
M: 100%. Because, you know, if you build these data products and publish them in a marketplace, then people want to use them. They may need to select the data products that help them build a machine learning model in data science, or create a graph database for fraud analytics, or build a data warehouse. Again, I need to be able to build a pipeline for that, but this time the data is already clean; it's already available in a data product. We still need those pipelines to consume the data products and quickly create the data for a feature set for a machine learning model, or to put in a feature store, a data warehouse, or a graph database. So I absolutely believe data automation tools can be used in consumption, because they help you consume quicker. Not only that, but the data mesh itself supports the idea that one data product can become the input to a pipeline that creates another data product, and so again we need another pipeline. So for me, data automation tools absolutely can be used to create consumption pipelines more rapidly, to shorten the time-to-value again. We can definitely see the ability to use data automation software to create the data products in the first place, and also to quickly create the pipelines that consume the data products to support the creation of different analytical workloads.
G: That's perfect, Mike. So we have covered a lot of topics today that could help to accelerate producing data products. Do you have some hints and tips on the really important areas to consider in accelerating data product development and successfully implementing a data mesh? Because I think when it's implemented as fast as possible, and the time-to-value is there, then it will be successful at the end of the day.
M: Of course. If you want background on data mesh, there's Zhamak Dehghani's book on the data mesh concept. I also have a two-day education class on best practices for implementing a data mesh, which covers data catalogs and the data fabric we spoke about in the last discussion. I also talk about data marketplaces and the potential promise of data automation to really accelerate the development of data products. I can also recommend the use of data automation in data warehouse modernization, not just for data mesh, because data warehouse modernization these days may involve migrating your data warehouse to the cloud, in which case you can use automation tools to help you do that. Similarly, if you're going to modernize your schema, moving from, let's say, a Kimball approach to Data Vault, you can use data warehouse automation tools to do that automatically, deal with the complexity, and significantly reduce the risk of your data warehouse migration project. And if you're going to build data warehouses on the cloud, then again you can use data warehouse automation, or data automation software, to do that too. So for me, there are multiple use cases for data automation software. I have additional classes on data warehouse automation and on data warehouse migration to the cloud, and of course there's a lot more content out there as well that you can gain access to; I'll provide you with some links. (Links listed below)
G: Perfect. Thank you, Mike. You have mentioned a lot of additional use cases, and I think we have a couple of additional topics to discuss in the future as well, so thank you very much for today, and see you next time.
This blog explores the significance of data mesh patterns in enabling organizations to embrace agility and achieve successful data democratization initiatives.
Discover the transformative era of data management with the concept of data mesh, as well as explore data democracy, and the paradigm shift towards data mesh and data fabric.