Guest AriyaKhamvongsa Posted April 5, 2023

Welcome back to Armchair Architects, part of the Azure Enablement Show. Today we will be discussing what data mesh architecture is. Our host is David Blank-Edelman, joined by our armchair architects Uli Homann and Eric Charran. In our last blog we talked about how you get meaning from your data and started on implementation ideas for making data meaningful; today we will discuss data mesh architecture as part of that concept.

Data mesh architecture has two main components: data silos and implementation.

From Eric's perspective, a data mesh architecture encompasses two concepts. First, there are silos of data, usually deep in a specific domain, that you want to democratize and expose to other engines or stores. Second, there is the implementation of those silos, or nodes, of domain data: exposing each one through an API service, much like a microservice architecture. In this mesh network of data nodes, each node has a specific domain responsibility. What happens underneath that responsibility is not the business of the consuming service; it is purely the responsibility of that domain's node. That means underneath each data node there must be an API service through which you push and pull data to conduct transactions; a query methodology whose engine understands where the data is stored, optimizes the query, and goes after the data; a data storage component; and a way to ingest data into that particular node. We usually refer to these as nodes because the best way to think about them is as microservices.
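The four responsibilities Eric lists for one node can be sketched in a few lines. This is a minimal, hypothetical illustration, not a real framework: the class name, the in-memory dictionary standing in for the storage component, and the method names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class OrdersDomainNode:
    """One hypothetical node in the mesh, owning an 'orders' domain end to end."""
    storage: dict = field(default_factory=dict)  # storage component (in-memory stand-in)

    # Ingestion / ingress: how data gets into this node.
    def ingest(self, record_id: str, record: dict) -> None:
        self.storage[record_id] = record

    # API service: consumers push and pull records to conduct transactions.
    def pull(self, record_id: str) -> Optional[dict]:
        return self.storage.get(record_id)

    # Query methodology: consumers state what they want; the node alone
    # decides where the data lives and how to go after it.
    def query(self, predicate: Callable[[dict], bool]) -> list:
        return [r for r in self.storage.values() if predicate(r)]

node = OrdersDomainNode()
node.ingest("o-1", {"customer": "acme", "total": 120})
node.ingest("o-2", {"customer": "globex", "total": 75})
large_orders = node.query(lambda r: r["total"] > 100)
```

The point of the sketch is the boundary: a consumer calls `pull` or `query` and never learns whether the storage underneath is a dictionary, a SQL database, or an HDFS folder.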
The idea is that there is a data microservice, or node, for each large domain, and in large organizations the goal is for you to make API calls to answer your query. Whether you're an insight application, a transactional application, or a website, you're talking to domain-specific data platforms.

Data meshes should contain a very strong governance model.

Uli added that data mesh architecture is not necessarily new, but it is coming into focus because of the failure of data lakes. There has been a realization that trying to centralize all data into one place, one methodology, one approach just doesn't work in most organizations; not all, but most. It's too complicated because of domain ownership and the technology choices teams have made. People don't want to give up those choices because some CTO or leadership said "we're going to do X," meaning everybody has to snap to X without regard for previous technology choices, tools, or processes. Instead, we accept that there are different places where data is handled and analytics are run, and what we really care about is twofold. First, we want to know what data is available where. So part of the data mesh architecture that Eric didn't mention is a very strong centralized governance model, and that model exists for two reasons. One is visibility: a common problem with distributed architectures, which a mesh is, is that nobody knows what data is available where, who has access to it, or whether the organization is at risk from a compliance perspective. These are questions people now wrestle with because compliance on data is very important. So a governance system is often a key part of a data mesh architecture.
The second reason is the reality that one team has picked Cloud A, another has picked Cloud B, and both have made technology choices over time, so let's not disrupt all of that. Let's figure out a way to get value out of the investments people have made: bring the data together so people can go to Cloud B, get the data, and put it into their own system, without completely abandoning what they've built. Some teams do abandon their investments for better tools, and there are good reasons for that too. But a lot of the time Uli sees data mesh architecture as a realization that the world is complex.

Why is the governance model not a discovery model?

David wondered why Uli is talking about governance rather than discovery. Eric had a conversation about this with one of his colleagues. From his perspective, governance used to be very much about control, audit, access, and RBAC, and it still is, but an element of discovery has also been folded into the overall concept of governance. He thinks that is especially relevant for a data mesh architecture, because when you consume these data microservices the data is largely unseen; it's just API calls. Without a strong governance layer you can't maintain regulatory compliance or ensure you don't end up with data mesh nodes that are sisters, cousins, or brothers of each other. You have to have a catalog, and that catalog has to expose which services you can, and should, have access to.

Data is usually passive and requires a query or some other action to be discoverable.

Uli added that data often doesn't announce itself. Unlike services, data offers very little self-description, so discovery happens more as a search-like activity.
I understand where my sources are, and I'm going to go after the metadata that describes the data. While there will be places where data is self-describing and can announce itself, data is oftentimes much more passive: in a database, you must query to figure out what's going on. Uli thinks that's one reason discovery works together with governance. The other piece is that the discovery system leads to a catalog, and the catalog is then used to drive governance. The idea is to know who has access to what, since data is becoming a point of differentiation on one side while legal concerns grow on the other. If you collect a lot of data about people, California's privacy law, the European GDPR, and the Schrems II decision all become important, and having a grip on where your data is, what it means, and who has access to it is a key piece.

Data meshes help determine what data you have and who has access to it.

With a data mesh model, it is also key to know what data you even have. A lot of organizations really don't know what data they have, what it means, or who has access to it. The mesh part of a data mesh architecture brings it all together in a virtual way: you don't copy everything into centralized places, but you still must know what data you have, where it's located, and what system it's in. Because while Eric's description of the API is nice, in reality there are almost no APIs in data. People don't do APIs; you do a SQL query, or you go after an HDFS folder, and you live with access models that are far less clean than in the application world. Events and APIs are very common in applications; in data they often are not, and they don't work.
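The chain Uli describes, discovery feeding a catalog that then drives governance, can be sketched as a small registry. Everything here is illustrative: the dataset names, the role names, and the two functions are assumptions for the example, not a real governance product.

```python
# Hypothetical sketch: discovery registers datasets into a catalog,
# and governance answers "who has access to what" from that catalog.

catalog = {}  # dataset name -> metadata record

def register(name, system, location, allowed_roles):
    """Discovery step: record where a dataset lives and who may read it."""
    catalog[name] = {
        "system": system,            # e.g. "oracle", "s3", "hdfs"
        "location": location,
        "allowed_roles": set(allowed_roles),
    }

def can_access(name, role):
    """Governance step: the catalog is the single place this question is answered."""
    entry = catalog.get(name)
    return entry is not None and role in entry["allowed_roles"]

register("orders", "oracle", "prod-db/orders", ["analyst", "billing"])
register("telemetry", "s3", "s3://robot-telemetry/", ["data-science"])
```

Note that the compliance question ("are we at risk?") becomes answerable only because discovery populated the catalog first; governance without the catalog has nothing to enforce against.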
With data, an API can be like trying to run an AI algorithm through a straw, because suddenly you see, "oh, I need a petabyte of data before my AI algorithm is actually useful." You can't pull that through an API; you must either ship the AI algorithm to the node in the mesh and run it on top of whatever that infrastructure is, or copy the data, and that takes a while. There are many ways in which data is different from APIs or code. While conceptually what Eric says makes total sense, in reality it is far less clean and you just have to deal with it. An example: "oh, this is an Oracle database, so pick up your Oracle access method and figure it out." Or this one is a data lake on HDFS, or it's an S3 bucket; then you figure out the access method for each.

Data meshes aren't appropriate for all use cases.

Eric tended to agree with what Uli described: the data mesh architecture concept is nascent. It's an interesting paradigm that's not going to be appropriate for all use cases. As Uli said, you can't say "give me all time-series telemetry for the past three years for robots of this particular type" through an API; it will never scale. At some point you have to look at the use case and drop back down to platform-level integration: I need the connection string, I need the query, I need to write in the appropriate language of that platform, and then I need to retrieve the data in a scalable way. Those activities are challenging for their own reasons, and the more you have to do them, the harder it gets. So there is going to be a decision point: "If I stick an API layer in front of this, am I going to benefit? Am I just looking for a few transactions? Am I looking for small batch data? Am I looking for near real time?"
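The decision Eric describes, API access for small transactional reads versus platform-level integration for large ones, can be sketched as a simple planner. The row threshold, the system names, and the method strings are all made up for illustration.

```python
# Hypothetical sketch: choose between the node's API and platform-level
# integration based on how much data the use case needs.

API_ROW_LIMIT = 10_000  # illustrative cutoff where an API layer stops paying off

def plan_access(system: str, estimated_rows: int) -> str:
    if estimated_rows <= API_ROW_LIMIT:
        return "api"  # a few transactions or small batches: call the node's API
    # Large reads: speak the platform's own language instead.
    platform_methods = {
        "oracle": "connection string + SQL",
        "hdfs": "read the HDFS folder directly",
        "s3": "bulk-read the S3 bucket",
    }
    return platform_methods.get(system, "copy the data or ship the function")

print(plan_access("oracle", 500))     # small read: goes through the API
print(plan_access("s3", 5_000_000))   # large read: drops to the platform
```

In practice the cutoff would depend on latency, payload size, and freshness requirements rather than a single row count, but the shape of the decision is the same.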
Or do I have to dip back down into the more traditional route of data integration, where platforms talk to platforms, I have to speak the language, and I have to know the storage paradigm? Eric believes the concept is decent but aspirational at this point, and that architects should begin thinking about it.

Are data meshes aspirational?

David wanted to ask specifically why the word "aspirational" comes up around the term mesh. We have this notion that everything talks to everything else, which is not what he thinks we're discussing here, nor is he suggesting that's the case; but a mesh implies something pretty flat. Maybe you have meshes on top of meshes, but David often thinks of a mesh as a fairly flat structure: the net woven together by the connections between the things is your mesh. He is trying to reconcile that with the fact that data doesn't always play this way and isn't nearly as flat as we hope.

What should people be paying attention to in the world of data meshes?

Eric thinks that in those scenarios where API access to retrieve data from a node in a mesh is appropriate, we're going to have services that talk to services that talk to services, and as David said, they need to be woven together with some governance around them. We need to know what these microservices are talking to. The good news is we know how to do that through technologies like service meshes, which look after modern microservice architectures. Eric also believes that in the future a data mesh architecture will exhibit the same complexity we see in a typical microservice implementation.
Some of our readers might be saying, "well, is there really a difference; isn't it all just microservices?" To that I would say "yes," and there have to be data interchange elements associated with it: access, regulatory compliance, ETL, how the data gets there. You'll have data-centric microservices alongside transaction-centric or business-logic microservices.

David asked Uli what people should be thinking about when it comes to data meshes: what do we still have to learn, and what should people be paying attention to? Uli mentioned we have already talked about discovery and governance as a key part. The second part is thinking through how you access the data. Data virtualization is another term often associated with data mesh, and while that is a fine conceptual term, the speed of light and network throughput do matter, especially with data. You really want to make sure you are either doing data shipping, meaning you copy data up to the point of consumption, or function shipping, where you effectively ship the function to where the data is. Those are the two approaches for dealing with data across a worldwide distribution, or whatever the scope is. In addition, the access pattern is important: who has access, what does that access look like, what am I allowed to do? For example, for non-European viewers: Schrems II says that if you collected the data from a customer and you give it to somebody else for processing, you remain responsible for ensuring the third-party processor still meets the original consent terms under which the customer provided the data. This becomes important when your data mesh covers not just your internal enterprise but also brings in your supply chain or partners.
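The two approaches Uli names can be contrasted in a toy sketch. The "remote node" here is just a local dictionary, and the dataset size is invented; the point is only the difference in what crosses the network in each approach.

```python
# Hypothetical sketch: data shipping (copy the data to the consumer)
# versus function shipping (send the computation to where the data lives).

remote_node = {"telemetry": list(range(1_000_000))}  # data living at the node

def ship_data(node, dataset):
    """Data shipping: copy the whole dataset to the point of consumption."""
    return list(node[dataset])  # network cost grows with the size of the data

def ship_function(node, dataset, fn):
    """Function shipping: run fn where the data lives; return only the result."""
    return fn(node[dataset])    # network cost is just the (small) answer

# Copying a million rows just to sum them:
total_local = sum(ship_data(remote_node, "telemetry"))
# Shipping the sum to the data instead:
total_remote = sum_at_node = ship_function(remote_node, "telemetry", sum)
```

Both paths compute the same total, but the first moves a million rows to do it while the second moves a single number, which is why throughput and the speed of light push large-scale analytics toward function shipping.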
Those are the things that often drive Uli's conversations when he talks about data mesh architecture. To hear the whole conversation, you can watch the video below.