By Michael Ryan, Head of Architecture
Why read this article?
If you are perfectly happy with your company’s data analytics capabilities and you’re using data to effectively drive business decisions, you can stop reading now.
Unfortunately, few readers are in this category. In our experience, even clients with sophisticated data architectures sometimes struggle to utilize their data. The following examples may sound familiar:
Company A: Sophisticated, but reaching the limits of current technology
Company A has multiple domain-specific applications streaming data into an ETL pipeline and then into a data lake. A team of highly skilled data engineers analyzes the data and generates reports for various internal consumers. The capabilities of Company A’s data team and data lake infrastructure are very impressive, but their data analytics system is becoming the victim of its own success. Their useful reports simply spur demand for even more reports. On one hand, more domains want to pump data into the lake; on the other, the demand for more variety and more reports keeps growing. The data analytics system cannot scale; lead time for new reports grows, and report quality declines. Shadow data analytics takes hold: reports are created by data consumers who cannot wait for the data team. These consumers get their data from other sources, causing a security and governance nightmare. In this case, a data mesh can be a low-cost way of siphoning off the more routine reporting tasks from the data lake, reserving the data lake and the efforts of the highly skilled data engineers for the more analytical work they are meant for.
Company B: Data silos pose barrier to growth
Company B has a collection of domain-specific applications for HR, Legal, Sales, Shipping, and other departments. Each application has its own data store built on one of several available technologies: relational, NoSQL, and graph databases, flat files, and event streams, stored both on-premises and in a public cloud. While each domain-specific application adequately reports on its own data, Company B struggles to generate actionable insights by combining data across domains. For Company B, a data mesh is well suited to aggregating data products generated by each domain in ways that provide insight into the business as a whole.
Company C: Small but growing rapidly
Company C is not yet using on-demand reporting to drive business decisions. Reports might be generated by hand at a weekly or longer cadence. This approach can work in the early phases of growth but the demand for actionable data will soon grow faster than the company’s ability to generate it. In this case, building a simple data mesh is an inexpensive way to enter the realm of on-demand analytics and reporting. Company C can get the immediate benefits of a data mesh and then let it grow and adapt in response to future demands. Best of all, the initial investment in a data mesh paves the way for more substantial investments in traditional data architectures if that becomes desirable.
If any of these examples sound familiar, a data mesh may provide immediate value and a foundation for future growth for your company.
What is a Data Mesh?
A data mesh is not a product. You cannot download data mesh software. While there is a common set of tooling used to build data meshes, there is no one-click install.
A data mesh is an architectural pattern made popular by Zhamak Dehghani of Thoughtworks. Despite all the attention and commentary around data mesh architecture, it can be hard to understand exactly what a data mesh is and whether it is something that might benefit your company.
Simply put, the data mesh pattern operates much like microservices do; it can be similar to a network of read-only microservices focused on data. While the pattern is often described as an almost revolutionary approach to using self-service platforms to embrace the distributed nature of data, data meshes use the same industry-standard tools and techniques that have been used at scale by microservices for almost a decade.
We view data mesh as an evolutionary application of widely used techniques and technologies, not a risky revolutionary re-write of current data analytics architectures. A small data mesh can be up and running in a short period of time. Maintenance is often limited to adding new data products or altering others if the underlying data structures change in some way. In either case, risk is reduced because changes are made by domain experts familiar with the data.
As is often the case with microservice architectures, each node uses its preferred set of technologies and techniques to meet its specific requirements. Also like microservices, the technologies and techniques used by each node are often subject to some degree of engineering or architectural governance. The data mesh architecture does not mandate means of communication or API standards, but nodes often employ RESTful APIs documented through tools like Swagger as well as various streaming techniques.
While this example is simple, data mesh topologies can quickly become complex as more varied types of data are added. We recommend starting small and letting your data mesh evolve with demand. The actual topology of mesh nodes will depend on your company’s unique needs and its layout of data sources. Obviously, this topology will change over time; the data mesh architecture allows for that.
Data Products and Data Nodes
What is a Data Product? A data product is defined as the code, data and infrastructure required to deliver, in a self-service manner, a domain-specific data structure that produces business value to a consumer. Data products can be relatively simple (for example, monthly sales by office) but are often complex combinations of smaller, simpler data products (for example, quarterly sales trends by office correlated to regional demographics and sales incentives).
Data products are used for both analytical and operational purposes.
What is a Data Node? Multiple data products can be combined in a single data mesh node. A data mesh node can be thought of as a read-only microservice.
Data products are a central concept in data mesh architecture. Each node in a data mesh exposes a collection of data products to the end user as well as to other data products in the mesh. In this way, each data product can become a building block for larger and more useful data products. When these building blocks are assembled into arbitrarily complex structures, one of the most useful aspects of a data mesh is born: customized cross-domain data products.
Highly customized data capabilities arise when one Data Product is able to incorporate other data products into itself. In the above illustration, the Finance Node retrieves data from the Sales Node, combines that data with its own, and returns the resulting data product to the data dashboard.
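The composition described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the SalesNode and FinanceNode classes, their method names, and the sample figures are all hypothetical, and in-process method calls stand in for the REST or streaming calls a real mesh would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesNode:
    """Hypothetical read-only node that owns sales data."""
    monthly_sales: dict  # office -> sales total (illustrative numbers)

    def sales_by_office(self) -> dict:
        # Return a copy so consumers cannot mutate the node's own data.
        return dict(self.monthly_sales)


@dataclass(frozen=True)
class FinanceNode:
    """Hypothetical read-only node that owns cost data and consumes the Sales node."""
    monthly_costs: dict  # office -> cost total (illustrative numbers)
    sales: SalesNode

    def margin_by_office(self) -> dict:
        # Cross-domain data product: sales (from the Sales node) minus costs (own data).
        sales = self.sales.sales_by_office()
        return {
            office: sales[office] - cost
            for office, cost in self.monthly_costs.items()
            if office in sales
        }


sales_node = SalesNode({"Denver": 120_000, "Boston": 95_000})
finance_node = FinanceNode({"Denver": 80_000, "Boston": 70_000}, sales_node)
margins = finance_node.margin_by_office()
```

The Finance node never reaches into the Sales node's storage; it only consumes the published data product, which is what keeps the two nodes loosely coupled.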
Advantages of a Data Mesh
Data Meshes Scale Better than Traditional Data Pipelines or Data Lakes/Warehouses.
The anti-patterns described for the companies above are typical. They are not the result of bad engineering or insufficient budgets; they are not even a drawback of traditional data architectures. We believe these anti-patterns arise from an attempt to utilize traditional data architectures to meet all the varied demands for data and analytics that come from even small businesses.
The following sequence of diagrams illustrates a common experience with a data lake architecture.
There is often success in the beginning phases of a data lake architecture. The number of data providers and data consumers is low and the specialized data team can keep up with the supply of data and the demand for reports and analytics.
In the second phase, success stimulates more demand for data insights. The number of publishers and subscribers grows, and workload on the data team increases. The architectural and organizational model does not scale; delays in generating analytical insights increase.
In the final phase, there is a growing backlog for adding new data sources and generating new reporting insights. Long waits to get needed data often lead to shadow data analytics, where teams turn to unofficial sources of data. This can lead to severe security and governance issues.
Data Mesh Encourages Domain Ownership of Data
A more traditional approach to data analytics might attempt to combine data from multiple sources into a single monolithic data lake or data warehouse. In a data mesh, domain experts own the data and package it in ways that maintain the data’s integrity and fitness for purpose.
A common drawback to traditional ETL pipelines and data analytics architectures is that the data engineers are not familiar with the data they process; they are not domain experts and do not have ownership of data. When a traditional ETL pipeline and its data lake are new, this is not as much of an issue because the pipelines are not processing that much data yet. Data engineers tend to be highly intelligent and can easily learn about the small amount of data they process. But after the wave of initial successes, when more business domains and more and more data are pushed through the pipeline, problems can begin. The volume of data challenges even the most talented data engineering team. A common symptom is longer lead times to generate even simple reports, coupled with lower quality of the data included in those reports. The distributed nature of the data mesh helps eliminate this problem. In a data mesh, domain teams own their own data and control how it is exposed to the rest of the system.
Domain-driven design can play an important role in defining corporate domains, and well-defined domains are critical to implementing an effective data mesh. See our domain-driven design article for more information.
Data Mesh Limits Coupling of Resources and Promotes Maximum Flexibility
As with microservices, tight coupling is the enemy of a highly functional data mesh. The Independently Deployable Rule as applied to microservices also applies to data meshes; every node on a mesh should be deployable without making corresponding changes to other nodes in the mesh. In the example above, a deployment to the Finance node should always be made by itself, independently, without a corresponding deployment to the Sales node, and vice versa.
Coordinating deployments of individual data mesh nodes can be difficult. Adhering to the independently deployable rule in a data mesh often implies some versioning scheme applied to data products. One important aspect of each domain implementing its own data products is that each domain is responsible for its own pipelines. Distributed pipelines tend to eliminate the tight coupling of ingestion, storage, transformation and consumption of data typical of traditional data architectures like data lakes.
The design influence of microservices on a data mesh is apparent in its flexible nature. A data mesh can expand and contract to match your data topology as it grows in some areas and shrinks in others. Different technologies like streaming can be used where needed, and nodes can scale horizontally to meet demand.
Data Mesh Facilitates Creation of Data Products through Domain Ownership of Data
In data mesh parlance, a data product is a node or portion of a node on the data mesh. It is the code and infrastructure required to deliver data to a consumer as well as the data itself. This is the smallest deliverable unit a data mesh can provide. Because the data product is created by domain experts who own the data, the quality of the product tends to be much higher than data provided by other architectures.
Data products must be easily discoverable to maintain the usefulness of the mesh. Most implementations will use some sort of data catalog or registry of data products. To make the catalog useful, information such as description, owners, component data products, and lineage is included and often updated as part of a build pipeline.
The docs-as-code pattern is an excellent practice to employ when building a data mesh. Quick and accurate discoverability is key: if a data product’s metadata is allowed to become out-of-date, the usefulness of the data mesh will dwindle.
Meet Business Needs by Combining Data Products in Arbitrarily Complex Ways
Distributed data mesh nodes can call one another just like microservices call one another, and together they can generate and collate data products from multiple sources to deliver on-demand actionable reporting. A data mesh can even be used to increase observability of a company’s activities — a simple use case would be ordering more inventory depending on an analysis of sales patterns over the last few days. Though this is not a typical use of a mesh, it shows the flexibility of the solution.
The diagram below illustrates how data products in various domains can call other data products and integrate the responses into their own offerings. This capability not only promotes code reuse but also allows arbitrarily complex data structures to be composed with relative ease.
Scale Each Data Node as Required
Some nodes on a data mesh may provide important data products but are accessed infrequently. Other nodes may provide access to event streams and be accessed constantly. Each data node needs to scale independently in the same way any microservice in a microservice architecture needs to scale independently of other microservices. Independent scalability of application components is a core tenet of cloud-native applications (see https://12factor.net/).
Container management tools like Kubernetes, which automatically create, delete and scale data nodes, fit well in a data mesh architecture. A Kubernetes pod can house a data product. The Kubernetes horizontal pod autoscaler (HPA) monitors resource demand and automatically adjusts the number of pod replicas, so each data node scales independently. By default, the horizontal pod autoscaler evaluates every 15 seconds whether scaling up or down is necessary. When changes are required, the number of replicas is increased or decreased accordingly.
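To make the scaling decision concrete, the sketch below reproduces a simplified form of the rule documented for the horizontal pod autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The function name, the default replica bounds, and the sample numbers are assumptions for illustration; the real controller also accounts for pod readiness, missing metrics, and stabilization windows, which are omitted here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA-style calculation (illustrative, not the controller code)."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds for this data node.
    return max(min_replicas, min(max_replicas, desired))


# A busy event-stream node: 2 replicas at 90% CPU against a 60% target.
busy_node = desired_replicas(2, current_metric=90, target_metric=60)
# A rarely used reporting node: 4 replicas at 30% CPU against a 60% target.
quiet_node = desired_replicas(4, current_metric=30, target_metric=60)
```

Because each data node carries its own target and bounds, a heavily used node can grow while a quiet one shrinks, without either affecting the other.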
Data Mesh Governance is Implemented at the Data Product Level
In a data mesh, governance is federated within each data product. It’s best to have the most knowledgeable team responsible for implementing governance; this reduces centralized bottlenecks like data review meetings. Although some bottlenecks can be avoided, be aware that in most cases federated governance places an extra demand on the domain team.
Data Mesh is Built Using Industry-Standard Tooling
Data meshes are built using microservice technologies, patterns, and DevOps pipelines that have been well known and heavily used in industry for nearly ten years. A data mesh employs tools and principles commonly used in microservice architectures such as containers, Kubernetes, service meshes (Istio, Consul, Linkerd), and zero-trust security measures such as continuous verification, identity-based segmentation, the principle of least privilege, and automated context collection and response.
Container management tools like Kubernetes or service meshes like Istio have tens of thousands of successful installations. RESTful APIs and streaming have been in use for over twenty years. The risk profile of adopting a data mesh is comparable to adopting microservices. It is not new technology; it is a new way of approaching data.
Direct Communication between Domain Experts and Data Consumers
Once parts of a business enjoy the benefit of data analytics, other parts will want to join in. Demand for new and more involved reports and analytics grows. Just as a small team of data engineers cannot understand all the sources of data in a company, they also cannot completely understand how the data is used. The demand for new and varied forms of data expands beyond the data team’s capability to understand the nuances of the data they provide to consumers. Errors in understanding data lead to errors in reports, delays in generating reports, or both.
A data mesh architecture helps avoid this problem by forcing domain experts to speak directly with consumers. It does this by having those who understand the data best be the owners of the data and be responsible for composing it into data products for others to consume in a self-service way. Unlike a small team of specialized data engineers struggling to keep up with ever-increasing supply and demand for data, domain experts are best suited to understand consumers’ needs and build the data products that meet them.
Start Small and Grow Your Data Mesh According to Demand
A company can begin with a small data mesh of just a few nodes and see value almost immediately. Though a data mesh can be thought of as microservices for data, the reality is a little more complex than that. With the principles described in this article, you’ll be able to start your mesh small and let internal demand drive its growth.
Common Challenges When Adopting a Data Mesh
Your company once prided itself as a place where data drives innovation, but as the company grows the reality seems different. Many symptoms of data-related problems loom on the horizon:
- organizational silos and lack of data sharing
- no shared understanding of what data means outside the context of its business domain
- incompatible technologies prevent gaining actionable insights
- data is increasingly difficult to push through ETL pipelines
- a growing demand for ad hoc queries and shadow data analytics
Limited budget was used up in previous failed experiments with expensive data technologies. Any new solution will need to start small, prove its worth and scale as the company grows. The best solution for these problems would be an architecture that emphasizes democratization of data at the business domain level while accommodating different technologies and data analytic approaches.
A data mesh architecture may be the best hope for solving your data problems. A data mesh can start small and grow as needed, providing a budget-friendly option for proving value and then growing to meet your company’s needs.
A data mesh is a distributed approach to data management that views different datasets as domain-oriented “data products”. Each set of domain data products is managed by product owners and engineers with the best knowledge of the domain. The idea is to employ a distributed level of data ownership and responsibility sometimes lacking in centralized, monolithic architectures like data lakes. In many ways a data mesh architecture is similar to the microservice architectures commonly used throughout the industry.
Each business domain in a data mesh implements its own data products and is responsible for its own pipelines. The focus on domain-specific data products tends to avoid the tight coupling of ingestion, storage, transformation and consumption of data typical in traditional data architectures like data lakes.
Your company is eager to start enjoying the benefits of a data mesh but wants to avoid beginner mistakes. Below are ten common problems when moving to a data mesh architecture and how to avoid them.
Follow DATSIS Principles
DATSIS stands for Discoverable, Addressable, Trustworthy, Self-describing, Interoperable and Secure. Failure to implement any part of DATSIS could doom your data mesh.
- Discoverable: consumers are able to research and identify data products produced by different domains. This is typically done with a centralized tool like a data catalog.
- Addressable: like microservices, data products are accessible via a unique address and a standard protocol (REST, AMQP, possibly SQL).
- Trustworthy: domain owners provide high-quality data products that are useful and accurate.
- Self-describing: data product metadata provides enough information that consumers do not need to query domain experts.
- Interoperable: data products must be consumable by other data products.
- Secure: access to data products is automatically regulated through access policies and security standards. This security is built into each data product.
Automatically Update Data Catalogs with Every Release
Data product discoverability is part of DATSIS and a key element of data meshes. Most data meshes employ a data catalog or other ad hoc mechanisms to make their data products discoverable. A data catalog can be used as an inventory of data products in a data mesh, most often using metadata to help organizations support data discovery and governance.
Any mechanism used for discoverability must be kept up to date to protect the usefulness of the data mesh. Out-of-date documentation is often more damaging than no documentation.
For this reason, we recommend data meshes employ a docs-as-code scheme, where updating the data catalog is part of the code review checklist for every pull request. With each merged pull request, updated metadata enters the DevOps pipeline and automatically updates the data catalog. Depending on the data catalog, it may be updated directly through its API, by pulling JSON files from an S3 bucket, or by other methods.
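As a rough sketch of the docs-as-code step, the Python below validates a data product's metadata and serializes it to JSON, ready for a pipeline stage to hand to the catalog. The CatalogEntry fields mirror the catalog information mentioned earlier in this article (description, owners, component data products, lineage); the class, the function name, and the sample values are hypothetical, and the publishing call itself is left out because it depends entirely on the catalog product in use.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CatalogEntry:
    """Hypothetical metadata record kept in the same repository as the data product."""
    name: str
    version: str
    description: str
    owners: list
    component_products: list = field(default_factory=list)
    lineage: list = field(default_factory=list)


def build_catalog_payload(entry: CatalogEntry) -> str:
    """Fail the build if required metadata is missing; otherwise return JSON."""
    required = ("name", "version", "description", "owners")
    missing = [name for name in required if not getattr(entry, name)]
    if missing:
        raise ValueError(f"catalog metadata missing: {', '.join(missing)}")
    return json.dumps(asdict(entry), indent=2, sort_keys=True)


entry = CatalogEntry(
    name="finance.margin-by-office",
    version="1.2.0",
    description="Monthly margin per office, derived from sales and cost data.",
    owners=["finance-data@example.com"],
    component_products=["sales.monthly-by-office"],
    lineage=["sales ledger", "cost ledger"],
)
payload = build_catalog_payload(entry)
```

Raising an error on incomplete metadata turns stale or missing documentation into a failed build rather than a silent gap in the catalog.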
Invest in Automated Testing
A data mesh is by definition a decentralized collection of data. An important issue is how best to ensure consistent quality across data products owned by different teams that may not even be aware of one another.
Following these principles helps:
- Every domain team is responsible for the quality of their own data. The type of testing involved depends on the nature of that data and is decided upon by the team.
- Take advantage of the fact that the data mesh is read-only. This means that tests can be run not only against mock data but often repeatedly against live data as well. Take advantage of time-based reporting too: historical data is immutable, so testing against it is easy and detects problems such as changed data structures.
- Run data quality tests against mock and live data. These tests can be run on developer laptops, in CI/CD pipelines, or against live data accessed through specific data products or an orchestration layer. Typical data quality tests verify that a field contains a value between 0 and 60, that an alphanumeric value matches a specific format, or that the start date of a project is at or before the end date. Test-driven design is another approach that can be used successfully in a data mesh.
- Include business domain subject matter experts (SMEs) when designing your tests.
- Include data consumers when designing your tests. Data meshes should be driven by data consumers and it is important to make sure your data products meet their needs. Otherwise, why build the mesh in the first place?
- Use automated test frameworks that specialize in API testing. We recommend the Karate framework (https://github.com/intuit/karate). Other useful tools are:
- SoapUI: https://www.soapui.org/
- Postman: https://www.getpostman.com/
- Apigee: https://cloud.google.com/apigee/
- Rest-Assured: http://rest-assured.io/
- Swagger: https://swagger.io/
- Fiddler: https://www.telerik.com/fiddler
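The data quality checks mentioned in the list above (a value between 0 and 60, an alphanumeric value in a specific format, a start date at or before the end date) can be expressed as plain functions that run equally well on a laptop or in a CI/CD pipeline. The sketch below is one possible shape; the field names and the project code format are invented for illustration.

```python
import re
from datetime import date

# Hypothetical format: three capital letters, a dash, four digits (e.g. FIN-0042).
PROJECT_CODE = re.compile(r"[A-Z]{3}-\d{4}")


def check_record(record: dict) -> list:
    """Return a list of data quality problems found in one record."""
    errors = []
    if not 0 <= record["minutes"] <= 60:
        errors.append("minutes must be between 0 and 60")
    if not PROJECT_CODE.fullmatch(record["project_code"]):
        errors.append("project_code must look like ABC-1234")
    if record["start_date"] > record["end_date"]:
        errors.append("start_date must be at or before end_date")
    return errors


good = {
    "minutes": 45,
    "project_code": "FIN-0042",
    "start_date": date(2023, 1, 1),
    "end_date": date(2023, 3, 31),
}
bad = {
    "minutes": 75,
    "project_code": "fin42",
    "start_date": date(2023, 4, 1),
    "end_date": date(2023, 3, 31),
}
good_errors = check_record(good)
bad_errors = check_record(bad)
```

Because the data mesh is read-only, the same function can be pointed at immutable historical records from a live data product without any risk of changing them.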
To someone with a hammer, everything looks like a nail
When people become very proficient with one set of tools, they tend to use those tools even in situations where they are not appropriate. Many companies struggle with scaling data analytics because they try to use their data infrastructure to solve every need for information. An architecture where ETL pipelines pump data into a data lake is in many ways monolithic and has a finite capacity to deliver value. It simply does not scale well. A data lake excels at ad hoc queries and computationally intensive operations, but the centralized nature of the lake can make it hard to include pipelines from every domain in the company.
On the other hand, the decentralized nature of a data mesh allows it to include data from an almost arbitrary number of domains. However, the drawback to the data mesh is that very computationally intensive operations can be time consuming. Use the right architecture to solve the right problems.
Politically, you also need to recognize the value of your data engineers. They play an important role. If they don’t feel valuable, or if they feel their jobs are threatened by a data mesh, they will act against it, even though data mesh and data lake architectures can be complementary.
Data Mesh Requires Extra Work by Domain Teams
In a data mesh, domain teams maintain ownership of their data and create data products that expose that data to the rest of the company. If an engineering team handles the data mesh work, their capacity for other engineering work will decrease — at least at the beginning.
However, the alternative is often a tightly coupled data pipeline. Such a system is inherently fragile, and changes to data at the application level can result in erroneous data being fed into data lakes and subsequent defects in reports produced by data engineers. Troubleshooting these defects is time consuming and frustrating. When the source of the defect is found to be something like a change in a field type or in the way a particular field is used, there can be a lot of friction between the engineering team and data team.
Tight Coupling Between Data Products
The design influence of microservices on a data mesh is apparent in its flexible nature. A data mesh can expand and contract to match your data topology as it grows in some areas and shrinks in others. Different technologies like streaming can be used where needed and data products can scale up and down to meet demand.
As with microservices, tight coupling is the enemy of a highly functional data mesh. The independently deployable rule as applied to microservices also applies to data meshes; every data product on a mesh should be deployable at any time without making corresponding changes to other data products in the mesh. Adhering to the independently deployable rule in a data mesh often implies some versioning scheme applied to data products.
Shadow Data Analytics and Mandates
Many companies suffer when their current data architecture and data team have reached their scale limits. Growing delays when adding new ETL pipelines, generating new reports, and running ad hoc queries lead consumers to find their own sources of data. This is known as “shadow data analytics”.
Shadow data analytics is understandable. People need data to do their jobs. But the practice of shadow data analytics often bypasses any sort of data governance or data security. It also tends to result in erroneous use of data. It is obviously counterproductive.
When faced with shadow data analytics, one approach is to mandate that all staff use the sanctioned data analytics system, or else. Mandates might be necessary for a short period of time, but they exact a toll in employee autonomy and initiative. A better approach is to build a comprehensive data architecture (often including data lakes) that is accurate and responsive to consumers’ needs so they won’t be tempted by shadow analytics at all. Such a system has the following characteristics:
- Follows DATSIS principles and the independently deployable rule as described above
- Enforces SLAs for minimum performance requirements for all data products
- Employs a transparent process to request, schedule and implement new data products
Accurately Evolve Data Products
Data evolves as a company evolves, often in unpredictable ways. Changes often fall into two types:
- Changes in the domain structure of your company
- Changes in the structure and nature of the data itself within each domain
Data meshes should be built to adapt to these changes. Adding domains to a data mesh is simple: add data products, ensure they are discoverable in a data catalog or similar product, and build dashboards or other types of display as necessary.
Removing data products occurs less frequently and is a little more difficult. It is usually done manually. If another data product consumed the removed data product then it needs to be examined. Does it still make sense to expose the consuming data product to users? How are consumers of that data product notified about changes or complete removals of data products? The answers will be different for each company, and must be considered carefully.
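One way to answer the question of which data products need to be examined is to walk the dependency information already held in the data catalog. The sketch below assumes a simple mapping from each data product to the products it consumes; the mapping, the product names, and the function name are hypothetical.

```python
def affected_by_removal(consumes: dict, removed: str) -> set:
    """Return every data product that directly or indirectly consumes `removed`."""
    affected = set()
    frontier = [removed]
    while frontier:
        current = frontier.pop()
        for product, dependencies in consumes.items():
            # A product is affected if it consumes something already affected.
            if current in dependencies and product not in affected:
                affected.add(product)
                frontier.append(product)
    return affected


# Hypothetical catalog extract: data product -> data products it consumes.
consumes = {
    "finance.margin-by-office": ["sales.monthly-by-office"],
    "exec.quarterly-dashboard": ["finance.margin-by-office"],
    "hr.headcount": [],
}
to_review = affected_by_removal(consumes, "sales.monthly-by-office")
```

The resulting set is only a starting point: each affected product still needs a human decision about whether to retire it, rework it, or notify its consumers.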
Accurately Version Data Products
Data products will need to be versioned as data changes at your company, and users of each data product (including maintainers of dashboards) must be notified about changes, both breaking and non-breaking. Consumed data products need to be managed as versioned dependencies, like resources in Helm charts or Maven artifacts in Artifactory.
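The article does not prescribe a versioning scheme. As one possibility, the sketch below assumes semantic versioning (MAJOR.MINOR.PATCH), where a change in the major number signals a breaking change; the function names are illustrative.

```python
def parse_version(version: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_breaking(old: str, new: str) -> bool:
    """Under the assumed scheme, a higher major number marks a breaking change."""
    return parse_version(new)[0] > parse_version(old)[0]


def is_compatible(required: str, available: str) -> bool:
    """A consumer built against `required` can use `available` if the major
    number matches and the available version is not older."""
    needed, offered = parse_version(required), parse_version(available)
    return offered[0] == needed[0] and offered >= needed


minor_upgrade_ok = is_compatible("1.2.0", "1.4.1")
major_upgrade_ok = is_compatible("1.2.0", "2.0.0")
needs_notice = is_breaking("1.4.1", "2.0.0")
```

A check like this can run in a consuming data product's pipeline, so a breaking release of an upstream product is caught before deployment rather than in a dashboard.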
Sync vs Async vs Pre-assembled Results
If a data mesh uses synchronous REST calls to package the output from a few data products, chances are the performance will be acceptable. But if the data mesh is used for more in-depth analytics combining a larger number of data products (such as the analysis typically done by a data lake), it is easy to see how synchronous communication might become a performance issue.
So what are the options?
One option is a solution similar to the Command Query Responsibility Segregation (CQRS) pattern: pre-build and cache data results on a regular cadence. The cached results can then be combined into a more complex data structure when the data product is run. This is an effective approach unless you require up-to-the-moment results.
Another approach is to break apart the operation into separate pieces that can be run asynchronously using an Asynchronous Request-Reply pattern. Using this pattern implies a few things:
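A minimal sketch of the pre-build-and-cache idea follows. It is not a full CQRS implementation: it simply keeps the last built result and rebuilds it once it is older than a chosen age. The class name, the five-minute cadence, and the injectable clock (used here to keep the example deterministic) are assumptions for illustration.

```python
import time


class CachedDataProduct:
    """Serve a pre-built result and rebuild it once it is older than max_age_seconds."""

    def __init__(self, build, max_age_seconds: float, clock=time.monotonic):
        self._build = build              # expensive function that assembles the result
        self._max_age = max_age_seconds
        self._clock = clock
        self._value = None
        self._built_at = None

    def refresh(self) -> None:
        # Could also be triggered by a scheduler on a regular cadence.
        self._value = self._build()
        self._built_at = self._clock()

    def get(self):
        stale = (self._built_at is None
                 or self._clock() - self._built_at > self._max_age)
        if stale:
            self.refresh()
        return self._value


build_count = 0
fake_now = [0.0]  # stand-in clock so the example does not need to sleep


def build_sales_summary() -> dict:
    global build_count
    build_count += 1
    return {"build": build_count}


summary = CachedDataProduct(build_sales_summary, max_age_seconds=300,
                            clock=lambda: fake_now[0])
first = summary.get()    # builds the result
second = summary.get()   # served from cache
fake_now[0] = 301.0
third = summary.get()    # cache is stale, so the result is rebuilt
```

Callers trade freshness for speed: within the chosen window every request is answered from the cached result, at the cost of data that may be a few minutes old.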
- There are no ordering dependencies between the datasets you construct. In other words, if you concurrently build five datasets, the content of Dataset #2 cannot be dependent on the content of Dataset #1.
- Most likely the caller will not receive an immediate response to their request. Instead, some sort of polling technique returns successfully only when all datasets are built and combined. If the dataset is very large, it may be stored somewhere and a link to the dataset provided to the user. This implies appropriate infrastructure and security is in place.
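The two points above can be illustrated with a small in-process sketch of the asynchronous request-reply flow: the caller submits a request, immediately receives a job identifier, and polls until every dataset is built. The AsyncReportService class and the dataset builders are hypothetical; in an HTTP setting the same flow is commonly expressed as an accepted response that points to a status endpoint.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class AsyncReportService:
    """Builds independent datasets concurrently and lets the caller poll for completion."""

    def __init__(self, max_workers: int = 4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def submit(self, builders: dict) -> str:
        # Datasets must not depend on one another, since they run concurrently.
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {
            name: self._pool.submit(build) for name, build in builders.items()
        }
        return job_id  # returned immediately, before any dataset is ready

    def status(self, job_id: str) -> str:
        futures = self._jobs[job_id].values()
        return "done" if all(f.done() for f in futures) else "pending"

    def result(self, job_id: str) -> dict:
        # Combine the finished datasets into one response.
        return {name: f.result() for name, f in self._jobs[job_id].items()}


service = AsyncReportService()
job_id = service.submit({
    "sales": lambda: [120_000, 95_000],
    "costs": lambda: [80_000, 70_000],
})
while service.status(job_id) != "done":   # simple polling loop
    time.sleep(0.01)
combined = service.result(job_id)
```

For very large results, `result` would return a link to stored output instead of the data itself, which is where the infrastructure and security considerations mentioned above come in.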
A big advantage of the data mesh architecture is that it can start small and grow as demand grows. Early mistakes tend to be small mistakes and teams learn through experience how to manage increased demand for data while avoiding the political and technical pitfalls inherent in providing actionable data to business users. Data lakes and meshes are excellent solutions for different problems; it’s important to understand which is best for your employees and their data needs.
Use the Right Tool for the Job
Data lakes or data warehouse architecture could be right for the job. They are just not the right tool for every job — just as a data mesh is not the right tool for every job. In fact, it’s easy to see scenarios where a data mesh and data lake coexist and make each other stronger. Data lakes or warehouses require an investment of hundreds of thousands of dollars and hiring experienced data engineers before seeing any return on investment, but they do have a place in today’s data architectures.
A data mesh can be an excellent tool for on-demand reporting, analytics, and streaming. However, performance can be limited by slow queries in any node. Building the infrastructure to run ad-hoc queries against a data mesh would be difficult, although new techniques show promising results. Data lakes shine when they house large data sets that can be queried in computationally intensive ways. If this matches your needs, a data lake architecture may be right for you.
The following table provides a useful comparison between a Data Lake and Data Mesh architecture:
If you’d like to discuss the concepts included in this article, please reach out to the author at email@example.com. Michael Ryan is Managing Principal Consultant and Head of Architecture at Kenzan.
Author linkedin page: https://www.linkedin.com/in/michael-james-ryan/