Monday, April 15, 2024

4. The data management ecosystem

There are only a few components to the modern data management ecosystem, which is displayed in the diagram below. We've plotted these systems on two axes, one showing the latency of the data managed by the system and the other showing the complexity, expressed in terms of the number of sources a particular kind of system organizes. Each of these systems organizes data in a way that's particularly suited to a class of tasks or workflows. Most of these systems don't create data; creation generally happens in the bottom left-hand corner, in the Services/Applications space. Instead, they rearrange and organize data. Application and service databases serve as the suppliers of raw materials for the information supply chain, and they're usually designated in this work as the "source systems." Some systems manufacture data as a byproduct of their basic operation, such as Master Data Management systems that function as curation tools for a small percentage of the data they manage. But for the most part the job of the data management system, when it isn't a source system, is synthesis and organization.

Broadly speaking, data management is semantic management. A system's semantics is its "meaning," or more precisely "the set of things the system is true of." Every organization has a set of things it works with: customers, vendors, products, services, orders and returns, readings from devices, whatever. Each of those kinds of things, and their relationships to the other things the organization cares about, is managed by particular systems. We build applications to collect data about those things and their relationships. Typically we collect data about specific instances of these things - the actual customers or products or devices - across a number of different applications, some elegant and well-designed and most not so much. The data collected by those source systems then gets rearranged, consolidated, refined and, most importantly, used in the other systems in the data management ecosystem. When data management practitioners manage an organization's data, then, they're trying to get the organization's semantics right by organizing records into new kinds of arrangements. They're trying to corral and consolidate the various instances of things and their relationships so the organization can understand how well it's impacting the things it cares about.

I could have titled this book Practical Semantic Management, but outside of a pretty narrow technical niche nobody understands what "semantics" means. And "data" is more appropriate anyway, because each of the applications manufacturing raw material for the data management practitioner is, in practical terms, simply producing data. If we arrive at any eventual conception of an organization's semantics, it's because we build it out of the data - specifically the records - we get from those source systems.

The components we'll define in this section have specific roles to play in the construction of a consistent semantics for an organization. They don't all play the same role, although the roles overlap here and there. But there are two main functions each of these pieces provides.

First, they're intended to solve a specific set of use-cases. Identifying those use-cases is a critical step to understanding why you'd choose to build one system-type over another. For example, people answering calls from customers don't need to see every order the customer has ever placed. Instead, they need to know the status of the customer's currently open orders. Having additional data is often perceived as a nice-to-have, but the primary use-case for customer service people is "all the current relevant data required to answer the customer's question." For that use-case we don't want access to every single transaction the entire organization has ever generated; indeed, that access may be detrimental to our response time. We need only the transactions that are currently relevant to the people implementing the process in question.
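Here's a minimal sketch of that first use-case in Python, using pandas and hypothetical table and column names; a real ODS would sit behind a database, but the shape of the access is the point: touch only the handful of currently relevant rows.

```python
import pandas as pd

# Hypothetical ODS extract: one current-state row per order, not full history.
orders = pd.DataFrame({
    "customer_id": [101, 101, 102, 101],
    "order_id":    [5001, 5002, 5003, 5004],
    "status":      ["shipped", "open", "open", "backordered"],
    "updated_at":  pd.to_datetime(["2024-04-01", "2024-04-10",
                                   "2024-04-11", "2024-04-12"]),
})

def open_orders_for(customer_id: int) -> pd.DataFrame:
    """Return only the currently relevant rows for a CSR screen."""
    current = orders[orders["customer_id"] == customer_id]
    return current[current["status"].isin(["open", "backordered"])]

print(open_orders_for(101))  # two rows, not the customer's entire history
```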

Conversely, if I want to know what products to recommend to a customer who's got money to spend, I don't particularly care what transactions are currently unfulfilled. Perhaps to the extent that I don't want to recommend a product that's backordered, if that's a possibility, but that's a proviso and not the primary concern of a recommender system. Instead we need to compare the customer with money burning a hole in their pocket to similar customers, using some grid of comparison, looking across what all of those other customers bought, and then make appropriate recommendations. Current data is less important than granularity and coverage for this second use-case.
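And a sketch of the second use-case, again with hypothetical data: a naive similarity-based recommender that cares about history and coverage, not currency. The matrix, the cosine rule and the names here are all stand-ins for whatever "grid of comparison" you actually use.

```python
import numpy as np

# Hypothetical purchase matrix: rows are customers, columns are products,
# cells are lifetime purchase counts. Whether an order is open is irrelevant.
purchases = np.array([
    [3, 0, 1, 0],   # customer 0
    [2, 1, 0, 0],   # customer 1
    [0, 0, 4, 2],   # customer 2
])

def recommend(customer: int, top_n: int = 2) -> np.ndarray:
    # Cosine similarity between this customer and every other customer.
    norms = np.linalg.norm(purchases, axis=1) * np.linalg.norm(purchases[customer])
    sims = purchases @ purchases[customer] / np.where(norms == 0, 1, norms)
    sims[customer] = 0.0  # don't compare the customer with themselves
    # Weight every other customer's history by similarity...
    scores = sims @ purchases
    scores[purchases[customer] > 0] = -1  # ...and skip products already owned
    return np.argsort(scores)[::-1][:top_n]

print(recommend(0))  # product indices to recommend to customer 0
```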

In both cases my concept of customer is the same, and ideally derived from the same common base of data. In fact all of my customer data will originate from the source applications in the bottom left-hand corner of the ecosystem diagram. But the relevant attributes in the first use-case, used to ensure our co-workers aren't left literally holding the phone, are very different from those in the second. The two systems might not share any customer attributes except the keys identifying an individual customer. When we sit down to design these two different systems, we draw from a common concept of customer. But the things and relationships we care about differ between the two systems, and so we create a subset of the customer semantics to ensure we've got relevant data to solve our problem. But just as you wouldn't expect hummingbirds to play the role of wolves in an ecosystem, should you be missing the latter and have a surplus of the former, you can't expect the ODS to function as an Analytics system just because you have a big ODS and no Analytics systems.

Second, the development of one piece is critical to the development of others. As we’ll see, in the course of building and maintaining any one of these pieces an organization will, whether it's conscious of it or not, spin out assets that are foundational to the others. There are structural features of an ODS that contribute to the development of Analytics systems, and elements of the Master Data Management system are dependent on the ODS or the Analytics system. An organization can of course build any single piece on a standalone basis, but in the process of doing that construction they’ll find themselves replicating the functions of one or more of the other pieces. This could present us with a chicken-and-egg problem, if we were being silly and reductive. But in practice what this means is that, unlike many situations where you climb the ladder and then must kick it away when you get where you’re going, we get to keep the various ladders we build, generalize their function, and start to fill in the remainder of the ecosystem as we need. Much of what we build to solve a specific set of problems will also serve as scaffolding for the next set of solutions.

Three final points before we get to the pieces. Let’s look at a slightly upgraded diagram.

First, each type of system can, in addition to its meta-use-case, be classified on the axes "Latency" and "Complexity." For some systems, position on the plot is what makes them a solution. In other cases the system's position on the plot is a function of resource-use. Analytics systems are composed of enormous amounts of highly granular data, for example, so it's only possible to load an Analytics structure in real time when the semantics of the structure and its inputs are tightly defined. There are some such systems, and they're extraordinarily useful. But you'll find in the wild that they've got a very limited semantics. The scope of a real-time Analytics system is often restricted by the need to keep it as real-time as possible, so users will know pretty precisely what it is they need to get out of the structure: either counts or similar aggregates of otherwise "dumb" events, or large meaty chunks of text they're using the dimensional structure of the warehouse to search for. Similarly, Master Data Management (MDM) systems are high-latency and high-complexity because they're curatorial in nature, combining lots of data sources and sitting necessarily downstream of the analysis required to prioritize records. While an MDM system may not require any actual human intervention in its insert and update policies, it does require a great deal of care in its load. Data Lakes, on the other hand, are very low-latency and also very low-complexity, being simple (and useful) but un-integrated copies of source tables.
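As a sketch of how narrow that real-time semantics tends to be, assuming hypothetical event records: a tightly-scoped real-time structure often reduces to maintaining a handful of running aggregates, and nothing else.

```python
from collections import Counter

# Hypothetical stream of "dumb" events; a tightly scoped real-time analytics
# structure often amounts to simple aggregates maintained as events arrive.
counts = Counter()

def on_event(event: dict) -> None:
    # The semantics are fixed up front: count events by type, nothing more.
    counts[event["type"]] += 1

for e in [{"type": "click"}, {"type": "view"}, {"type": "click"}]:
    on_event(e)

print(counts.most_common())  # [('click', 2), ('view', 1)]
```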

Second, notice the dashed red line around every system in the ecosystem except the Services/Applications box at the bottom left. This line defines "the Data Warehouse" in the minds of most end-users in the organization. When the corporate hierarchy wants to spend money on a data warehouse they typically don't understand all of the use-cases. The CSRs may demand "analytics" when really they need an ODS, or they may ask for "operational" data when what they need is the ability to recommend a course of action, and thus an Analytics system. It's often the case that executives will demand clean, deduplicated order data in their data warehouse without understanding that they need a mastering process - an MDM system, in other words. All of these use-cases get glopped together as a "data warehouse," and so the nimble practitioner will need to make more precise definitions in their designs. An even finer gradation can be found when people talk about the ODS versus Analytics systems. The latter use a style of data modeling derived from Ralph Kimball's approach, also known as star schema design, and so Analytics systems may also be safely and cleanly called "Kimball-style warehouses." The ODS uses a style of design derived from Bill Inmon, and so can safely be called an "Inmon-style warehouse." In both cases, for both Inmon and Kimball warehouses, some overlap with an MDM system is called for in order to keep your dimensions or entities clean - remember the ladders we get to keep, above? - and so sometimes your warehouse project, Inmon or Kimball-style, will also need some MDM.
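To make the Kimball-style shape concrete, here's a minimal star-schema sketch in pandas, with hypothetical table and column names: a central fact table joined out to a dimension and then aggregated.

```python
import pandas as pd

# Hypothetical star schema: a central fact table keyed to dimension tables.
dim_customer = pd.DataFrame({
    "customer_key":  [1, 2],
    "customer_name": ["Acme", "Globex"],
    "region":        ["East", "West"],
})
fact_order_line = pd.DataFrame({
    "customer_key": [1, 1, 2],
    "product_key":  [10, 11, 10],
    "quantity":     [5, 2, 7],
    "revenue":      [50.0, 30.0, 70.0],
})

# The classic Kimball query shape: join facts to a dimension, then aggregate.
report = (fact_order_line
          .merge(dim_customer, on="customer_key")
          .groupby("region")["revenue"]
          .sum())
print(report)  # East: 80.0, West: 70.0
```

The Inmon-style alternative would carry customer, order and product as separately normalized entity tables and assemble them at query time; the difference is in the model, which is exactly the point of the next paragraph.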

What this means in practice is you can never really tell which use-cases have been solved by an organization's data warehouse until you look at the data model. Never assume a data warehouse is an Inmon- or Kimball-style system, satisfying either the ODS or Analytics use-cases, nor that it's an MDM or Data Vault system. In practice a diligent if untutored data warehouse team will mix all sorts of solutions into their systems as they try to respond to their internal clients. The point is: don't trust the terms, trust the models. A "streaming analytics" platform is rarely more than a collection platform, and thus a source system, and a warehouse could be any number of things. Use the data model to determine which use-cases are solved.

Third, and finally, if you're a Big Data person you may be looking at this ecosystem and noting there are no "Big Data" systems anywhere on it. Where are the streaming analytics platforms or Hadoop clusters? How is this not just an old-and-busted diagram of tired relational concepts that can't handle the vast quantities of data generated by genomic analysis, or the velocity of the modern data scientist? Or even modern data scientists doing modern genomic analysis?

Data management is semantic management. This point is important for data scientists to remember, but also for everyone else. However an organization acquires its data, and whatever the tools it uses to manage and manipulate it, over and above any other considerations of toolset or organizational expectations, the data management practitioner is responsible for creating a consistent semantics out of the data manufactured by the organization. More precisely, we're still just dealing with records, and records come from somewhere, and someone expects them to contribute to a picture of the world. It's almost a certainty that someone needs that data to do their job. If not - if the data is not about something - then there simply wasn't any point in collecting it in the first place. Now, the source of those records might be a device for analyzing methylated DNA, or signals from a bubble-chamber particle detector, or sensors reading millisecond variations of temperature from the inside of a rocket booster. The records might be collected by a "streaming analytics" platform and spun out as log files. But those records will take a determinate form, and the fields that make up those records will need to be reconstituted for re-presentation and analysis, whether the target audience is a machine-learning algorithm, someone sitting at a screen waiting to push a button, or a C-level executive floating a hypothesis. The reorganizations of those records will take certain forms. The patterns we use to organize the data making up those records are not new, even if the volume is always hitherto unimaginable.
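As a sketch of that reconstitution, assume a hypothetical device log where each line carries a sensor id, a timestamp and a reading; the fields have to be pulled back into determinate records before anyone downstream can use them.

```python
from datetime import datetime

# Hypothetical log lines spun out by a collection platform.
raw = [
    "sensor=booster-7 ts=2024-04-15T10:00:00 temp_c=412.6",
    "sensor=booster-7 ts=2024-04-15T10:00:01 temp_c=413.1",
]

def parse(line: str) -> dict:
    """Reconstitute a log line into a determinate, typed record."""
    fields = dict(pair.split("=", 1) for pair in line.split())
    return {
        "sensor": fields["sensor"],
        "ts": datetime.fromisoformat(fields["ts"]),
        "temp_c": float(fields["temp_c"]),
    }

records = [parse(line) for line in raw]
print(records[0])
```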

So we will still need to master the agents on those records, whether the agent is a customer busily buying products or a device that occasionally peeps out a temperature reading. We need to master agents because some of our devices will generate bad agent data, perhaps even by accident. But we still want to know how many signals we're getting from our agents, what their quality is, and what the agents are telling us. That requires a Master Data Management system somewhere in our ecosystem. We have use-cases where we need to know the last reading from each sensor or customer. We could of course just throw everything into one giant pile and search trillions of irrelevant records for the latest, throwing ever more CPU at the colossal stack because the CPU fairies are always willing to help. Or we could use an ODS structure to avoid relying on fairies. The pieces of the ecosystem we cover in this section are patterns of organization that evolved to meet specific kinds of use-cases, and those use-cases don't disappear because volume has increased an order of magnitude or three since they were introduced.
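Here's a minimal sketch of that last-reading pattern, with hypothetical readings: keep one current row per agent, updated in place, rather than searching the giant pile every time someone asks.

```python
from datetime import datetime

# Hypothetical readings; in practice these arrive as a stream or log extract.
readings = [
    {"sensor": "booster-7", "ts": datetime(2024, 4, 15, 10, 0, 0), "temp_c": 412.6},
    {"sensor": "booster-7", "ts": datetime(2024, 4, 15, 10, 0, 1), "temp_c": 413.1},
    {"sensor": "booster-9", "ts": datetime(2024, 4, 15, 10, 0, 1), "temp_c": 19.4},
]

# The ODS pattern: one current row per agent, updated in place, instead of
# scanning the entire history for the latest record on every question.
latest = {}
for r in readings:
    prior = latest.get(r["sensor"])
    if prior is None or r["ts"] > prior["ts"]:
        latest[r["sensor"]] = r

print(latest["booster-7"])  # the current reading, no full scan required
```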

Data volumes have always grown at breakneck speeds. The current context appears daunting, but it's no more than a change in degree from where we were ten or even fifty years ago. Vendors may have exciting new mechanisms for managing and indexing enormous volumes of records, and we should certainly learn them. But there are still only a few ways of organizing those records, and the data that makes them up, that work. At some point those patterns may not hold, and new ones will be necessary, in which case we'll need a new edition of this book and some new ways of thinking. But at the moment we're still talking about organizing records, and the uses organizations have for those records, and matching the use-case to the organizing pattern.

The meta-models we'll look at in this section are evolutionary. They can be used to organize data from systems small and large, whether gigantic logging frameworks or a small firm's applications. Not all of the patterns will be useful in every local case. But there aren't that many of them, and it's worth learning them all.


Monday, April 8, 2024

3.6 Patterns of Organization

There are a few simple patterns you should consider when organizing your data management efforts for maximum impact. In this section we'll discuss some of those patterns of organization, look at a couple of different organizational models, and point to some fundamental properties of those models that help make teams effective.

Monday, March 25, 2024

4.6 Master Data Management

It's a basic fact of the modern data ecosystem that critical data about critical entities will be duplicated and, most importantly, different, simply because that data is created in different source systems. This happens in the simple case where a customer interacts with multiple applications and each application creates its own CUSTOMER record. In the more complex case, two applications are downstream of a third, and copies of source records are sent to the downstream systems, where they inevitably get updated or supplemented. Master Data Management, or MDM, is the process of creating entities and re-syncing records with the real-life things they represent. MDM is in one sense a brute-force solution to the governance problems caused by incompatible data models and inconsistent form validation. In another sense, MDM is the pragmatic connection between a data management ecosystem and the real world. And in a third sense, it's one of those critical layers we see in good data management practice.
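Here's a deliberately naive sketch of the match-and-merge step at the heart of mastering, with hypothetical source records; real MDM systems use far richer match rules and survivorship policies than "same email" and "first non-null wins."

```python
# Hypothetical CUSTOMER records from two source systems.
records = [
    {"source": "crm",   "email": "jo@example.com", "name": "Jo Smith", "phone": None},
    {"source": "store", "email": "jo@example.com", "name": "J. Smith", "phone": "555-0100"},
    {"source": "crm",   "email": "al@example.com", "name": "Al Jones", "phone": "555-0199"},
]

# Naive match rule: same email means the same real-world person.
masters = {}
for rec in records:
    key = rec["email"].lower()
    golden = masters.setdefault(key, {"email": key, "name": None, "phone": None})
    # Naive survivorship: first non-null value wins for each attribute.
    for attr in ("name", "phone"):
        if golden[attr] is None and rec[attr] is not None:
            golden[attr] = rec[attr]

for master in masters.values():
    print(master)  # one "golden record" per real-world customer
```

Everything that makes real mastering hard lives in those two "naive" comments: deciding when records match, and deciding which source wins.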

In this section we'll discuss the process of mastering, which is basic to all pipeline development, and how that process eventually gets turned into the components of an MDM system. We'll also walk through some of the use-cases, including the specific kinds of entities commonly managed in an MDM system. Finally, we'll explain how MDMs should be integrated into the rest of the ecosystem.

Monday, March 18, 2024

2.2 Applications change over time

We mentioned above that applications evolve. That is, over time an application will change the way it collects data, the kind of data it collects, and the way it stores that data.

Applications will stop evolving for a variety of reasons: Key developers move on, the organizational context becomes fixed or mature, or there’s a plan to replace the application with something new and shiny and no one wants to invest in change to the old application. Stable, happy applications that continue to generate data day-in-and-day-out are the best kind, like the proverbial server absent-mindedly shut up behind drywall that continues to faithfully serve out web pages without a complaint. But this is unfortunately a rarity, and most applications continue to evolve.

Tuesday, March 12, 2024

3.1 Codd's Essential Insight

We're going to expend some considerable energy on this section, because Codd's insight will change your life. It's not difficult to implement, but the principle is rarely applied as pervasively as it should be, and as a result much of the work that's done in data management is wasted when it could be productive.

Monday, March 4, 2024

3.2 Layers

The section below is part of a chapter laying out the patterns by which we organize data functionally across the ecosystem; it tries to explain the use of layers.

I don't know if it does a good job. I think there's a lot of value in the presentation of layers but I'm really not satisfied with the chapter. The practical and metaphysical points are explained well enough but they don't coincide. In any event, the chapter is presented as is.

3.2 Layers

In this section we'll expand on Codd's insight in two directions. First, we'll add detail to his insight with some examples that illuminate the basic intuition behind normalization, which is primarily a practice of local optimization within a data ecosystem. Second, we'll discuss layers, which is how that philosophical point gets operationalized when local optimizations are organized globally, into a data architecture.
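To anticipate those examples, here's a minimal sketch of the normalization intuition, with hypothetical rows: state each fact exactly once and reference it by key.

```python
# Hypothetical denormalized rows: the customer's name repeats on every order.
flat = [
    {"order_id": 1, "customer_id": 101, "customer_name": "Acme", "total": 50.0},
    {"order_id": 2, "customer_id": 101, "customer_name": "Acme", "total": 30.0},
]

# Normalization as local optimization: each fact is stated exactly once.
customers = {row["customer_id"]: row["customer_name"] for row in flat}
orders = [{"order_id": row["order_id"],
           "customer_id": row["customer_id"],
           "total": row["total"]} for row in flat]

print(customers)  # {101: 'Acme'} - the customer fact, stated once
print(orders)     # order facts, referencing the customer by key
```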

Monday, February 26, 2024

2.1 There are a lot of sources

One reason data management is hard is the sheer diversity of sources data management practitioners are expected to rationalize. Consider a typical Customer-360 program. The goal of a Customer-360 program is to collect all of the data in a given enterprise about a customer, organize it into a matrix of some sort, and make it available for eventual consumers throughout the organization. Those sources often include:

