Monday, March 4, 2024

3.2 Layers

The section below is part of a larger discussion laying out the patterns by which we organize data functionally across the ecosystem; it tries to explain the use of layers.

I don't know if it does a good job. I think there's a lot of value in the presentation of layers but I'm really not satisfied with the chapter. The practical and metaphysical points are explained well enough but they don't coincide. In any event, the chapter is presented as is.

3.2 Layers

In this section we'll expand on Codd's insight in two directions. First, we'll add detail to his insight with some examples that illuminate the basic intuition behind normalization, which is primarily a practice of local optimization within a data ecosystem. Second, we'll discuss layers, which is how that philosophical point gets operationalized when organizing local optimizations globally, into a data architecture.

If there is a critical intuition underlying all architectural practice, whether it's concrete architecture like dwelling design or abstract architecture like database design, it's that functions are best performed by mechanisms or structures purpose-built to perform those functions. There's a temptation among the inexperienced or philosophically naive to assume that if only one size could be made to fit all, peace would reign across the land and provisioning software and housing would be rendered a simple mechanical problem amenable to robotic solutions. But while the idea that different functions require different designs to perform them properly may seem so obvious it must be a tautology, an empirical a priori, this observation is the key to making anything but the simplest systems work. In software architecture the principle is encapsulated in the phrase separation of concerns. Separation of concerns has different implications in different contexts and applications. In data management terms the principle suggests the data architect should avoid putting all of an organization's data into one single pile. There are a number of reasons for this, and they are worth exploring.

Before exploring separation of concerns as it applies to data management, though, we should address the sense in which this notion seems to contradict the entire prior section. Didn't we just spend a few pages describing Codd's essential insight, which revolved heavily around his early work on the reduction of complexity, and its implications for data management, namely that you should make do with fewer things rather than more? Codd's insight may appear to contradict separation of concerns, but only to the extent that the reader misunderstands the process behind reductionism. Codd and other reductionists (in e.g. mathematics, physics, or logic) may have, as their goal, the performance of all functions in a domain with one single mechanism, but in practice this goal runs into the basic chunkiness of reality.

Codd found that he could describe the construction of cellular automata with eight states, an improvement over von Neumann's original proposal of 29 states. But this is an honest reduction. Codd isn't throwing out functions that can't be performed by his eight states, and although there may be arguments about the efficiency with which a particular mechanism performs a particular function, in the case of cellular automata and mathematics in general these schemes can be proven complete and sound, or syntactically and semantically complete. It may be that the same cannot be said for data management systems proposed for specific business problems. But the game in state-reduction is to perform functions with as few states as possible, not to artificially constrain the semantics to force-fit, in a Procrustean way, the problem to the solution. Thus while Codd's essential insight could be glossed as "perform your data management with as few entities as possible," architectural practice requires we first figure out which entities are fundamental and irreducible to each other. In data management, then, by all means: if a vendor and a customer have similar attributes, you can call them both "agents." But architectural best practice argues that where vendors and customers perform different functions, those functions must be managed differently and not artificially conflated or ignored. Similarly one cannot call agents and purchases the same thing, even though they both possess "attributes" and "relationships." Building a structure that treats agents and purchases as the same thing makes it difficult for that structure to store agent-specific data and purchase-specific data.

Putting all the data in one pile - say in one giant table, perhaps as one big Entity-Attribute-Value table (hereafter: "EAV") - makes it difficult to propose a heuristic for finding data pertaining to one kind of thing versus another. In an EAV, we describe every distinct kind of thing we need to store data about, whether customer, vendor, or purchase, as an "entity." Suppose our EAV table is just three fields: ENTITY, ATTRIBUTE and VALUE. ENTITY names the thing we're storing data about, ATTRIBUTE stores the name of an attribute of that entity, and VALUE stores the value of that attribute for a given instance of the entity. The resulting structure abstractly stores everything we need about our vendors, purchases and customers in one big table, with one row for every combination of entity, attribute and value. Querying a large pile for different kinds of things may seem to be a simple solution to the question of how to go about doing data modeling. The EAV seems to locally optimize storage of data about all kinds of customers, vendors and purchases, and the approach has become very popular with certain kinds of "big table" and NoSQL architectures. It seems like an elegant and simple reduction of the diversity of life to a single abstract structure.
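To make the structure concrete, here is a minimal sketch of such a table in generic SQL. The table name, column sizes and sample rows are purely illustrative, and a real EAV would also need some kind of instance identifier to tell one customer's rows from another's:

-- A minimal EAV table: every fact about every kind of thing lands here.
CREATE TABLE EAV (
    ENTITY    VARCHAR(100),   -- the kind of thing the row describes
    ATTRIBUTE VARCHAR(100),   -- the name of one of that thing's attributes
    VALUE     VARCHAR(4000)   -- the attribute's value, stored as text
);

-- One customer and one purchase become several rows each, one per attribute.
INSERT INTO EAV (ENTITY, ATTRIBUTE, VALUE) VALUES
    ('CUSTOMER', 'CUSTOMER_ID', '1001'),
    ('CUSTOMER', 'NAME',        'Acme Corp'),
    ('CUSTOMER', 'SIGNUP_DATE', '2023-06-01'),
    ('PURCHASE', 'PURCHASE_ID', '555'),
    ('PURCHASE', 'AMOUNT',      '49.95');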

Except that one needs to make sure that all customer-related data is similarly tagged in exactly the same way. This is the fatal flaw of the EAV approach to locally optimizing storage. For example, one set of developers will build an application that creates an ENTITY for customer data and name their entity “CUSTOMER.” Another set of developers will create a record for their customer data, but their ENTITY is called “CUSTOMERS,” and a third will name their customer entity “PARTY” and a fourth “CUST.” All four data sets are stored in the same EAV pile, each a different kind of entity, but all about customers. Any query looking for customer-related data will need to account for all four possible descriptions of the customer, with a query that looks something like:


WHERE ENTITY IN ('CUSTOMER', 'CUSTOMERS', 'PARTY', 'CUST')


In an organization with perfect communication this wouldn’t be much more than an inconvenience. But we don’t live in that perfect world, and if the reader does live in such a world, we’d like an invitation. In the real world one or more of those ENTITY values will be unknown to the data management practitioner, just by Murphy’s law alone. Any real-world query that counts customer records will be insufficient without comprehensive communication. Those issues are easily preventable, however, by separating out the data related to customers from all the other data, creating a CUSTOMER domain that stores all the data related to customers, whether developers prefer to call their entity CUST, PARTY or CUSTOMERS. 
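By contrast, once customer data lives in its own purpose-built structure there is nothing to guess about. A sketch, with a hypothetical table name:

-- Counting customers against a dedicated customer structure requires no
-- knowledge of what each development team chose to call its entity.
SELECT COUNT(*) AS CUSTOMER_COUNT
FROM CUSTOMER;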

This intuition, again, is referred to as "separation of concerns": one should avoid co-locating different concerns within the design or code, where a "concern" is a function or purpose. The principle applies generally. (This intuition also feeds, after the fact, the phenomenon of "data governance," of which more in later sections.) There are of course exceptions to the intuition, and the reader undoubtedly has their local favorites. But as a rule, when designing a data management solution, it is best to put the data about one kind of thing in one box, and the data about another kind of thing in a different box.

Putting everything in one pile implies that the pile itself is one-size-fits-all. One standard question faced by beginner data modelers who discover the EAV pattern - often as though they've found gold coins lying in the street that everyone else thinks are mere pennies - is how to store, for each possible attribute, the various datatypes those attributes might use. How will an EAV store every attribute if there's only one VALUE field? Isn't there a need for VALUE_VARCHAR and VALUE_FLOAT and VALUE_DATE, at the very least? In the spirit of Codd's original thesis, perhaps the EAV model could be limited to just those three datatypes, one for character data, one for numeric data and a third for dates. Or perhaps the model could be made even more abstract: everything is stored as a string, in a giant VARCHAR field (similar to a BLOB but varying in size from one character to the text of Gravity's Rainbow), with some indicator of the datatype so that when results are returned the data management practitioner knows that some results can have date arithmetic performed on them, for example, and others simple arithmetic. But modern database platforms such as Snowflake already manage this problem internally: the platform handles the physical representation of the data, and each field carries an intended interpretative semantics (as dates, numbers, characters or other types) so the database engine can perform the appropriate operations on it. This basic fact about storage platforms, which the beginner EAV enthusiast may not be aware of, underlies the primary argument against using an EAV: the pattern replicates functionality the database already provides and improves upon. It's a fair question why a data management team would consciously strip their expensive database platform of most of its functionality.
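A sketch of the "everything as a string, plus a datatype indicator" variant makes the duplication plain. The names here are hypothetical and the cast syntax is generic SQL:

-- EAV variant with a datatype indicator; every VALUE is still stored as text.
CREATE TABLE EAV_TYPED (
    ENTITY    VARCHAR(100),
    ATTRIBUTE VARCHAR(100),
    VALUE     VARCHAR(4000),
    DATATYPE  VARCHAR(20)    -- 'VARCHAR', 'FLOAT', 'DATE', ...
);

-- Every consumer must re-apply the typing the database would have enforced for free.
SELECT CAST(VALUE AS DATE) AS SIGNUP_DATE
FROM EAV_TYPED
WHERE ENTITY = 'CUSTOMER'
  AND ATTRIBUTE = 'SIGNUP_DATE'
  AND DATATYPE = 'DATE';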

Separation of concerns then argues that if the data management practitioner needs to store timestamps then timestamps should be stored, and not timestamps-as-strings with downstream conversion. And it argues as well that if customers and vendors have different attributes, then they must be treated as different entities.

The metaphor of Layers expands on this concept of separation of concerns. When copying data from source systems into models used by other consumers of that data, there are different operations that must be performed on the road from copy to use. The most typical consideration is integration of data, or for example merging records from a CUSTOMER table with those from PARTY, CUSTOMERS and CUST source tables to obtain a coherent view of customers. Before integration there is deduplication, or ensuring that a given record hasn’t been copied twice. These two different operations are two distinct concerns, and must be separated. While they might be merged functionally with some clever programming into a single meta-operation that both deduplicates and integrates, they describe two very different sets of logical operations. The rules for one will be very different from the other. In a good data architecture these operations will be performed in separate layers.

When figuring out which layers you need for any given ecosystem it helps to think in terms of establishing a division of labor in the various operations between copy and use. For data, a very basic set of logical steps or layers proceeds as follows:

  1. Deduplicate copies of objects obtained from source systems

  2. Integrate source data into entities

  3. Reconstruct processes from entities

  4. Report on reconstructed processes

In practice we name these layers as follows:

  1. RAW

  2. ENTITY

  3. PIVOT

  4. REPORT

Additional layers may add value. In practice we find it useful to separate the deduplication operation from the process of copying records from source, for example, and so we’d maintain an “ingress” layer that just receives the most recent copied records, and a “raw” layer that collects those copied records into a deduplicated whole. We might also distinguish the reporting function from the construction of objects suitable for consumption tools, using a “report” or “egress” layer for the latter and a dedicated visualization layer where those report objects are displayed and acted on by end-users. But you do yourself an enormous service by separating your data concerns into at least the four layers above.
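One common way to make these layers physically explicit, though certainly not the only one, is to give each layer its own schema in the warehouse. A sketch, using the layer names above:

-- Each layer gets its own purpose-built space.
-- (On some platforms a name like PIVOT may collide with a keyword and need
-- quoting or a different spelling.)
CREATE SCHEMA RAW;     -- deduplicated copies of source records
CREATE SCHEMA ENTITY;  -- source data integrated into entities
CREATE SCHEMA PIVOT;   -- business processes reconstructed from entities
CREATE SCHEMA REPORT;  -- subsets of processes shaped for reporting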

In the RAW layer, records copied from source systems are carefully deduplicated. We want to identify what counts as a good record, and in deduplication we make a specific semantic decision about what in our source data constitutes a unique record. If a record recently received from a source is said to be a duplicate of an existing record, then we have some logical identity conditions that allow us to determine that the existing record means the same thing as the recently arrived one. We decide, specifically, that the newly arrived record is not in fact new because it means the same thing as a record of the same kind we already have.
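As an illustration of the kind of decision this involves, here is a sketch of RAW-layer deduplication in generic SQL. The identity condition (same CUSTOMER_ID from the same source system), the INGRESS layer and all of the names are hypothetical:

-- Keep one representative row per stipulated identity condition: the most
-- recently loaded copy of each (source system, customer id) pair.
CREATE TABLE RAW.CUSTOMER AS
SELECT *
FROM (
    SELECT i.*,
           ROW_NUMBER() OVER (
               PARTITION BY SOURCE_SYSTEM, CUSTOMER_ID
               ORDER BY LOAD_TIMESTAMP DESC
           ) AS RN
    FROM INGRESS.CUSTOMER i
) dedup
WHERE RN = 1;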

In our entity integration layer we shift our value-add operations. We assume our records are well-defined in RAW, and that each record is a discrete, unique record. Now we want to assemble a more complete picture of the entities those RAW records purport to represent. A given record may contain attributes about more than one kind of entity or thing, and in our ENTITY layer we reorganize those attributes of clean deduplicated RAW records into clean deduplicated records about those entities.

This step is often left out of modern data architectures, particularly the folk versions. But remember that our source systems may represent our customers variously as, for example, CUSTOMER, PARTY or CUST. Each has different attributes and different frameworks or concepts for dividing up the attributes of our customers. In one case we may have a RACE attribute on the CUSTOMER record; in CUSTOMERS we may have EYE_COLOR but no RACE; and in PARTY neither attribute appears on the record itself, but each shows up in its own RACE or EYE_COLOR table cross-referenced with the PARTY table. In the ENTITY layer we want to organize and sort those attributes into a proper, non-folk, non-local model. We need to abstract our concept of "customer" across all of those original versions into a clean table so we can access all the variety of the sources. This is best accomplished in its own purpose-built space, in the ENTITY layer.
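A sketch of what that ENTITY-layer integration might look like for this example. The source shapes follow the description above, the column names and keys are hypothetical, and a real implementation would usually mint a surrogate customer key rather than reuse source keys directly:

-- ENTITY.CUSTOMER: one abstracted customer shape assembled from three source shapes.
CREATE TABLE ENTITY.CUSTOMER AS
SELECT CUSTOMER_ID AS CUSTOMER_KEY,
       'CUSTOMER'  AS SOURCE_ENTITY,
       RACE,
       NULL        AS EYE_COLOR
FROM RAW.CUSTOMER
UNION ALL
SELECT CUSTOMER_ID, 'CUSTOMERS', NULL, EYE_COLOR
FROM RAW.CUSTOMERS
UNION ALL
SELECT p.PARTY_ID, 'PARTY', r.RACE, e.EYE_COLOR
FROM RAW.PARTY p
LEFT JOIN RAW.RACE      r ON r.PARTY_ID = p.PARTY_ID
LEFT JOIN RAW.EYE_COLOR e ON e.PARTY_ID = p.PARTY_ID;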

Once we've sorted and organized our ENTITY layer we come to the need to represent the original business process, the cause of the real-world events that created our data. Our source CUST, CUSTOMER and PARTY tables were populated with data that resulted from some specific business process, for example order taking or marketing. The data collected in those tables plays a specific role within that business process. To reconstruct that business process we need to pivot across the entities we've re-created in our ENTITY layer, to enable us to see the business process in total. This happens in a third layer, which we'll call PIVOT for short. A single entity can be used for multiple process pivots. Ideally we can identify a single customer across, for example, both the order management and marketing processes, so we can market more appropriately to that customer. But this pivot operation is, again, a discrete set of operations. It may be possible functionally to create a deduplicated set of process records at the same time as we create a clean and deduplicated set of entity records, but the rules are different in each case. How we reconstruct a business process across various entities is a very different set of concerns from the creation of those entities themselves. Thus the reconstruction of the original business process requires its own layer.
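A sketch of a PIVOT-layer reconstruction, assuming (hypothetically) that the ENTITY layer also carries ORDERS and MARKETING_TOUCH entities keyed to the same abstracted customer:

-- PIVOT.CUSTOMER_JOURNEY: see the marketing and order-taking processes in total,
-- pivoted around the single customer reconstructed in the ENTITY layer.
CREATE TABLE PIVOT.CUSTOMER_JOURNEY AS
SELECT c.CUSTOMER_KEY,
       m.CAMPAIGN_ID,
       m.TOUCH_DATE,
       o.ORDER_ID,
       o.ORDER_DATE,
       o.ORDER_AMOUNT
FROM ENTITY.CUSTOMER c
LEFT JOIN ENTITY.MARKETING_TOUCH m ON m.CUSTOMER_KEY = c.CUSTOMER_KEY
LEFT JOIN ENTITY.ORDERS          o ON o.CUSTOMER_KEY = c.CUSTOMER_KEY;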

Finally, reporting on the business process is itself a concern. A given reconstructed business process - ordering or marketing, to recur to our examples - may involve a long period of time, across multiple entities, with many attributes. But the business people rarely need to see the entire business process or all the entities involved in that process end-to-end, and may in fact need to see only specific legs of the process to do their jobs. We may need to evaluate handoffs from one business unit to another, for example, or evaluate performance of specific points in the process. In such cases we need a firm process reconstruction to stand on, in our PIVOT layer, but our reporting layer is only interested in a subset of the full process reconstruction. This data needs to be clean and deduplicated as well, and integrated. But the concern is obviously different. The REPORT layer serves a different function, and the operations involved in identifying and representing the relevant parts of the business process reconstructed in PIVOT are decidedly different than they are in the reconstruction process itself.
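And a sketch of a REPORT-layer object that carves one leg out of that reconstruction, again with hypothetical names:

-- REPORT.MARKETING_TO_FIRST_ORDER: only the handoff from first marketing touch
-- to first order, not the end-to-end process.
CREATE VIEW REPORT.MARKETING_TO_FIRST_ORDER AS
SELECT CUSTOMER_KEY,
       MIN(TOUCH_DATE) AS FIRST_TOUCH_DATE,
       MIN(ORDER_DATE) AS FIRST_ORDER_DATE
FROM PIVOT.CUSTOMER_JOURNEY
GROUP BY CUSTOMER_KEY;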

One of the high-level patterns common to good data management practice, then, is the use of layers. Where a separate concern or set of operations can be identified, the practitioner is advised to create a layer for it. The four most basic layers in data management are source deduplication, entity integration, process reconstruction and process reporting. There are other possible layers as well, but these four are fundamental and will take the practitioner a long way.

As we proceed through our discussion of the various components of the data management ecosystem below - the set of organizational systems that have evolved to arrange records for particular use-cases - we'll come to see those components as performing functions analogous to our layers as well. The difference between a source system and an Analytical data warehouse or Kimball-style warehouse is a separation of concerns: the source system is concerned with collecting data, and the Analytical warehouse with making it easy to analyze business processes across time. These are two very different functions, and so they match our definition of the utility of Layers. The author has no particular argument against this idea. If you prefer to view the various kinds of systems used to meet particular business use-cases as Layers - the ODS as one layer in your ecosystem, and the MDM system as another - then you'll get no argument from me. It's clear that these do function as layers.

What can be confusing with layers in data management is the relationship of records across layers. The analogy between data management layers and an assembly line, where each layer adds value to the output of the previous one, seems obvious, but it begins to fall apart. Our understanding of what's in our source data must occur in the context of other layers. We may discover, only when we use it in the PIVOT layer, that an attribute in a record we've diligently managed in the RAW layer has been misfiled in the ENTITY layer - that it belongs to two or more entities instead of just one. This happens all the time. Further, we may discover in our REPORT layer that we've deduplicated too stringently, that perhaps we're missing an attribute in our identity conditions, or some twist of logic, or we've just made a mistake. This also happens frequently.

Where the assembly-line metaphor gets stretched is when we think of layers as necessarily precedent or subsequent, that in the data management world the division of labor doesn’t go both ways. When we make cars we can tell our doors have been hung wrong long before the end stage. Either a door fits into its frame at the point at which it’s hung onto the frame or it doesn’t, because doors are designed to fit into cars. In data management our division of labor is functional, and not practical. We dedupe in our RAW layer in the context of the records we get from source. We generally don’t understand the grand design to the source records until we get to the PIVOT layer, when we see how attributes are used across business processes. Whereas with a car door there is a local design, some practical tolerances and assembly instructions by which a collection of parts becomes a door, it is very often the case that the sources of data we collect in our RAW layer evidence no grand design. These records are obviously part of a grand design, as doors are part of the car. But we don’t get an obvious fit until we can reconstruct the whole. The analogy between data management layers and assembly lines breaks down because we really can’t tell whether (for example) a record has been deduped correctly until we try to reconstruct the process it was created in, whereas when we add a door to a car frame it either fits right then or it doesn’t. 

Suppose we have two separate CRM systems, designed by different people and hosted by different providers, originally for separate business models and different customer sets. Over time these systems will begin to overlap in the business processes they store, and sets of attributes from one process will get repurposed for use in another. When we go to reconstruct the business processes collected by the two different applications we may know what the conscious workarounds were, but we won't see that reflected in the identity conditions, attributes or keys available to us in either system. Whereas a door fits on a car or doesn't, and if it doesn't then some particular quality anomaly upstream interfered with the intention, there is no quality anomaly to point to that automatically makes our two CRMs' data models reconcilable. They may be more or less about the same thing, eventually. But it is only because we know they are about the same thing that we are able to make the entities fit together. The schemas they use, their ontologies, the way they break up the world, the way they identify records as the same or different, the way they identify entities as the same or different, are different across the two systems. We can stipulate that logic, test it in various circumstances, and adjust our operations in each layer according to the test results. But we still must look at our layers holistically, at how they contribute to the answer we already know exists, to obtain the final coherent view. In practice we can't simply assume deduped RAW records will naturally and easily fit together into a coherent business process. We must define how source records fit together knowing that they should. We must construct the particular operations of each layer intentionally, and not count on coherence emerging without our help.

Layers operationalize separation of concerns by allowing us to locate specific data management functions discretely. There's coherence across all of our data, but that coherence can't emerge on its own, and it often can't be located in the logical unity of individual records. We use layers to manage the complexity of coherence. Every data management ecosystem should approach the problem of emergent coherence with layers purpose-built for this emergence.
